harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

Issue with problem intervals / gvcf2DB rule #56

Closed ArsenaultResearch closed 2 years ago

ArsenaultResearch commented 2 years ago

Hi, I've been using the snpArcher pipeline for some variant calling and have run into an issue that I'm not sure how to address. Most of the pipeline finished running, but it seems to get hung up on two of the intervals. Every time a gvcf2DB rule is attempted with either of these intervals, the job fails and the log shows this error:

htsjdk.samtools.SAMFormatException: Invalid GZIP header

I've checked the map files for each of these intervals and they seem to look identical to all of the other ones that have successfully run. I've also checked the outputs in the 03_gvcf directory for these intervals and they seem to have completed normally (".done" file and .raw.g.vcf.gz files present). Do you all have any idea what's going wrong or any suggestions on how I can fix this issue? Thanks for the help and for the really amazing tool! -Sam

tsackton commented 2 years ago

I have seen this occasionally as well. I think this happens if there is a rare I/O issue while the g.vcf.gz file is being written to disk, which results in that file getting corrupted. The only solution I have found is to manually delete the g.vcf.gz and .done files for the intervals that seem to cause problems, and restart the pipeline.

At some point I hope to add more error checking, to validate the .g.vcf.gz and the .bam files to prevent this issue.
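
For anyone who wants to find the affected files before deleting anything, here is a minimal sketch of a gzip integrity check in Python. It is not part of snpArcher; the function names `gzip_is_readable` and `find_corrupt_gvcfs` and the `*.g.vcf.gz` glob pattern are assumptions made for illustration. It only decompresses each file end to end with the standard-library `gzip` module, so it can catch bad headers, truncated files, and CRC failures, but it does not check that the contents are a valid gVCF.

```python
import gzip
import zlib
from pathlib import Path


def gzip_is_readable(path, chunk_size=1 << 20):
    """Return True if the whole file decompresses without error.

    Hypothetical helper, not a snpArcher function. bgzip output is a
    series of concatenated gzip members, which the gzip module reads
    transparently.
    """
    path = Path(path)
    # An empty file decompresses to nothing without raising, so reject it explicitly.
    if path.stat().st_size == 0:
        return False
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so that every member header and CRC is checked.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # BadGzipFile (an OSError) covers bad headers and CRC mismatches,
        # EOFError covers truncated streams, zlib.error covers corrupt data.
        return False
    return True


def find_corrupt_gvcfs(directory, pattern="*.g.vcf.gz"):
    """List files under `directory` matching `pattern` that fail the check."""
    return sorted(
        p for p in Path(directory).rglob(pattern) if not gzip_is_readable(p)
    )
```

Calling `find_corrupt_gvcfs` on the 03_gvcf directory mentioned above (the exact path depends on where your run writes its results) should return the files whose gzip stream cannot be read, which are the candidates for deletion and re-running.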

ArsenaultResearch commented 2 years ago

I also had to delete the DB_mapfile files for the intervals in question so that Snakemake would re-run the relevant variant calling step, but after that it all worked perfectly. Thanks so much!