harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License
63 stars 30 forks source link

option for allele merging - VCF2TileDBException #163

Closed gghill closed 2 months ago

gghill commented 3 months ago

Hi there, Thanks for developing and supporting such an efficient and self-contained pipeline! My most recent run (94 samples) failed at gvcf2DB with an error (log 0647.txt):

terminate called after throwing an instance of 'VCF2TileDBException' what(): VCF2TileDBException : Incorrect cell order found - cells must be in column major order. Previous cell: [ 33, 487571965 ] current cell: [ 33, 487571965 ]. The most likely cause is unexpected data in the input file: (a) A VCF file has two lines with the same genomic position (b) An unsorted CSV file (c) Malformed VCF file (or malformed index)

A quick search suggests this is a common issue with GATK that can be addressed by running bcftools norm to merge the alleles at the same position prior to GATK. Is there any way to do this or otherwise address this issue in the snpArcher workflow (such as by manually fixing the vcfs and handing them back to the pipeline)? Thanks for your help, very excited to be close to a finished run and first results.

cademirch commented 3 months ago

Interesting, I think this may have come up before some time ago but was never documented, so thank you for opening an issue. I've made a branch called bcftools_norm and added a rule to normalize the merged gvcfs before the db step, using the command suggest here. Try this and see if it works.

Its interesting that this is occurring given how we concat gvcfs: https://github.com/harvardinformatics/snpArcher/blob/5b479be1753c731e1d7bf22f42f703d51ef7dcb9/workflow/rules/bam2vcf_gatk_intervals.smk#L60

bcftools concat docs say:

-d, --rm-dups snps|indels|both|all|exact
Output duplicate records of specified type present in multiple files only once. Note that records duplicate within one file are not removed with this option, for that use [bcftools norm -d](https://samtools.github.io/bcftools/bcftools.html#norm) instead.
In other words, the default behavior of the program is similar to unix "cat" in that when two files contain a record with the same position, that position will appear twice on output. With -d, every line that finds a matching record in another file will be printed only once.
Requires -a, --allow-overlaps.

-D, --remove-duplicates
Alias for -d exact

Not sure what exactly bcftools norm is doing that concat -a -D isn't, but its too late to dig into this now...

I'll try to take a look at this with some fresh eyes this weekend.

gghill commented 3 months ago

Hi Cade, Thanks for the quick response. From the documentation you posted, the only sticking point I can see (as a definite non-expert) is the removal of "exact" duplicates, maybe inexact matches are slipping through but somehow still too overlapping for GATK? I have tried another run with the branch version of the workflow, as you suggested, and while the error is different, the problem seems similar (log1, log2) as now in DB2vcf I get an error concerning overlapping variants:

terminate called after throwing an instance of 'VariantQueryProcessorException' what(): VariantQueryProcessorException : Unhandled overlapping variants at columns 519969767 and 519969768 for row 10

followed (and preceded) by a number of lines of Benchmark: unable to collect cpu and memory benchmark statistics but that doesn't seem as related (though it is the first time I've seen it).

It seems like it doesn't fail for all imported .tar files, the same genome is being used for all samples, and there hasn't been any indication so far that individual samples are problematic (though I'm not sure whether at this stage the input .tars are still individuals). All samples are present in the DB_mapfile. Please let me know if you have any advice or there is additional information I can provide.

cademirch commented 3 months ago

Shame that didn't work, looks like it moved the error one step downstream. It will be hard to debug this out without access to the problematic gVCFs unfortunately. I may suggest deleting or removing the problem gVCFs and letting the workflow remake them, perhaps something weird happened for a few while generating them.

@tsackton Have you seen this before?

tsackton commented 3 months ago

I have seen this once or twice before, usually associated though with some kind of file corruption. I'd suggest deleting the g.vcf and the interval g.vcfs and see if that helps. If not, it might reflect some problems with the merge step we'd need to investigate. You can also check and see if the file looks complete.

gghill commented 2 months ago

Just wanted to check back in to let you know after a couple false starts and a week or two of run time, the run completed using the bcftools_norm branch and a complete fresh run (changed the reference genome name to encourage it to start from scratch). The results look great, the interactive QC html is super useful and I look forward to digging in further. Thanks for your help :)

cademirch commented 2 months ago

Thanks for the update @gghill. Glad it worked. @tsackton should we merge the bcftools_norm? Seems like it might have been helpful in this case?

tsackton commented 2 months ago

Yes, that seems like a good idea