vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.11k stars 195 forks source link

vg and SV insertions: cannot set SEQ[canonicalize] #2084

Open cgroza opened 5 years ago

cgroza commented 5 years ago

Please describe:

  1. What you were trying to do
  2. What you wanted to happen
  3. What actually happened
  4. What data and command line to use to make the problem recur, if applicable

Hi, I am trying to produce a graph that contains large insertions. I have a set of VCF calls that contains the contig name in the ALT field, and a corresponding FASTA file indicating the sequence to be inserted. When I run

vg construct -S -p -C -R chr21 -r Homo.sapiens.hg38.fa -v VariantCalls/NA12878_svs_fixed.vcf.gz -I VariantCalls/NA12878_sv_subseqs.fa

I get a lot of warnings such as

Warning: could not set SEQ [canonicalize]. chr21        13368468        NA19240_chr21-13368468-INS-279  G       NA19240_chr21-13340000-13400000|ctg7180000000002|quiver/0_89439 3.0     .       AC=0;AN=0;BKPTID=NA19240_chr21-13368468-INS-279;CONTIG=NA19240_chr21-13340000-13400000|ctg7180000000002|quiver/0_89439;CONTIG_DEPTH=6;CONTIG_END=43321;CONTIG_START=43042;CONTIG_SUPPORT=2;END=13368468;MERGE_AC=1;MERGE_AF=0;MERGE_SAMPLES=.;MERGE_SOURCE=.;MERGE_VARIANTS=.;MERGE_VARIANTS_RO=1;PUBLISHED_ID=NA19240_chr21-13368468-INS-279;REPEAT_TYPE=TRF;SPAN=279;SVLEN=279;SVTYPE=INS

I did a run with -S and without S and the result vg files are identical. Therefore, I know these insertions are not included. Do you know what could be causing this? Is it the way my VCF is formatted?

Here is a sample of my VCF and a sample of my FASTA:

VCF entry:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA12878
chr1    59599   NA19434_chr1-59599-INS-308      A       NA19434_chr1-20000-80000-ctg7180000000004       5       .       SVTYPE=INS;SVLEN=308;END=59599;MERGE_SOURCE=.;MERGE_SAMPLES=.;MERGE_AC=1;MERGE_AF=0;MERGE_VARIANTS=.;MERGE_VARIANTS_RO=1;CONTIG_SUPPOR
T=3;CONTIG_DEPTH=7;CONTIG=NA19434_chr1-20000-80000-ctg7180000000004;CONTIG_START=3817;CONTIG_END=4125;REPEAT_TYPE=AluY_simple;BKPTID=NA19434_chr1-59599-INS-308;PUBLISHED_ID=NA19434_chr1-59599-INS-308;AC=0;AN=0 GT:CS:CD:RO     .:.:.:.
chr1    90069   NA19240_chr1-90069-INS-59       A       NA19240_chr1-80050-100552|ctg7180000000001|quiver/0_45755       4       .       SVTYPE=INS;SVLEN=59;END=90069;MERGE_SOURCE=.;MERGE_SAMPLES=.;MERGE_AC=1;MERGE_AF=0;MERGE_VARIANTS=.;MERGE_VARIANTS_RO=
1;CONTIG_SUPPORT=2;CONTIG_DEPTH=4;CONTIG=NA19240_chr1-80050-100552|ctg7180000000001|quiver/0_45755;CONTIG_START=14706;CONTIG_END=14765;REPEAT_TYPE=NotMasked;BKPTID=NA19240_chr1-90069-INS-59;PUBLISHED_ID=NA19240_chr1-90069-INS-59;AC=0;AN=0    GT:CS:CD:RO
     .:.:.:.

Corresponding FASTA entry:

>NA19434_chr1-20000-80000-ctg7180000000004
AGGATTCTTTCTTTTTTTTTTTTTTTTTTTTTGAGACGGAGTCTCGCTCTGTCGCCCAGGCTGGAGTGCAGCGGCGCGATCTCGGCTCACTGCAAGCTCCGCCTCCCGGGTTCACGCCATTCTCCTGCCTCAGCCTCCCAAGTAGCTGGGACTACAGGCGCCGCCACTACGCCCGGCTAATTTTTTGTTTTTAGTAGAGACGGGGTTTCACCGTTTTAGCCGGGATGGTCTCGATCTCCTGACTTCGTGATCCTCCCGCCTCGGCTCCAAAGTGCTGGGATTACAGGCGTGAGCCACCGCGCCCGGCC

If you need my files please let me know. My thanks.

cgroza commented 5 years ago

Alternatively, does anyone have a functioning VCF/FASTA pair that I can mimic?

ekg commented 5 years ago

Look in the tests. @edawson?

On Fri, Feb 1, 2019, 21:53 Groza Cristian <notifications@github.com wrote:

Alternatively, does anyone have a functioning VCF/FASTA pair that I can mimic?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2084#issuecomment-459880874, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EaqDpNNXx0Fm6jxQd7QM0mggZK8rks5vJLd0gaJpZM4aeo_U .

edawson commented 5 years ago

Insertion tags in the ALT field need to be wrapped in "<" and ">". Not having these will definitely cause the lookup to fail and print out that error message.

For example, the second VCF entry should be:

chr1 90069 NA19240_chr1-90069-INS-59 A <NA19240_chr1-80050-100552|ctg7180000000001|quiver/0_45755> 4 . SVTYPE=INS;SVLEN=59;END=90069;MERGE_SOURCE=.;MERGE_SAMPLES=.;MERGE_AC=1;MERGE_AF=0;MERGE_VARIANTS=.;MERGE_VARIANTS_RO= 1;CONTIG_SUPPORT=2;CONTIG_DEPTH=4;CONTIG=NA19240_chr1-80050-100552|ctg7180000000001|quiver/0_45755;CONTIG_START=14706;CONTIG_END=14765;REPEAT_TYPE=NotMasked;BKPTID=NA19240_chr1-90069-INS-59;PUBLISHED_ID=NA19240_chr1-90069-INS-59;AC=0;AN=0 GT:CS:CD:RO .:.:.:.

I think this is probably the issue. Your fasta entry looks good (no space between the ">" and the identifier, no spaces in identifier, etc). Happy to debug with your files, sorry for the delay in getting around to this!

cgroza commented 5 years ago

Hi, I tried your advice and I think it put me on the right track. Thanks a lot! Does vg read the genotype field by itself or do I need to prefilter it?

adamnovak commented 5 years ago

When constructing a graph, vg construct ignores the samples and just looks at the variants. So if you have a variant that is 0/0 in all your samples, it still gets added to the graph. If you don't want that behavior, you'll have to filter out all the variants that you don't want in the graph before giving the VCF to vg construct.

cgroza commented 5 years ago

Hi again, So it seems I fixed the SEQ errors but some new ones appear. Vg sometimes crashes, due to an exception thrown in side vcflib. When it does not crash, there is a warning printed from vcflib. I made a repo with my files for debugging puroposes. Does anyone have any clue on what might be the problem? All the files are on the hg38 reference.

https://github.com/cgroza/vg_vcf_fasta

Any help would be appreciated.

adamnovak commented 5 years ago

@cgroza I've looked at this a little bit. It looks like you're hitting this in VCFlib:

https://github.com/vcflib/vcflib/blob/7e3d8066a6c6c0f22191a70d050ead23f9c453b3/src/Variant.cpp#L1870-L1878

It looks like vcflib is trying to use Smith-Waterman padded with enough Z and Q characters to do global alignment, and is failing. Maybe its notion of "enough" is wrong? @ekg, do you want to re-engineer this to use a real global aligner? Or fix the padding algorithm? It looks like you wrote it originally.

We ought to be able to grep through the VCFs with the second Z-and-Q-padded DNA sequence after the error (which contains the alt sequence it is trying to align) in order to work out what variant(s) is/are the real problem. Then I think this has to become a bug report for vcflib, with just that variant and a tiny program to call parsedAlternates() on it

cgroza commented 5 years ago

Also this is caused by only a few VCF entries. I filtered the whole VCF for only NA12878 specific ones, and the problem no longer occurs, although the set of SVs is much smaller now.

ekg commented 5 years ago

It should be updated to a real global aligner. And it must be banded. Maybe dozeu would be the right thing.

edawson commented 5 years ago

VG (and VCFLIB) shouldn't be running Smith-Waterman when -S is passed, as it will reliably reach an example that crashes (especially if you try loading inversions).

I'm seeing S-W crashes on other data as well. Did the behavior of -S get changed at some point? @eldariont have you experienced this with your SV work?

eldariont commented 5 years ago

No, I've never observed something like this.

adamnovak commented 5 years ago

@edawson Do you have a stack trace for these crashes? If passing -S is supposed to make sure we never call parsedAlternates() (although we should still call it on non-structural variants, right?), we need to find where we actually do call it and make that not happen.

edawson commented 5 years ago
screen shot 2019-02-26 at 2 30 10 pm

I finally got a stack trace for this, but it didn't save to file for some reason. It looks like the crash is in S-W, but I don't think we should be doing this if we pass -S.

adamnovak commented 5 years ago

Yeah, that's construct_chunk calling parsedAlternates with -S in the command line. Presumably it thinks it is looking at a normal variant but is actually looking at an SV. We need to go through construct_chunk and work out what call this actually is. Running under gdb could get us a more accurate fix on that.

edawson commented 5 years ago

GDB says we're crashing here: https://github.com/vgteam/vg/blob/2f02b1486547d7c06d9bad1829c9af75302eec2f/src/constructor.cpp#L563-L577

I swear there used to be a condition to check if a variant flagged is_symbolic_sv() here and act accordingly. I'm also confused about how we pass the SV construct tests now, which tells me I probably didn't write very comprehensive tests. Off to go debug.