Open cgroza opened 5 years ago
Alternatively, does anyone have a functioning VCF/FASTA pair that I can mimic?
Look in the tests. @edawson?
On Fri, Feb 1, 2019, 21:53 Groza Cristian <notifications@github.com wrote:
Alternatively, does anyone have a functioning VCF/FASTA pair that I can mimic?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/2084#issuecomment-459880874, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EaqDpNNXx0Fm6jxQd7QM0mggZK8rks5vJLd0gaJpZM4aeo_U .
Insertion tags in the ALT field need to be wrapped in "<" and ">". Not having these will definitely cause the lookup to fail and print out that error message.
For example, the second VCF entry should be:
chr1 90069 NA19240_chr1-90069-INS-59 A <NA19240_chr1-80050-100552|ctg7180000000001|quiver/0_45755> 4 . SVTYPE=INS;SVLEN=59;END=90069;MERGE_SOURCE=.;MERGE_SAMPLES=.;MERGE_AC=1;MERGE_AF=0;MERGE_VARIANTS=.;MERGE_VARIANTS_RO= 1;CONTIG_SUPPORT=2;CONTIG_DEPTH=4;CONTIG=NA19240_chr1-80050-100552|ctg7180000000001|quiver/0_45755;CONTIG_START=14706;CONTIG_END=14765;REPEAT_TYPE=NotMasked;BKPTID=NA19240_chr1-90069-INS-59;PUBLISHED_ID=NA19240_chr1-90069-INS-59;AC=0;AN=0 GT:CS:CD:RO .:.:.:.
I think this is probably the issue. Your fasta entry looks good (no space between the ">" and the identifier, no spaces in identifier, etc). Happy to debug with your files, sorry for the delay in getting around to this!
Hi, I tried your advice and I think it put me on the right track. Thanks a lot! Does vg read the genotype field by itself or do I need to prefilter it?
When constructing a graph, vg construct
ignores the samples and just looks at the variants. So if you have a variant that is 0/0
in all your samples, it still gets added to the graph. If you don't want that behavior, you'll have to filter out all the variants that you don't want in the graph before giving the VCF to vg construct
.
Hi again, So it seems I fixed the SEQ errors but some new ones appear. Vg sometimes crashes, due to an exception thrown in side vcflib. When it does not crash, there is a warning printed from vcflib. I made a repo with my files for debugging puroposes. Does anyone have any clue on what might be the problem? All the files are on the hg38 reference.
https://github.com/cgroza/vg_vcf_fasta
Any help would be appreciated.
@cgroza I've looked at this a little bit. It looks like you're hitting this in VCFlib:
It looks like vcflib is trying to use Smith-Waterman padded with enough Z
and Q
characters to do global alignment, and is failing. Maybe its notion of "enough" is wrong? @ekg, do you want to re-engineer this to use a real global aligner? Or fix the padding algorithm? It looks like you wrote it originally.
We ought to be able to grep through the VCFs with the second Z-and-Q-padded DNA sequence after the error (which contains the alt sequence it is trying to align) in order to work out what variant(s) is/are the real problem. Then I think this has to become a bug report for vcflib, with just that variant and a tiny program to call parsedAlternates()
on it
Also this is caused by only a few VCF entries. I filtered the whole VCF for only NA12878 specific ones, and the problem no longer occurs, although the set of SVs is much smaller now.
It should be updated to a real global aligner. And it must be banded. Maybe dozeu would be the right thing.
VG (and VCFLIB) shouldn't be running Smith-Waterman when -S
is passed, as it will reliably reach an example that crashes (especially if you try loading inversions).
I'm seeing S-W crashes on other data as well. Did the behavior of -S
get changed at some point? @eldariont have you experienced this with your SV work?
No, I've never observed something like this.
@edawson Do you have a stack trace for these crashes? If passing -S
is supposed to make sure we never call parsedAlternates()
(although we should still call it on non-structural variants, right?), we need to find where we actually do call it and make that not happen.
I finally got a stack trace for this, but it didn't save to file for some reason. It looks like the crash is in S-W, but I don't think we should be doing this if we pass -S
.
Yeah, that's construct_chunk
calling parsedAlternates
with -S
in the command line. Presumably it thinks it is looking at a normal variant but is actually looking at an SV. We need to go through construct_chunk
and work out what call this actually is. Running under gdb could get us a more accurate fix on that.
GDB says we're crashing here: https://github.com/vgteam/vg/blob/2f02b1486547d7c06d9bad1829c9af75302eec2f/src/constructor.cpp#L563-L577
I swear there used to be a condition to check if a variant flagged is_symbolic_sv()
here and act accordingly. I'm also confused about how we pass the SV construct tests now, which tells me I probably didn't write very comprehensive tests. Off to go debug.
Please describe:
Hi, I am trying to produce a graph that contains large insertions. I have a set of VCF calls that contains the contig name in the ALT field, and a corresponding FASTA file indicating the sequence to be inserted. When I run
I get a lot of warnings such as
I did a run with -S and without S and the result vg files are identical. Therefore, I know these insertions are not included. Do you know what could be causing this? Is it the way my VCF is formatted?
Here is a sample of my VCF and a sample of my FASTA:
VCF entry:
Corresponding FASTA entry:
If you need my files please let me know. My thanks.