vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

vg call crash: Node 6010259 is on backbone path at 316281 but not traced in site 6105521 rev to 6071743 rev that contains it #1946

Closed eldariont closed 5 years ago

eldariont commented 5 years ago

Hi,

toil-vg crashed again on the cactus graph of yeast genomes. I was able to reproduce the issue on my laptop with v1.11.0-74-gdab42acd:

>vg call chunk_Y12.chrIII_0_aug.vg -t 1 -S SRR4074358 -z chunk_Y12.chrIII_0.trans -s chunk_Y12.chrIII_0.support -b chunk_Y12.chrIII_0.vg -r Y12.chrIII -c Y12.chrIII -l 322503 -o 0
##fileformat=VCFv4.2
##ALT=<ID=NON_REF,Des

cription="Represents any possible alternative allele at this location">
##INFO=<ID=XREF,Number=0,Type=Flag,Description="Present in original graph">
##INFO=<ID=XSEE,Number=.,Type=String,Description="Original graph node:offset cross-references">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles">
##FILTER=<ID=lowad,Description="Variant does not meet minimum allele read support threshold of 5">
##FILTER=<ID=highabsdp,Description="Variant has total depth greater than 0">
##FILTER=<ID=highreldp,Description="Variant has total depth greater than 0 times global baseline">
##FILTER=<ID=highlocaldp,Description="Variant has total depth greater than 0 times local baseline">
##FILTER=<ID=lowxadl,Description="Variant has AD log likelihood less than -9">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=XDP,Number=2,Type=Integer,Description="Expected Local and Global Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=XADL,Number=1,Type=Float,Description="Likelihood of allelic depths for called alleles">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Forward and reverse support for ref and alt alleles.">
##FORMAT=<ID=XAAD,Number=1,Type=Integer,Description="Alt allele read count.">
##FORMAT=<ID=AL,Number=.,Type=Float,Description="Allelic likelihoods for the ref and alt alleles in the order listed">
##contig=<ID=Y12.chrIII,length=322503>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SRR4074358
Y12.chrIII  7206    6036561_5989208_6133823_6074745-6036561_5989208_6118195_6074745 T   A   2491    lowxadl DP=308;SVLEN=0;XSEE=82,83,84,85,5412161,5807682,5807683;XREF    GT:DP:XDP:AD:XADL:SB:XAAD   0/1:308:154,216:240,68:-53.475054:31,209,1,67:68
Y12.chrIII  211 6144324_6107873-6144324_6011444_6107873 CAC C   1052    PASS    DP=98;SVLEN=-2;XSEE=5095336,5095337,5095338;XREF    GT:DP:XDP:AD:XADL:SB:XAAD   1/0:98:56,216:42,56:-2.359967:0,42,0,56:56
Y12.chrIII  5795    6056451_6074012_6112048_6161926-6056451_5990515_6112048_6161926 A   T   15  lowad   DP=4;SVLEN=0;XREF   GT:DP:XDP:AD:XADL:SB:XAAD   0/1:4:3,216:3,1:-1.163151:1,2,0,1:1
Y12.chrIII  6400    6050181_6131609_6072324 G   A   15  lowad   DP=1;SVLEN=0;XREF   GT:DP:XDP:AD:XADL:SB:XAAD   1/1:1:4,216:0,1:.:0,0,0,1:1
Node 6010259 is on backbone path at 316281 but not traced in site 6105521 rev to 6071743 rev that contains it.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Extra ref node found
ERROR: Signal 6 occurred. VG has crashed. Run 'vg bugs --new' to report a bug.
Stack trace path: /tmp/vg_crash_9efnmQ/stacktrace.txt

Files: EDIT http://public.gi.ucsc.edu/~daheller/crash/

It would be great if someone could help me with this error as I don't really know what it means. Maybe @adamnovak ? :)

Many thanks David

adamnovak commented 5 years ago

Basically, the problem is that the caller traced through the site from start to end along the reference path, but then found another node on the reference path inside the site that it didn't trace through. It has determined something is inconsistent, and crashed.

I'm having trouble downloading the files from the Dropbox web site; it just sits and doesn't ever produce a file when I hit "Direct Download". Can you post a direct link to the zip that I can wget? Or give me a path on Courtyard to look in?

eldariont commented 5 years ago

Sorry for the problem with the files. Try the files in this directory: http://public.gi.ucsc.edu/~daheller/crash/

adamnovak commented 5 years ago

OK, this could be a tricky problem:

Graph image

The underlying issue is that the reference path we are trying to call against goes through the snarl we are trying to call twice, taking two different paths through it. That's how we manage to trace the reference path through the snarl and still have left-over untraced reference nodes.

vg call is not currently designed to deal with this situation. We can't handle cases where the reference path we are calling against contains duplications which are represented as cycles in the graph. We assume that every snarl we are calling appears at exactly one place on the reference path we are calling against.

I'm not sure what changing that assumption in vg call would take; we'd at least have to produce multiple VCF records to represent the different occurrences of the snarl. We'd have to somehow apportion alleles of the snarl between those records, which is in general a phasing problem, which is something that call doesn't address at all right now. We'd also have to move away from call's reference/best/second-best model to something that can accommodate multiple references and calls with more than 3 alleles under consideration, if we're going to handle >diploid sites. I think vg genotype already has support for considering larger numbers of alleles, but on the other hand it seems to be this ref/best/second-best model that lets us write the useful heuristics that make call good.

The other option is to just exclude these doubled-up regions from calling, because they are doubled-up in the reference graph, but not crash while we do it.

To work around this, it looks like what you need to do is to set the refGenome option of hal2vg to the genome you are planning to call against, and leave the refDupes option off as it is by default. Then hal2vg will not merge any bases in that genome against each other when importing the Cactus alignments, which means that it will be a well-behaved reference path for use with vg call.

adamnovak commented 5 years ago

OK, I talked to @eldariont and we concluded that we want to keep vg call constricted to calling against non-cyclic reference paths for now, because the alternative is a bit muzzy, and because we don't want to do things like ignoring swathes of your target reference.

I'm going to try and improve the error message slightly, to explain how this can happen when the reference path doubles back through the graph, and close the issue with that.