eldariont closed 5 years ago
Basically, the problem is that the caller traced through the site from start to end along the reference path, but then found another node on the reference path inside the site that it didn't trace through. It has determined something is inconsistent, and crashed.
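To make that failure mode concrete, here is a rough sketch of the kind of consistency check being described. It is not the actual vg code: the function name, the list-of-node-IDs path representation, and the toy site are all made up for illustration.

```python
def untraced_reference_nodes(ref_path, site_start, site_end, site_nodes):
    """Return reference-path nodes inside the site that a single
    start-to-end trace along the reference path does not cover.

    Hypothetical toy model: ref_path is a list of node IDs in path order,
    site_nodes is the set of node IDs contained in the site.
    """
    # Trace from the first visit of site_start to the next visit of site_end.
    begin = ref_path.index(site_start)
    end = ref_path.index(site_end, begin)
    traced = set(ref_path[begin:end + 1])
    # Any other reference node inside the site was never traced through.
    on_reference = set(ref_path) & set(site_nodes)
    return sorted(on_reference - traced)


# Hypothetical site bounded by nodes 2 and 5, with internal nodes 3 and 4.
site = {2, 3, 4, 5}

# The reference path crosses the site once: nothing is left over.
print(untraced_reference_nodes([1, 2, 3, 5, 6], 2, 5, site))  # []

# The reference path crosses the site twice, via 3 and then via 4:
# node 4 is on the reference path inside the site but was not traced.
print(untraced_reference_nodes([1, 2, 3, 5, 6, 2, 4, 5, 7], 2, 5, site))  # [4]
```

A non-empty result corresponds to the inconsistency that makes the caller give up.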
I'm having trouble downloading the files from the Dropbox web site; it just sits and doesn't ever produce a file when I hit "Direct Download". Can you post a direct link to the zip that I can wget? Or give me a path on Courtyard to look in?
Sorry for the problem with the files. Try the files in this directory: http://public.gi.ucsc.edu/~daheller/crash/
OK, this could be a tricky problem:
The underlying issue is that the reference path we are trying to call against goes through the snarl we are trying to call twice, taking two different paths through it. That's how we manage to trace the reference path through the snarl and still have left-over untraced reference nodes.
vg call is not currently designed to deal with this situation. We can't handle cases where the reference path we are calling against contains duplications which are represented as cycles in the graph. We assume that every snarl we are calling appears at exactly one place on the reference path we are calling against.
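If you want to check a reference path for this up front, something like the following would do as a first pass. Again this is only a sketch under assumptions: it treats the path as a plain list of node IDs (which you would have to extract from the graph yourself) and ignores node orientation, and the function name is made up.

```python
from collections import Counter


def revisited_nodes(ref_path):
    """Map each node ID that the path visits more than once to its visit count.

    Simplification: orientation is ignored, so a node visited once forward
    and once in reverse still counts as revisited.
    """
    counts = Counter(ref_path)
    return {node: n for node, n in counts.items() if n > 1}


# A path that never doubles back satisfies the one-place-per-snarl assumption.
print(revisited_nodes([1, 2, 3, 5, 6]))  # {}

# A path that loops back through nodes 2 and 5 does not.
print(revisited_nodes([1, 2, 3, 5, 6, 2, 4, 5, 7]))  # {2: 2, 5: 2}
```

An empty result means the path never passes through the same node twice, which is the situation the caller expects.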
I'm not sure what changing that assumption in vg call would take; we'd at least have to produce multiple VCF records to represent the different occurrences of the snarl. We'd have to somehow apportion alleles of the snarl between those records, which is in general a phasing problem, which is something that call doesn't address at all right now. We'd also have to move away from call's reference/best/second-best model to something that can accommodate multiple references and calls with more than 3 alleles under consideration, if we're going to handle >diploid sites. I think vg genotype already has support for considering larger numbers of alleles, but on the other hand it seems to be this ref/best/second-best model that lets us write the useful heuristics that make call good.
The other option is to just exclude these doubled-up regions from calling, because they are doubled-up in the reference graph, but not crash while we do it.
To work around this, it looks like what you need to do is to set the refGenome option of hal2vg to the genome you are planning to call against, and leave the refDupes option off as it is by default. Then hal2vg will not merge any bases in that genome against each other when importing the Cactus alignments, which means that it will be a well-behaved reference path for use with vg call.
OK, I talked to @eldariont and we concluded that we want to keep vg call restricted to calling against non-cyclic reference paths for now, because the alternative is a bit muzzy, and because we don't want to do things like ignoring swathes of your target reference.
I'm going to try and improve the error message slightly, to explain how this can happen when the reference path doubles back through the graph, and close the issue with that.
Hi,
toil-vg crashed again on the cactus graph of yeast genomes. I was able to reproduce the issue on my laptop with v1.11.0-74-gdab42acd:
Files: EDIT http://public.gi.ucsc.edu/~daheller/crash/
It would be great if someone could help me with this error as I don't really know what it means. Maybe @adamnovak ? :)
Many thanks,
David