Open briannadon opened 3 years ago
Yes, this is a known issue: call
and deconstruct
both only emit the "top-level" bubble, and alleles through that. So any variants nested in your inversion will come out embedded in giant alleles that include the inversion itself.
In theory, there's not too much engineering needed to produce the nested variants, as we have most of this information in the snarl decomposition, though VCF isn't really designed for it. Hopefully it will make it into a future release soon.
Ok, thanks for the answer. I was wondering if calculating snarls
with a higher --max-nodes
would make a difference, or maybe even running deconstruct
with all-snarls
, but correct me if I understand this wrong: this limitation is inherent to the call
and deconstruct
commands, not the underlying snarl computation?
1. What were you trying to do? I have created some graphs from either PGGB or a multi-sequence alignment from 10 salmonella sequences, with a large (~1Mbp) inversion in the middle. Then, I wanted to inspect variants from that graph either with
vg deconstruct
or by feeding short reads into the graph with thevg call
pipeline.2. What did you want to happen? I expected to see perhaps one large variant indicating an inversion, and then small variants within that inversion: SNPS, indels, etc.
3. What actually happened? Instead, the VCFs in both the
deconstruct
andcall
case produced one giant 1Mbp variant, approximately the length of the entire inversion, with 10 different genotypes, and no SNPs etc within.However, importantly, aligning these genomes with a different approach that removes the inversion, producing a multi-fasta alignment, and creating a graph with something like:
does produce the desired behavior, at the cost of not representing the inversion at all.
5. What data and command can the vg dev team use to make the problem happen? I have input GFAs and VGs available on request, don't know if I can attach files here somehow, sorry if I missed that ability.
6. What does running
vg version
say?