vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

Handling of inversions in VCFs - no called variants within large inversion #3196

Open briannadon opened 3 years ago

briannadon commented 3 years ago

1. What were you trying to do? I created graphs from either PGGB or a multi-sequence alignment of 10 Salmonella sequences, with a large (~1 Mbp) inversion in the middle. I then wanted to inspect the variants in those graphs, either with vg deconstruct or by feeding short reads into the graph with the vg call pipeline.

2. What did you want to happen? I expected to see perhaps one large variant indicating the inversion, and then small variants within that inversion: SNPs, indels, etc.

3. What actually happened? Instead, in both the deconstruct and call cases, the VCF contained one giant ~1 Mbp variant, approximately the length of the entire inversion, with 10 different genotypes and no SNPs, indels, etc. within it.

However, and importantly, aligning these genomes with a different approach that removes the inversion, producing a multi-FASTA alignment, and then building a graph with something like:

vg construct -M multi_align.fasta -F fasta

does produce the desired behavior, at the cost of not representing the inversion at all.
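For concreteness, the two routes above can be sketched as shell commands. The file and path names here are placeholders, and the flags are as I understand them for vg v1.30; treat this as a sketch, not my exact invocations.

```shell
# Placeholder file/path names; requires vg on PATH (sketched against v1.30.0).
if command -v vg >/dev/null 2>&1; then
    # Route A: inversion-free MSA -> graph -> per-site variants (works).
    vg construct -M multi_align.fasta -F fasta > msa_graph.vg
    vg deconstruct -p REFERENCE_PATH msa_graph.vg > msa_variants.vcf

    # Route B: graph containing the inversion -> one giant ~1 Mbp record (the problem).
    vg deconstruct -p REFERENCE_PATH inversion_graph.vg > inv_variants.vcf
else
    echo "vg not on PATH; commands shown for illustration only"
fi
```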

5. What data and command can the vg dev team use to make the problem happen? I have the input GFAs and VGs available on request; I don't know whether I can attach files here somehow, sorry if I missed that ability.

6. What does running vg version say?

vg version v1.30.0 "Carentino"
Compiled with g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 on Linux
Linked against libstd++ 20200808
Built by anovak@octagon
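For reference, the read-based calling route from (1) was roughly the following (again placeholder names, flags as of ~v1.30; the indexing step in particular may need extra options such as graph pruning):

```shell
if command -v vg >/dev/null 2>&1; then
    # Index the graph for mapping (GCSA construction may need pruning first).
    vg index -x graph.xg -g graph.gcsa graph.vg
    # Map short reads into the graph.
    vg map -x graph.xg -g graph.gcsa -f reads.fq > aln.gam
    # Compute read support along the graph, then call variants from it.
    vg pack -x graph.xg -g aln.gam -o aln.pack
    vg call graph.xg -k aln.pack > calls.vcf
else
    echo "vg not on PATH; commands shown for illustration only"
fi
```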
glennhickey commented 3 years ago

Yes, this is a known issue: call and deconstruct both emit only the "top-level" bubbles, and the alleles through them. So any variants nested inside your inversion will come out embedded in giant alleles that include the inversion itself.
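To illustrate the difference (a toy sketch in plain shell, nothing to do with vg's internals): if the snarl tree is stored as parent/child pairs, a top-level-only writer emits one record per root snarl, flattening everything nested inside it.

```shell
# Toy snarl tree as "parent child" pairs; "-" marks a root snarl.
snarls='- inversion_1Mbp
inversion_1Mbp snp_1
inversion_1Mbp indel_1
inversion_1Mbp snp_2'

# Top-level-only emission (what call/deconstruct currently do): roots only.
printf '%s\n' "$snarls" | awk '$1 == "-" { print $2 }'

# Nested-aware emission would also surface the SNP/indel snarls inside.
printf '%s\n' "$snarls" | awk '{ print $2 }'
```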

In theory, there's not too much engineering needed to produce the nested variants, as we have most of this information in the snarl decomposition, though VCF isn't really designed for it. Hopefully it will make it into a future release soon.

briannadon commented 3 years ago

OK, thanks for the answer. I was wondering whether calculating snarls with a higher --max-nodes would make a difference, or maybe even running deconstruct with --all-snarls, but correct me if I'm understanding this wrong: is this limitation inherent to the call and deconstruct commands themselves, rather than to the underlying snarl computation?
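Concretely, what I had in mind was something like the following (placeholder names and bound; I'm not certain these flags exist in exactly this form on v1.30, so treat it as a sketch):

```shell
if command -v vg >/dev/null 2>&1; then
    # Recompute snarls with a larger node bound (placeholder value).
    vg snarls -m 100000 graph.vg > graph.snarls
    # Hand the precomputed snarls to deconstruct (if the build supports -r).
    vg deconstruct -p REFERENCE_PATH -r graph.snarls graph.vg > variants.vcf
else
    echo "vg not on PATH; commands shown for illustration only"
fi
```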