vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.09k stars 193 forks source link

About the usage of vg deconstruct #4356

Open Tonitsk8264 opened 1 month ago

Tonitsk8264 commented 1 month ago

Hi!

I used vg chunk to extract subgraphs from a pan-genome graph and then used vg deconstruct to obtain VCF files for both the pan-genome graph and the subgraph. However, I noticed that the VCF file for the subgraph contains variants that are not present in the pan-genome graph. Is this expected?

Thank you very much!

企业微信截图_1721986245558

企业微信截图_17219861274506

glennhickey commented 1 month ago

Yeah, this is expected and super annoying. The issue is that the snarl decomposition, which determines the sites in the VCF, is somewhat dependent on the "rooting" of the snarl tree. And this rooting can be insconsistent between a graph and subgraph, when computed independently.

One work-around might be to use vg snarls <graph> -A cactus > <graph>.snarls on both graphs, then pass the output to deconstruct -r. This reverts back to the older snarl logic which I think had some heuristics to anchor snarls to reference paths. @adamnovak how feasible do you think it would be to add something like this back to the new snarl finder?

adamnovak commented 1 month ago

I think the main obstacle to doing something like that was UI; we'd need to come up with a way for the user to explain how they want their snarls rooted and then bring it all the way through, and we'd probably want to have it in anything that can need to make snarls.

Now that we have a stronger notion of reference-sense paths, it might be possible to prefer those for anchoring by default?

Han-Cao commented 1 month ago

Hi,

I am wondering if it is possible to use vg snarls -A cactus (or any other way) to deconstruct consistent variants between haplotype sampled graphs and the original graph? This would be very helpful for SV merging when using haplotype sampling in population study.

glennhickey commented 1 month ago

The idea now is for using haplotype sampling only for mapping. You still call all samples on the full graph (with which the sampled GAMs are compatible)

Han-Cao commented 1 month ago

Hi @glennhickey ,

Thanks for the suggestion, I have tried calling variants on the full graph. I ran vg deconstruct -a and vg call -A -a -z with the same gbz and snarls file and then used vcfbub, decompose tool, and bcftools norm -f ref.fa -m - to generate decomposed biallelic VCFs for comparison.

When considering variants with INDEL length> 20 (i.e., bcftools view -i 'abs(ILEN)>20):

It is really great to see most of variants of vg call are consistent with vg deconstruct, but there are a lot of variants are missing. I read the previous discussion about the difference between vg call and vg deconstruct in #3888 , is such difference still expected? For the variants missing in vg call, is it OK to assign them as './.' or even '0/0'?