Open Tonitsk8264 opened 1 month ago
Yeah, this is expected and super annoying. The issue is that the snarl decomposition, which determines the sites in the VCF, is somewhat dependent on the "rooting" of the snarl tree. And this rooting can be insconsistent between a graph and subgraph, when computed independently.
One work-around might be to use vg snarls <graph> -A cactus > <graph>.snarls
on both graphs, then pass the output to deconstruct -r
. This reverts back to the older snarl logic which I think had some heuristics to anchor snarls to reference paths. @adamnovak how feasible do you think it would be to add something like this back to the new snarl finder?
I think the main obstacle to doing something like that was UI; we'd need to come up with a way for the user to explain how they want their snarls rooted and then bring it all the way through, and we'd probably want to have it in anything that can need to make snarls.
Now that we have a stronger notion of reference-sense paths, it might be possible to prefer those for anchoring by default?
Hi,
I am wondering if it is possible to use vg snarls -A cactus
(or any other way) to deconstruct consistent variants between haplotype sampled graphs and the original graph? This would be very helpful for SV merging when using haplotype sampling in population study.
The idea now is for using haplotype sampling only for mapping. You still call all samples on the full graph (with which the sampled GAMs are compatible)
Hi @glennhickey ,
Thanks for the suggestion, I have tried calling variants on the full graph. I ran vg deconstruct -a
and vg call -A -a -z
with the same gbz and snarls file and then used vcfbub
, decompose tool, and bcftools norm -f ref.fa -m -
to generate decomposed biallelic VCFs for comparison.
When considering variants with INDEL length> 20 (i.e., bcftools view -i 'abs(ILEN)>20
):
vg call
is ~20% of vg deconstruct
vg call
can be exactly matched with one variant in vg deconstruct
by chr, pos, ref, altIt is really great to see most of variants of vg call
are consistent with vg deconstruct
, but there are a lot of variants are missing. I read the previous discussion about the difference between vg call
and vg deconstruct
in #3888 , is such difference still expected? For the variants missing in vg call
, is it OK to assign them as './.' or even '0/0'?
Hi!
I used vg chunk to extract subgraphs from a pan-genome graph and then used vg deconstruct to obtain VCF files for both the pan-genome graph and the subgraph. However, I noticed that the VCF file for the subgraph contains variants that are not present in the pan-genome graph. Is this expected?
Thank you very much!