vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

The number of variations in the pan-genome is reduced compared to the variations in the input VCF file #4172

Open wk1352313 opened 9 months ago

wk1352313 commented 9 months ago

Does vg filter out some variants during construction of the pan-genome, and if so, what are the filtering criteria? I built the graph with the command "vg autoindex --workflow giraffe -v sv.vcf -r ref.genome -p sv -t 64". After running vg deconstruct on the resulting graph, the number of variants is almost half the number in the VCF used to construct the graph. What could be the reason for this?

jltsiren commented 9 months ago

The usual reason is that the VCF contains overlapping variants, and vg deconstruct combines them into a single variant. When variants overlap, they don't exist separately in the graph.

If you are dealing with structural variants, vg construct may filter out some of them. See the wiki for further information.

On a more fundamental level, VCF is inadequate for storing anything beyond simple non-overlapping edits to the reference. The standard does not fully specify how overlapping variants should be interpreted, and different tools often have different, subtly incompatible interpretations.
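
To get a rough sense of how much overlap the input VCF contains, one option is to count how many groups of mutually overlapping records it has. The following is a minimal sketch, not part of vg and not a reimplementation of what vg deconstruct does. It assumes uncompressed VCF text and takes each record's reference span as POS through POS + len(REF) - 1; symbolic alleles such as <DEL> with an END field are not handled, so spans for those records would be underestimated. The function names `parse_vcf_records` and `count_overlap_clusters` are hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, int, int]  # (chrom, start, end), 1-based inclusive


def parse_vcf_records(lines: Iterable[str]) -> List[Record]:
    """Extract the reference span of each VCF record from POS and len(REF)."""
    records = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        records.append((chrom, pos, pos + len(ref) - 1))
    return records


def count_overlap_clusters(records: Iterable[Record]) -> int:
    """Count groups of records whose reference spans overlap transitively."""
    clusters = 0
    current_chrom = None
    current_end = -1
    for chrom, start, end in sorted(records):
        if chrom != current_chrom or start > current_end:
            # No overlap with the running cluster: start a new one.
            clusters += 1
            current_chrom, current_end = chrom, end
        else:
            # Overlaps the running cluster: extend it if needed.
            current_end = max(current_end, end)
    return clusters


# Small made-up example: the first two records overlap on chr1.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tACGT\tA",
    "chr1\t102\t.\tG\tT",
    "chr1\t200\t.\tC\tG",
    "chr2\t100\t.\tA\tC",
]
records = parse_vcf_records(example)
print(len(records), "records in", count_overlap_clusters(records), "clusters")
```

If the cluster count is much lower than the record count, overlapping records are a plausible explanation for the drop. The cluster count is only an approximation of the number of sites vg deconstruct would report, since deconstruct works from the graph structure rather than from the VCF coordinates.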

wk1352313 commented 9 months ago

Thank you! Your suggestions have inspired me.
