vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

How do different methods affect precision? #4366

Closed zhengluo-lz closed 2 months ago

zhengluo-lz commented 2 months ago

I am comparing two workflows. In the first, I split the VCF file by chromosome, construct a separate VG file for each chromosome, and then perform SV (structural variant) identification. In the second, I construct a VG file for the whole genome, partition it with vg chunk, and then perform SV identification. Will the precision of these two methods differ?

jeizenga commented 2 months ago

You can't really have a pipeline that's split by chromosome the entire way through. At some point you will need to map reads to the graph, and when you do that, you need to have the full graph available to the mapping algorithm.

zhengluo-lz commented 2 months ago

I have another question. I observed that precision and recall increase as MAF (minor allele frequency) increases. Do you set a MAF threshold and evaluate precision and recall only for sites above it?

jeizenga commented 2 months ago

Yes, higher MAF increases the chances that the variant is actually observed in the sample, so in general, higher MAF variants are more likely to be useful. I've seen different thresholds used in practice, but I don't know of a place where anyone has quantified the precision/recall effects of different thresholds.

There is a small literature on variant selection for pangenomes that you may be interested in looking at:

https://academic.oup.com/bioinformatics/article/37/Supplement_1/i460/6319683
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1595-x
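
For anyone who wants to check the effect of a MAF cutoff on their own calls, here is a minimal sketch of how precision and recall could be tabulated at several thresholds. It is not part of vg. The `Variant` class, the `precision_recall_by_maf` function, the site keys, and the toy MAF values are all hypothetical, and it assumes that calls and truth sites can be matched by an exact key, which is a simplification for SVs (real comparisons usually allow some tolerance in breakpoints).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    key: str    # hypothetical site identifier, e.g. "chr1:12345:DEL"
    maf: float  # minor allele frequency of the site in the panel


def precision_recall_by_maf(truth, calls, thresholds):
    """Return {threshold: (precision, recall)} using only sites with MAF >= threshold."""
    results = {}
    for t in thresholds:
        # Restrict both sets to sites at or above the MAF cutoff.
        truth_keys = {v.key for v in truth if v.maf >= t}
        call_keys = {v.key for v in calls if v.maf >= t}
        true_positives = len(truth_keys & call_keys)
        # Report NaN when a set is empty so the ratio is not misread as 0.
        precision = true_positives / len(call_keys) if call_keys else float("nan")
        recall = true_positives / len(truth_keys) if truth_keys else float("nan")
        results[t] = (precision, recall)
    return results


# Toy data (made up for illustration only).
truth = [Variant("a", 0.01), Variant("b", 0.05), Variant("c", 0.20), Variant("d", 0.40)]
calls = [Variant("b", 0.05), Variant("c", 0.20), Variant("d", 0.40),
         Variant("e", 0.01), Variant("f", 0.02)]

results = precision_recall_by_maf(truth, calls, thresholds=[0.0, 0.05, 0.1])
for threshold, (precision, recall) in results.items():
    print(f"MAF >= {threshold}: precision={precision:.2f} recall={recall:.2f}")
```

In this toy data the rare sites account for all of the missed and false calls, so both metrics rise once the cutoff excludes them. Whether real data behaves the same way depends on the sample and the panel.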

zhengluo-lz commented 2 months ago

Thanks!