YibinQiu opened 10 months ago

Hi vg team,

We used ONT data from 100 individuals (20x coverage per individual) for SV calling with Sniffles2, then built a graph from the calls with vg construct. Ten of the 100 individuals also had short-read (PE150, second-generation) sequencing data. For those 10 individuals, we want to compare the SV.vcf produced by vg giraffe/pack/call (-a -A -z) on the short reads against the corresponding ONT SV.vcf, treating the SVs identified from each individual's ONT data as the truth set. We evaluate with the sveval R package and its snakemake pipeline (https://github.com/jmonlong/sveval/tree/master/snakemake).

The F1 scores for DEL (deletions) and INS (insertions) are quite low. We suspected this was because the SV set we gave vg construct was too complex (no merging), so we merged the SV set with Truvari and redid the vg construct/vg giraffe/vg pack/vg call (-a -A -z) steps, but the results were still unsatisfactory. We cannot understand why the scores are so low. Could you give us some guidance? This is very important to us, thank you. pan-merge-nomerge.pdf
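For concreteness, here is a rough sketch of the workflow just described. These are not the exact commands used: file names are placeholders, and vg autoindex stands in for the separate vg construct and indexing steps.

```bash
# Hedged sketch of the merge + graph-genotyping workflow described above.
# File names are placeholders; consult each tool's --help for your versions.

# Merge similar SVs across the Sniffles2 call sets with Truvari
# (input must be sorted, bgzipped, and tabix-indexed).
truvari collapse -i sniffles.100samples.vcf.gz -f ref.fa \
    -o merged.vcf -c removed.vcf

# Build the graph and Giraffe indexes from the merged SV set.
bgzip merged.vcf && tabix -p vcf merged.vcf.gz
vg autoindex --workflow giraffe -r ref.fa -v merged.vcf.gz -p index

# Map one individual's PE150 reads, compute read support, and genotype.
vg giraffe -Z index.giraffe.gbz -m index.min -d index.dist \
    -f reads_1.fq.gz -f reads_2.fq.gz > sample.gam
vg pack -x index.giraffe.gbz -g sample.gam -Q 5 -o sample.pack
vg snarls index.giraffe.gbz > graph.snarls
vg call index.giraffe.gbz -r graph.snarls -k sample.pack -a -A -z > sample.sv.vcf
```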
Yeah, that's much lower than the results we've gotten from, e.g., here, and all the more recent work on MC (Minigraph-Cactus) graphs.
One thing to note is that you shouldn't be using vg call -A (we didn't in the above link) unless you are going to filter out nested sites, since it will make redundant calls. But I'm not sure this is the main driver of your poor results (I'd think it would affect precision more than recall).
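As a hedged illustration of that point, the change is just dropping the -A flag so nested child snarls are not emitted as separate, redundant records (file names here are placeholders; -k points at the vg pack coverage file):

```bash
# Same call as above but without -A: one record per top-level site,
# no redundant nested calls.
vg call graph.gbz -r graph.snarls -k sample.pack -a -z > calls.vcf
```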
Calling SVs from a vg construct graph is always a challenge, as you mentioned, because of the complexity in the graph. It could be that the Truvari merge isn't sufficient; in particular, you probably need to add some kind of allele-frequency filter. For the HPRC work, we used a minimum of 10%.
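One way such a filter could be applied, as a sketch only: filter the merged multi-sample VCF before vg construct, computing allele frequencies with bcftools and using the 10% cutoff mentioned above (assumes a multi-sample VCF; adjust if your AF is stored differently).

```bash
# Annotate allele frequencies across samples, then drop rare alleles.
bcftools +fill-tags merged.vcf.gz -Oz -o tagged.vcf.gz -- -t AF
bcftools view -i 'INFO/AF>=0.1' tagged.vcf.gz -Oz -o filtered.vcf.gz
tabix -p vcf filtered.vcf.gz
```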
There's also the new "personal pangenome" approach, which we used instead of allele-frequency filtering (https://www.biorxiv.org/content/10.1101/2023.12.13.571553v2.abstract). It really helps vg call. We've focused mostly on Minigraph-Cactus pangenomes in that work, but it may help with your graph (especially if your input VCF was phased).
Thank you for your suggestion. Based on the preprint, I found the wiki page for Haplotype-Sampling (https://github.com/vgteam/vg/wiki/Haplotype-Sampling). As I understand it, Haplotype-Sampling uses the individual's short-read (second-generation) data to first select haplotypes from the graph, creating a personalized pangenome, and then proceeds with vg giraffe/pack/call. I noticed the read coverage should be at least 20x, but my short-read data is only ~10x (PE150), which might not meet this requirement. So I am wondering whether it is necessary for me to do this.
Additionally, after building a personalized graph, do I need to redo the vg snarls step (vg snarls sample.gbz > sample.snarls; vg call sample.gbz -r sample.snarls -a -z)? Or can I directly use the snarls file from the original graph (vg call sample.gbz -r graph.snarls -a -z)? I used ~10x short-read data (PE150) and followed the steps outlined on the Haplotype-Sampling wiki page. During the vg call step, I used the snarls from the original graph, since regenerating the snarls file for sample.gbz tends to be time- and memory-intensive.
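For reference, a sketch of the Haplotype-Sampling steps as I read the wiki page; exact flags can differ between vg versions (check vg haplotypes --help), and file names are placeholders.

```bash
# 1. Precompute haplotype information for the original graph (done once).
vg haplotypes -t 16 -H graph.hapl graph.gbz

# 2. Count k-mers in this sample's short reads with KMC
#    (reads_list.txt lists the FASTQ paths; kmc_tmp is a scratch dir).
kmc -k29 -okff -t16 @reads_list.txt sample kmc_tmp

# 3. Sample a personalized (diploid) graph for this individual.
vg haplotypes -t 16 --include-reference --diploid-sampling \
    -i graph.hapl -k sample.kff -g sample.gbz graph.gbz

# 4. Re-map the reads and compute support on the sampled graph.
vg giraffe -Z sample.gbz -f reads_1.fq.gz -f reads_2.fq.gz > sample.gam
vg pack -x sample.gbz -g sample.gam -Q 5 -o sample.pack

# 5. The two variants being compared above: fresh vs. reused snarls.
vg snarls sample.gbz > sample.snarls
vg call sample.gbz -r sample.snarls -k sample.pack -a -z > calls_fresh.vcf
vg call sample.gbz -r graph.snarls -k sample.pack -a -z > calls_reused.vcf
```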