vgteam / vg

tools for working with genome variation graphs

Low "Total perfect" Read Count in vg stats -a Output #4174

Open YiBenqiu opened 11 months ago

YiBenqiu commented 11 months ago

Hi vg team,

I recently ran vg stats -a on my sample GAM file (sample.gam) generated from a whole-genome sequencing project and observed a low count of "Total perfect" reads. I'm seeking some insights or suggestions on what might be causing this.

Here are the key statistics from the vg stats -a sample.gam output:

I saw in a previous question (#3190) that 'perfect' means there are no mismatches, indels, or soft clips in the alignment. Could the low count of 'perfect' reads be due to my graph containing a large number of variants (about 1 million SVs)? I'm not sure whether this is a normal phenomenon.
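One way to see how far reads fall short of "perfect" is to convert the GAM to JSON (`vg view -a sample.gam`) and look at the per-read identity. Below is a minimal sketch; it assumes one JSON alignment per line with an `identity` field (as in vg's Alignment JSON), and note that identity 1.0 is only a proxy for "perfect", since a soft-clipped read can still have identity 1.0 over its aligned portion:

```python
import json

def perfect_fraction(gam_json_lines):
    """Estimate the fraction of reads aligned with identity 1.0.

    Input: iterable of JSON lines, e.g. from `vg view -a sample.gam`.
    Caveat: identity == 1.0 is an upper bound on vg's "Total perfect",
    because soft clips are not reflected in the identity field.
    """
    total = perfect = 0
    for line in gam_json_lines:
        aln = json.loads(line)
        total += 1
        if aln.get("identity", 0.0) == 1.0:
            perfect += 1
    return perfect / total if total else 0.0
```

Feeding this the JSON stream (e.g. `vg view -a sample.gam | python count_perfect.py`) gives a quick distribution check before digging into graph construction.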

jltsiren commented 11 months ago

It could also mean that none of the haplotypes in the graph is a good match for the reads almost anywhere. What do you have in the graph, how did you build it, and how did you index it? And what are the reads you are mapping to the graph?

Another explanation could be that you have Ns in most reads. With 150 bp reads, you should get alignment score 140 if there is one N at both ends.

YiBenqiu commented 11 months ago

> It could also mean that none of the haplotypes in the graph is a good match for the reads almost anywhere. What do you have in the graph, how did you build it, and how did you index it? And what are the reads you are mapping to the graph?
>
> Another explanation could be that you have Ns in most reads. With 150 bp reads, you should get alignment score 140 if there is one N at both ends.

I used vg autoindex for graph construction with the following command:

vg autoindex --workflow giraffe -r $ref $(for i in ${chrs[*]}; do echo -v ${split_chrs}/convert.${i}_SV.vcf.gz; done) --tmp-dir ${LARGE_DISK_TMP} -t 24 -p out.

Initially, my input SV.vcf.gz had more than 96 individuals. (Reviewing another submitted question, https://github.com/vgteam/vg/issues/4144, we found that in the `autoindex --workflow giraffe` process, once the number of haplotypes reaches 192, vg autoindex downsamples them to 64 synthetic haplotypes.) I first built the graph using all individuals, then ran vg giraffe / vg pack -Q 5 / vg call -a -z, and noticed that many SVs from the input SV.vcf were missing from the vg call results.

We wondered whether we could disregard the authenticity of the haplotypes, since our focus in the subsequent vg call step is the genotypes of individual SVs in the short-read sequencing data. So, to retain more SVs, I extracted 96 individuals from the input SV.vcf and randomly changed their missing genotypes from .|. to 0|1 or 1|1. Then I constructed the graph a second time and reran vg giraffe / vg pack -Q 5 / vg call -a -z. This time the numbers were more consistent. Also, my sequencing data has a depth of at least 30X (PE150).
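The genotype rewriting described above can be sketched like this (illustrative Python, not a vg tool; it assumes GT is the first FORMAT field per VCF 4.x, and the 0|1 / 1|1 choice is the arbitrary assignment described in the comment, not a recommended practice):

```python
import random

def fill_missing_genotypes(vcf_line, rng=random.Random(0)):
    """Replace missing diploid genotypes (.|. or ./.) with a random
    ALT-carrying genotype (0|1 or 1|1) in one VCF data line.

    Caveat: fabricating genotypes this way hands Giraffe haplotypes
    that no real sample carries, which is exactly what jltsiren warns
    about below regarding unphased/fabricated input.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    for i in range(9, len(fields)):  # sample columns start at index 9
        gt, sep, rest = fields[i].partition(":")
        if gt in (".|.", "./."):
            fields[i] = rng.choice(["0|1", "1|1"]) + sep + rest
    return "\t".join(fields)
```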

YiBenqiu commented 11 months ago

I ran the following commands:

vg gbwt -M -Z ${GRAPHS}/out.giraffe.gbz
vg paths -L -G -x ${GRAPHS}/out.giraffe.gbz | wc -l
vg gbwt --tags -Z ${GRAPHS}/out.giraffe.gbz

returned:

9880 paths with names, 65 samples with names, 65 haplotypes, 152 contigs with names
152
reference_samples   
source  jltsiren/gbwt
jltsiren commented 11 months ago

Giraffe is a haplotype-based aligner. If you don't start with properly phased haplotypes (ideally chromosome-length ones), the heuristics in vg autoindex are unlikely to work well. And if you have .|. in the genotype field, vg interprets it as 0|0. But if you have phased haplotypes, the downsampling heuristics should work reasonably, even when starting from thousands of haplotypes.