Low "Total perfect" Read Count in vg stats -a Output

YiBenqiu commented 11 months ago

Hi vg team,

I recently ran vg stats -a on my sample GAM file (sample.gam) generated from a whole-genome sequencing project and observed a low count of "Total perfect" reads. I'm seeking some insights or suggestions on what might be causing this.

Here are the key statistics from the vg stats -a sample.gam output:

Total alignments: 588454870
Total primary: 588454870
Total secondary: 0
Total aligned: 586276592
Total perfect: 67686
Total gapless (softclips allowed): 553484228
Total paired: 588454870
Total properly paired: 581549144
Alignment score: mean 118.233, median 140, stdev 46.2037, max 160 (67686 reads)
Mapping quality: mean 46.8843, median 60, stdev 23.3427, max 60 (428335113 reads)
Insertions: 36071035 bp in 15965942 read events
Deletions: 57202733 bp in 21926589 read events
Substitutions: 794004238 bp in 794004238 read events
Softclips: 17409353571 bp in 562438154 read events
Total time: 1.10365e+06 seconds
Speed: 533.191 reads/second

I saw in a previous question that the answer stated: 'perfect' means that there are no mismatches, indels, or soft clips in the alignment #3190 . Could the low count of 'perfect' reads be due to my graph genome containing a large number of variations (about 1 million SVs)? I'm not sure if this is a normal phenomenon

jltsiren commented 11 months ago

It could also mean that none of the haplotypes in the graph is a good match for the reads almost anywhere. What do you have in the graph, how did you build it, and how did you index it? And what are the reads you are mapping to the graph?

Another explanation could be that you have Ns in most reads. With 150 bp reads, you should get alignment score 140 if there is one N at both ends.

YiBenqiu commented 11 months ago

It could also mean that none of the haplotypes in the graph is a good match for the reads almost anywhere. What do you have in the graph, how did you build it, and how did you index it? And what are the reads you are mapping to the graph?

Another explanation could be that you have Ns in most reads. With 150 bp reads, you should get alignment score 140 if there is one N at both ends.

I used vg autoindex for graph construction with the following command:

vg autoindex --workflow giraffe -r $ref $(for i in ${chrs[*]}; do echo -v ${split_chrs}/convert.${i}_SV.vcf.gz; done) --tmp-dir ${LARGE_DISK_TMP} -t 24 -p out.

Initially, my input SV.vcf.gz had more than 96 individuals (Upon reviewing other submitted questions https://github.com/vgteam/vg/issues/4144, we found that in the 'autoindex --workflow giraffe' process, once the number of haplotypes reaches 192, vg autoindex downsamples them to 64 synthetic haplotypes.), so I first built the graph using all individuals, then ran vg giraffe/vg pack -Q 5/vg call -a -z. I noticed that many SVs from the input SV.vcf were missing in the vg call results. We thought whether it's possible to disregard the authenticity of the haplotypes since our focus is on the genotypes of individual SVs in the second-generation sequencing data during subsequent vg call. Therefore, to retain more SVs, I extracted 96 individuals from the input SV.vcf and randomly changed their genotypes from .|. to 0|1 or 1|1. Then, I constructed the graph a second time and reran vg giraffe/vg pack -Q 5/vg call -a -z. This time, the numbers were more consistent, and also, my sequencing data has a depth of at least 30X (PE150).

YiBenqiu commented 11 months ago

I ran the following command:

vg gbwt -M -Z ${GRAPHS}/out.giraffe.gbz
vg paths -L -G -x ${GRAPHS}/out.giraffe.gbz | wc -l
vg gbwt --tags -Z ${GRAPHS}/out.giraffe.gbz

returned:

9880 paths with names, 65 samples with names, 65 haplotypes, 152 contigs with names
152
reference_samples   
source  jltsiren/gbwt

jltsiren commented 11 months ago

Giraffe is a haplotype-based aligner. If you don't start with properly phased haplotypes (ideally chromosome-length ones), the heuristics in vg autoindex are unlikely to work well. And if you have .|. in the genotype field, vg interprets it as 0|0. But if you have phased haplotypes, the downsampling heuristics should work reasonably, even when starting from thousands of haplotypes.

vgteam / vg

Low "Total perfect" Read Count in vg stats -a Output #4174