vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

graph genome did not improve the short reads mapping #3614

Open biozzq opened 2 years ago

biozzq commented 2 years ago

Dear all,

When comparing graph genome and linear genome in short reads mapping, I found that graph genome did not improve the short reads mapping. Has this phenomenon also occurred in your side? Furthermore, I wonder that how can i evaluate the quality of the graph genome contruction, and which indices can be used as a good measure of the quality. Thank you in advance.

$vg stats -a DB1.gam
Total alignments: 159719776
Total primary: 159719776
Total secondary: 0
Total aligned: 104789794
Total perfect: 55712350
Total gapless (softclips allowed): 101320330
Total paired: 159719776
Total properly paired: 103162690
Insertions: 4637982 bp in 2216300 read events
Deletions: 6511681 bp in 2909677 read events
Substitutions: 114416338 bp in 114416338 read events
Softclips: 992003018 bp in 17342859 read events
$samtools flagstat DB.sam
159781961 + 0 in total (QC-passed reads + QC-failed reads)
159719776 + 0 primary
0 + 0 secondary
62185 + 0 supplementary
0 + 0 duplicates
0 + 0 primary duplicates
157616534 + 0 mapped (98.64% : N/A)
157554349 + 0 primary mapped (98.64% : N/A)
159719776 + 0 paired in sequencing
79859888 + 0 read1
79859888 + 0 read2
154129718 + 0 properly paired (96.50% : N/A)
156358852 + 0 with itself and mate mapped
1195497 + 0 singletons (0.75% : N/A)
2130432 + 0 with mate mapped to a different chr
1367059 + 0 with mate mapped to a different chr (mapQ>=5)

Sincerely, Zheng zhuqing

lt11 commented 2 years ago

Hi,

Which metrics did you take into account to compare the mapping on the graph (vg map, default settings) vs the mapping on the linear genome (with bwa, default settings)? And which software/settings did you use? Last but not least, which organism? I made some tests (with yeast sequences) using the percentage of mapped reads and the percentage of perfect mappings. In both cases the graph worked better.

Best,

Lorenzo

biozzq commented 2 years ago

Dear @lt11

Thank you. I used following command to map the short reads to the graph genome which was constructed by using pggb. The graph genome was constructed using the two sets of partially phased contigs generated by hifisam. When mapping to linear reference genome, the BWA-MEM default settings were used. The species i analyzed is the pig. In my side, I'm a little concerned about the quality of the graph genome. So, I want to know how to evaluate it. Thank you for your help.

vg (version v1.38.0) autoindex -g DB1.graph.fa.8c34847.4030258.a9c39c9.smooth.gfa -w giraffe -t 4 -T ./ -p DB1
vg giraffe -Z DB1.giraffe.gbz -m DB1.min -d DB1.dist -f DB_R1.fq.gz -f DB_R2.fq.gz > DB1.gam

Sincerely, Zheng zhuqing

lt11 commented 2 years ago

So our approaches are quite different. But I would like to stress that the benefits of using a graph can be quantified using a number of measures (e.g. the total number of mapped reads, the number of reads with perfect matches, …) or specific cases (e.g. regions which are not included in the linear reference). I'm afraid that other measures (e.g. the MAPQ) instead do not really fit this task.

adamnovak commented 2 years ago

It looks like about a third of your reads are unmapped in that GAM file, which is quite clearly worse than the SAM file.

We haven't really tried Giraffe on PGGB graphs; we've tuned it for Cactus/Minigraph-based graphs mostly. So I think you do need to look at the quality of the graph genome construction, or at least the degree to which it looks like Giraffe expects.

One metric of graph genome quality is of course how well reads can be mapped to it; that metric is pretty low here.

You can also look at the total number of bases in the graph (vg stats -l DB1.giraffe.gbz); it ought to be within maybe 10% of the total length of your linear reference. If it is much smaller, either the graph is missing sequence, or it has collapsed sequence together that maybe shouldn't be. If it is much larger, then pieces of your assemblies that correspond might not be being aligned.

Another way to evaluate the graph is to look at the GFA, or part of it, in a viewer like Bandage. Does it look like mostly-linear backbones that branch apart to denote variation and then come together again? That's what Giraffe expects. It's OK if the graph is fragmented, but if it is all tied together into one big hairball because e.g. mobile elements or centromeric sequences from different places are collapsed together, Giraffe might not work as well.

The graphs are also only as good as the assemblies; how good are the assemblies? If you want a good graph it can definitely help to make it with at least one reference-quality assembly.

lt11 commented 2 years ago

Hi Adam,

Just to be 100% sure I got it correctly: when you wrote "it ought to be within maybe 10% of the total length of your linear reference" you meant that the graph length should be between 90% and 110% of the linear reference?

I'm asking because I'm checking the graph length as a function of the parameters for graph construction. And those numbers make sense also in my case.

Thanks a lot,

Lorenzo

adamnovak commented 2 years ago

@lt11 Yes, that's what I mean. And it's just a guess; a graph with a lot more or less sequence might be fine, or it might mean that a lot of material represented in the linear reference is missing or duplicated. Even within that range, there's still the possibility of things that should be in the graph not being assembled, or things that should be single-copy getting duplicated.