Open biozzq opened 2 years ago
Hi,
Which metrics did you take into account to compare the mapping on the graph (vg map, default settings) vs the mapping on the linear genome (with bwa, default settings)? And which software/settings did you use? Last but not least, which organism? I made some tests (with yeast sequences) using the percentage of mapped reads and the percentage of perfect mappings. In both cases the graph worked better.
Best,
Lorenzo
Dear @lt11
Thank you. I used following command to map the short reads to the graph genome which was constructed by using pggb. The graph genome was constructed using the two sets of partially phased contigs generated by hifisam. When mapping to linear reference genome, the BWA-MEM default settings were used. The species i analyzed is the pig. In my side, I'm a little concerned about the quality of the graph genome. So, I want to know how to evaluate it. Thank you for your help.
vg (version v1.38.0) autoindex -g DB1.graph.fa.8c34847.4030258.a9c39c9.smooth.gfa -w giraffe -t 4 -T ./ -p DB1
vg giraffe -Z DB1.giraffe.gbz -m DB1.min -d DB1.dist -f DB_R1.fq.gz -f DB_R2.fq.gz > DB1.gam
Sincerely, Zheng zhuqing
So our approaches are quite different. But I would like to stress that the benefits of using a graph can be quantified using a number of measures (e.g. the total number of mapped reads, the number of reads with perfect matches, …) or specific cases (e.g. regions which are not included in the linear reference). I'm afraid that other measures (e.g. the MAPQ) instead do not really fit this task.
It looks like about a third of your reads are unmapped in that GAM file, which is quite clearly worse than the SAM file.
We haven't really tried Giraffe on PGGB graphs; we've tuned it for Cactus/Minigraph-based graphs mostly. So I think you do need to look at the quality of the graph genome construction, or at least the degree to which it looks like Giraffe expects.
One metric of graph genome quality is of course how well reads can be mapped to it; that metric is pretty low here.
You can also look at the total number of bases in the graph (vg stats -l DB1.giraffe.gbz
); it ought to be within maybe 10% of the total length of your linear reference. If it is much smaller, either the graph is missing sequence, or it has collapsed sequence together that maybe shouldn't be. If it is much larger, then pieces of your assemblies that correspond might not be being aligned.
Another way to evaluate the graph is to look at the GFA, or part of it, in a viewer like Bandage. Does it look like mostly-linear backbones that branch apart to denote variation and then come together again? That's what Giraffe expects. It's OK if the graph is fragmented, but if it is all tied together into one big hairball because e.g. mobile elements or centromeric sequences from different places are collapsed together, Giraffe might not work as well.
The graphs are also only as good as the assemblies; how good are the assemblies? If you want a good graph it can definitely help to make it with at least one reference-quality assembly.
Hi Adam,
Just to be 100% sure I got it correctly: when you wrote "it ought to be within maybe 10% of the total length of your linear reference" you meant that the graph length should be between 90% and 110% of the linear reference?
I'm asking because I'm checking the graph length as a function of the parameters for graph construction. And those numbers make sense also in my case.
Thanks a lot,
Lorenzo
@lt11 Yes, that's what I mean. And it's just a guess; a graph with a lot more or less sequence might be fine, or it might mean that a lot of material represented in the linear reference is missing or duplicated. Even within that range, there's still the possibility of things that should be in the graph not being assembled, or things that should be single-copy getting duplicated.
Dear all,
When comparing graph genome and linear genome in short reads mapping, I found that graph genome did not improve the short reads mapping. Has this phenomenon also occurred in your side? Furthermore, I wonder that how can i evaluate the quality of the graph genome contruction, and which indices can be used as a good measure of the quality. Thank you in advance.
Sincerely, Zheng zhuqing