Questions about read alignment in GAM and BAM format

b524198065 commented 3 years ago

Dear developers:

Hello and recently I was using VG to make a variant graph of a plant species and I want to compare the number of reads of one accession mapped onto the graph-based genome via vg giraffe and the linear reference genome via bwa-mem.

The read mapping statistics of bwa-mem was compute by samtools flagstat. However, when I use -o GAM and -o BAM parameters in vg giraffe mapping tools and use vg stats -a .gam and samtools flagstat bam, they returned different statistics regrading total alignments, total mapped reads, etc. How can I make a sensible statistics about read mapping using the graph-based genome?

Appreciated, Hongbo

jeizenga commented 3 years ago

Could you share the specific results you get from the two methods? That might help us narrow down the possibilities.

b524198065 commented 3 years ago

Well the two resulted alignments are in big sizes (several Giga Bytes), is there any proper way to share them?

jeizenga commented 3 years ago

Sorry, I wasn't very clear about that. What I meant was: could you share the specific vg and samtools commands you used and also what the mismatching output was?

b524198065 commented 3 years ago

OK, my vg giraffe command lines are (for generating alignments in BAM and GAM formats, respectively):

vg giraffe -t 100 -o BAM -f C408/_1.clean.fq.gz -f C408/_2.clean.fq.gz -x test1.xg -g test1.gg -H test1.gbwt -m test1.min -d test1.dist > testgiraffe1_C408.bam vg giraffe -t 100 -o GAM -f C408/_1.clean.fq.gz -f C408/_2.clean.fq.gz -x test1.xg -g test1.gg -H test1.gbwt -m test1.min -d test1.dist > testgiraffe1_C408.gam

My mapping statistics command lines are:

samtools flagstat testgiraffe1_C408.bam > testgiraffe1_C408.bam.stat vg stats -a testgiraffe1_C408.gam > testgiraffe1_C408.gam.stat

Here are the *.stat results for BAM and GAM, and I use asterisks to mark the conflict statistics results: By the way, what does "Total perfect" means in GAM output?

############ BAM ############## 83421910 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates

56002323 + 0 mapped (67.13% : N/A) 83421910 + 0 paired in sequencing 41710955 + 0 read1 41710955 + 0 read2 53210516 + 0 properly paired (63.78% : N/A) 55016828 + 0 with itself and mate mapped 985495 + 0 singletons (1.18% : N/A) 69722 + 0 with mate mapped to a different chr 42406 + 0 with mate mapped to a different chr (mapQ>=5)

############ GAM ############## Total alignments: 83421910 Total primary: 83421910 Total secondary: 0

Total aligned: 81088548 Total perfect: 7438421 Total gapless (softclips allowed): 59361421 Insertions: 36500224 bp in 13971863 read events Deletions: 52716761 bp in 17928758 read events Substitutions: 411606257 bp in 411606257 read events Softclips: 862536981 bp in 21071526 read events

Thanks

jeizenga commented 3 years ago

In VG, "perfect" means that there are no mismatches, indels, or soft clips in the alignment. The discrepancy in the number of mapped reads is unusual though. Are you using a graph that has many large insertions or highly divergent sequences? One way that a read can be mapped in the GAM but not in the BAM is if the graph mapping has no overlap with the BAM's sequences (i.e. the SQ lines from the header). In those cases, there is no unambiguous conversion from the graph alignment to the linear alignment, and the read will be reported as unmapped.

b524198065 commented 3 years ago

Hi, thanks for your reply. The graph is actually complex with many divergent sequences/insertions. And I used bwa-mem to map the same reads to the linear reference genome and used samtools flagstat to generate a statistics, which returned very different results compared with vg giraffe BAM output. Here is the flagstat results:

As you can see the total alignments and total aligned reads were both higher than those in vg GAM/BAM statistics. Is it proper to make a direct comparison between bwa and vg output mapping results? Or if you have some other methods to indicate that the number of reads (PE) that can be properly mapped in graph-based genome is greater than that in linear reference genome?

Thanks

####### BWA-MEM BAM 86562877 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 3140967 + 0 supplementary 0 + 0 duplicates 85361415 + 0 mapped (98.61% : N/A) 83421910 + 0 paired in sequencing 41710955 + 0 read1 41710955 + 0 read2 66652608 + 0 properly paired (79.90% : N/A) 81704008 + 0 with itself and mate mapped 516440 + 0 singletons (0.62% : N/A) 12151794 + 0 with mate mapped to a different chr 6415031 + 0 with mate mapped to a different chr (mapQ>=5)

jeizenga commented 3 years ago

We've added an annotation for perfect pairing into GAM alignments from vg map and vg mpmap if you rebuild from our master branch and realign. The count will be reported using vg stats -a as well. Will that be enough for your use case?

netwon123 commented 1 year ago

Hello, could u have standarded comparison between the result of vg giraffe and bwa ?

vgteam / vg

Questions about read alignment in GAM and BAM format #3190