Open cguyomar opened 3 years ago
I'm looking into this. I agree, since the mapping against the hairpin database uses the unmapped reads from mature as input, after aligning the reads against the mature database, then we should expect that the value in the "M Total Seqs" for hairpin should be identical to the total number of unmapped reads from mature (calculated by taking "M Total Seqs - M Reads Mapped".
@apeltzer @cguyomar @atrigila
mature.fa
index files and ch_reads_mirna
(FASTQ reads).hairpin.fa
index files and ch_reads_hairpin
(unmapped reads from bowtie_map_mature).When running the test_full
profile and looking at the MultiQC file for the c1
sample:
c1_mature
: Total sequences: 11.1m (from mature.flagstat
file: 11,056,266), mapped: 2.3m (from mature.flagstat
file: 2,264,859). Unmapped reads should be 8,791,407 (total - mapped).
unmapped.fq
using seqtk
confirmed this exact number: 8,791,407 reads. I used:
seqtk seq -A c1_mature_unmapped.fq.gz | grep -c "^>
c1_mature_hairpin
: Total sequences: 8,917,361 (from mature_hairpin.flagstat
file), mapped: 0.5m (from mature_hairpin.flagstat
file: 534,537).
The total reads reported in the BAM file after mapping with hairpin.fa
are higher (8,917,361) than the reads in the original FASTQ file (8,791,407).
I confirmed that the mature unmapped
FASTQ file used as input for hairpin
was being staged properly, the one outputted by the mature alignment, and the one used as input for hairpin alignment were the same:
c1_mature_unmapped.fq.gz
in both bowtie_map_mature
and bowtie_map_hairpin
directories:
5de444ea52ad9c982c84c1e5ba98788a c1_mature_unmapped.fq.gz
The increase in read count is due to secondary alignments being reported. A Biostars discussion confirmed that secondary alignments often inflate the number of reads reported.
To validate, I re-ran the pipeline with -k 2
instead of -k 50
in the Bowtie modules, and the total number of reads reported in the flagstat
file decreased (even though it is still higher than the actual number of reads in the input fastq):
For mature
alignment:
For hairpin
alignment:
The -k
option in Bowtie specifies the maximum number of alignments to report per read. Higher values (e.g., -k 50
) can result in multiple alignments for a single read, inflating the read count in the BAM file and causing samtools flagstat
to report more reads than in the original FASTQ file.
I don't think we should do anything to modify this behavior, since it is the way we do it on all of the other nf-core pipelines. However, if we really wanted to make the numbers match, we could update the value in the flagstats report generated with the one that we get when calculating the total number of reads in the fastq file, manually.
Let me know what you think!
Agreed, this should not be "fixed" as its just regular behaviour which can be explained. We may add this summary of yours to the docs to explain, e.g. in a small FAQ?
Totally agree, thanks for investigating @nschcolnicov
we could add to samtools stats some filtering options to counts only the primary hits? I think we could do that with extra args for the module and make the stats are more precise. I think that is what we did before we migrated to nfcore. And maybe only make these files available to multiqc for stats instead the once coming from idxstats or flagstats.
Not a big deal though, agree.
Could do that and document that this is done so that users understand it / can read up on it, fully agree 👍🏻
Hi!
Check Documentation
I have checked the following places for your error:
Description of the bug
In the multiqc report, the number of reads aligned to the hairpin database differs slightly from the number of reads that did not map on the matures. In our real life example we got a ~2M reads difference : Here, we got 22.1M reads not mapped on the mature database. However, the report states that 23.9M sequences were aligned against the hairpin database.
Steps to reproduce
This can be reproduced with the test profile, in a lesser extent and it's not visible in the report because of the low amount of reads :
Expected behaviour
flagstat shoudl report the same number of reads. This seems to be a problem of multiple alignments incorrectly flagged in the bam file. I assume this also slightly impacts quantification
Log files
Have you provided the following extra information/files:
.nextflow.log
file : https://bin.infini.fr/?5e3f1f0f77e33c5e#P1zuP7yGCtUHDfoGiUs25qv62t3g6caHekafwWGvm8BSystem
Nextflow Installation
Container engine