Inconsistencies in hairpins aligned read numbers

cguyomar commented 3 years ago

Hi!

Check Documentation

I have checked the following places for your error:

Description of the bug

In the multiqc report, the number of reads aligned to the hairpin database differs slightly from the number of reads that did not map on the matures. In our real life example we got a ~2M reads difference : Here, we got 22.1M reads not mapped on the mature database. However, the report states that 23.9M sequences were aligned against the hairpin database.

Steps to reproduce

This can be reproduced with the test profile, in a lesser extent and it's not visible in the report because of the low amount of reads :

$ nextflow run nf-core/smrnaseq -profile test,singularity
$ cd work/xx/xxxxx # directory for a hairpin alignment task
$ echo $(zcat sample_1.mature_unmapped.fq.gz | wc -l)/4 | bc
20399

$ samtools flagstat sample_1.hairpin.bam
20411 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
29 + 0 mapped (0.14% : N/A)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (N/A : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (N/A : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Expected behaviour

flagstat shoudl report the same number of reads. This seems to be a problem of multiple alignments incorrectly flagged in the bam file. I assume this also slightly impacts quantification

Log files

Have you provided the following extra information/files:

[x] The command used to run the pipeline
[x] The .nextflow.log file : https://bin.infini.fr/?5e3f1f0f77e33c5e#P1zuP7yGCtUHDfoGiUs25qv62t3g6caHekafwWGvm8B

System

Hardware: Laptop + slurm cluster
Executor: local + slurm
OS: Ubuntu + CentOS

Nextflow Installation

Version: 20.10.0.5430

Container engine

Engine: Singularity
Image tag: nfcore/smrnaseq:1.1.0

nschcolnicov commented 2 weeks ago

I'm looking into this. I agree, since the mapping against the hairpin database uses the unmapped reads from mature as input, after aligning the reads against the mature database, then we should expect that the value in the "M Total Seqs" for hairpin should be identical to the total number of unmapped reads from mature (calculated by taking "M Total Seqs - M Reads Mapped".

nschcolnicov commented 2 weeks ago

@apeltzer @cguyomar @atrigila

Overview of the MIRNA_QUANT Subworkflow

BOWTIE_MAP_MATURE takes mature.fa index files and ch_reads_mirna (FASTQ reads).
BOWTIE_MAP_HAIRPIN takes hairpin.fa index files and ch_reads_hairpin (unmapped reads from bowtie_map_mature).

Flagstat Observations:

When running the test_full profile and looking at the MultiQC file for the c1 sample:

c1_mature: Total sequences: 11.1m (from mature.flagstat file: 11,056,266), mapped: 2.3m (from mature.flagstat file: 2,264,859). Unmapped reads should be 8,791,407 (total - mapped).
- Manual count of reads in unmapped.fq using seqtk confirmed this exact number: 8,791,407 reads. I used: seqtk seq -A c1_mature_unmapped.fq.gz | grep -c "^>
c1_mature_hairpin: Total sequences: 8,917,361 (from mature_hairpin.flagstat file), mapped: 0.5m (from mature_hairpin.flagstat file: 534,537).

Discrepancy:

The total reads reported in the BAM file after mapping with hairpin.fa are higher (8,917,361) than the reads in the original FASTQ file (8,791,407).

Validation:

I confirmed that the mature unmapped FASTQ file used as input for hairpin was being staged properly, the one outputted by the mature alignment, and the one used as input for hairpin alignment were the same:

MD5 checksum for c1_mature_unmapped.fq.gz in both bowtie_map_mature and bowtie_map_hairpin directories:
- 5de444ea52ad9c982c84c1e5ba98788a c1_mature_unmapped.fq.gz

Explanation:

The increase in read count is due to secondary alignments being reported. A Biostars discussion confirmed that secondary alignments often inflate the number of reads reported.

Testing with Different -k Values:

To validate, I re-ran the pipeline with -k 2 instead of -k 50 in the Bowtie modules, and the total number of reads reported in the flagstat file decreased (even though it is still higher than the actual number of reads in the input fastq):

For mature alignment:
- Total reads: 11,051,538
- Mapped reads: 2,260,131
For hairpin alignment:
- Total reads: 8,895,552
- Mapped reads: 512,728

Short Explanation:

The -k option in Bowtie specifies the maximum number of alignments to report per read. Higher values (e.g., -k 50) can result in multiple alignments for a single read, inflating the read count in the BAM file and causing samtools flagstat to report more reads than in the original FASTQ file.

Solution

I don't think we should do anything to modify this behavior, since it is the way we do it on all of the other nf-core pipelines. However, if we really wanted to make the numbers match, we could update the value in the flagstats report generated with the one that we get when calculating the total number of reads in the fastq file, manually.

Let me know what you think!

apeltzer commented 2 weeks ago

Agreed, this should not be "fixed" as its just regular behaviour which can be explained. We may add this summary of yours to the docs to explain, e.g. in a small FAQ?

cguyomar commented 2 weeks ago

Totally agree, thanks for investigating @nschcolnicov

lpantano commented 1 week ago

we could add to samtools stats some filtering options to counts only the primary hits? I think we could do that with extra args for the module and make the stats are more precise. I think that is what we did before we migrated to nfcore. And maybe only make these files available to multiqc for stats instead the once coming from idxstats or flagstats.

Not a big deal though, agree.

apeltzer commented 1 week ago

Could do that and document that this is done so that users understand it / can read up on it, fully agree 👍🏻

nf-core / smrnaseq