rajewsky-lab / mirdeep2

Discovering known and novel miRNAs from small RNA sequencing data
GNU General Public License v3.0

Difference in mapping rate when aligned with mapper.pl of miRDeep2 and bowtie1 #123

Closed g0656116 closed 5 months ago

g0656116 commented 5 months ago

I aligned the same small RNA-seq data against the same reference genome with miRDeep2 and with bowtie1. miRDeep2 is known to use bowtie1 for mapping, yet the two runs report very different mapping rates: approximately 20% of reads are mapped with miRDeep2, versus 85% with bowtie1.

The options used are as follows.

miRDeep2:

    mapper.pl 2023.fastq -e -h -i -j -m -p /home/song/miRNAseq/bowtie_index/STAR/index -s collapsed_2023.fa -t genome_after_2023.arf -v -o 1

bowtie1:

    bowtie -v 1 -p 4 -S /home/song/miRNAseq/bowtie_index/STAR/index 2023.fastq > ./hg38/38_aligned_data.sam

What causes the difference? I would like to know how to increase the mapping rate when using miRDeep2, and how to analyze miRNA expression levels using the SAM file that bowtie produces.

mschilli87 commented 5 months ago

My guess is you aligned the raw reads using bowtie while miRDeep2 collapses identical reads to speed up the mapping. In sRNA-seq, the exact same read can occur thousands of times (e.g. a highly abundant miRNA + adapters). If you run bowtie on the raw reads, that exact sequence is mapped over and over again, and each time it counts as a mapped read. In the miRDeep2 flow, this sequence is mapped (and counted as mapping) only once. Note that for quantification purposes, miRDeep2 considers the read multiplicity in downstream steps.
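To illustrate the point, here is a minimal Python sketch, not miRDeep2's actual code. It collapses a toy set of reads and compares the mapping rate per unique sequence with the rate weighted by read multiplicity. The reads, the helper names (`collapse`, `multiplicity`, `mapping_rates`), and the choice of which sequence "maps" are all made up for illustration. The sketch assumes collapsed read IDs end in `_x<count>`.

```python
from collections import Counter


def collapse(reads):
    """Collapse identical reads; the ID suffix _x<count> stores the multiplicity."""
    counts = Counter(reads)
    return {f"seq_{i}_x{n}": seq for i, (seq, n) in enumerate(counts.most_common())}


def multiplicity(read_id):
    """Extract the read count from an ID such as seq_0_x8 (assumed ID layout)."""
    return int(read_id.rsplit("_x", 1)[1])


def mapping_rates(collapsed_ids, mapped_ids):
    """Return (rate per unique sequence, rate weighted by multiplicity)."""
    mapped = set(mapped_ids)
    ids = list(collapsed_ids)
    unique_rate = sum(1 for r in ids if r in mapped) / len(ids)
    total_reads = sum(multiplicity(r) for r in ids)
    weighted_rate = sum(multiplicity(r) for r in ids if r in mapped) / total_reads
    return unique_rate, weighted_rate


# Toy data: one abundant sequence seen 8 times plus two singletons.
reads = ["TGAGGTAGTAGGTTGTATAGTT"] * 8 + ["ACGTACGTACGTACGTACGT", "TTTTGGGGCCCCAAAATTTT"]
collapsed = collapse(reads)

# Pretend only the abundant sequence aligns to the genome.
mapped_ids = ["seq_0_x8"]

unique_rate, weighted_rate = mapping_rates(collapsed, mapped_ids)
print(f"per unique sequence: {unique_rate:.2f}, per raw read: {weighted_rate:.2f}")
```

In this toy case one of three unique sequences maps, but it accounts for eight of ten raw reads, so the two rates differ widely even though the underlying alignments are identical.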

g0656116 commented 5 months ago

Thank you very much for the quick reply. Can I ask you a few more questions?

  1. I am analyzing miRNAs_expressed_all_samples.csv, which is output from quantifier.pl. Can I perform downstream analysis with this file? Does this file take the collapsed identical reads (read multiplicity) into account?

  2. And when I run quantifier.pl,

    expression_analyses
    └── expression_analyses_1717759024
        ├── bowtie_mature.out
        ├── bowtie_reads.out
        ├── collapsed_2023.fa.converted
        ├── collapsed_2023.fa_mapped.arf
        ├── collapsed_2023.fa_mapped.bwt
        ├── expression_1717759024.html
        ├── mature.converted
        ├── mature.fa_mapped.arf
        ├── mature.fa_mapped.bwt
        ├── mature2hairpin
        ├── miRBase.mrd
        ├── miRNA_expressed.csv
        ├── miRNA_not_expressed.csv
        ├── miRNA_precursor.1.ebwt
        ├── miRNA_precursor.2.ebwt
        ├── miRNA_precursor.3.ebwt
        ├── miRNA_precursor.4.ebwt
        ├── miRNA_precursor.rev.1.ebwt
        ├── miRNA_precursor.rev.2.ebwt
        ├── precursor.converted
        ├── precursor_not_expressed.csv
        ├── read_occ
        └── rna.ps
    miRNAs_expressed_all_samples_1717759024.csv
    expression_1717759024.html

This output is generated. Why can't I generate output like "expression_analyses/expressionanalyses.csv miRNAs_expressed_all_samplesnormalized.csv"?

Thank you for your reply. It's really helpful for my research.

mschilli87 commented 5 months ago
  1. Yes & yes.
  2. Not 100% sure I got the question but it seems you are looking for -y:
    quantifier.pl [...] -y foo

    will result in file names like expression_foo.html & miRNAs_expressed_all_samples_foo.csv.
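For downstream analysis, a rough Python sketch of how the renamed table could be located and loaded. The file-name pattern comes from the reply above. Everything else is an assumption for illustration: the helper names, the tab-separated layout, the column names (`#miRNA`, `read_count`, `precursor`), and the toy rows. Check the header of your own file before relying on them.

```python
import csv
import tempfile
from pathlib import Path


def output_names(suffix):
    """File names described in this thread for `quantifier.pl -y <suffix>`."""
    return (f"miRNAs_expressed_all_samples_{suffix}.csv", f"expression_{suffix}.html")


def load_rows(path):
    """Read the table as tab-separated text (assumed layout); one dict per row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


csv_name, html_name = output_names("foo")

# Toy stand-in for a real quantifier.pl table; header and rows are made up.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / csv_name
    table.write_text("#miRNA\tread_count\tprecursor\nmir-a\t8\tpre-a\nmir-b\t3\tpre-b\n")
    rows = load_rows(table)

total_reads = sum(float(row["read_count"]) for row in rows)
print(csv_name, html_name, total_reads)
```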

g0656116 commented 5 months ago

Thank you so much. It was a great help. I'll come back if I have any other questions!