How to use `--transcript_length_matrix` with nf-core/rnaseq `--aligner star_rsem` output?

olgabot commented 3 months ago

Description of the bug

Hello, Hope you are doing well! I'm wondering how to take advantage of the transcript length feature (https://github.com/nf-core/differentialabundance/pull/203) when using gene counts created by RSEM. I prefer RSEM as an aligner as I find it to be more specific -- we had some RNA-seq data with plasmids only in certain conditions, and those plasmids got >0 counts in the conditions WITHOUT the plasmids when using salmon :( and didn't have the same issues with RSEM -- but I can't find a *.gene_lengths.tsv file created by RSEM.

From nf-core/rnaseq documentation

STAR via RSEM

Output files

star_rsem/

rsem.merged.gene_counts.tsv: Matrix of gene-level raw counts across all samples.

rsem.merged.gene_tpm.tsv: Matrix of gene-level TPM values across all samples.

rsem.merged.transcript_counts.tsv: Matrix of isoform-level raw counts across all samples.

rsem.merged.transcript_tpm.tsv: Matrix of isoform-level TPM values across all samples.

.genes.results: RSEM gene-level quantification results for each sample.

.isoforms.results: RSEM isoform-level quantification results for each sample.

.STAR.genome.bam: If -save_align_intermeds is specified the original BAM file containing read alignments to the reference genome will be placed in this directory.

.transcript.bam: If -save_align_intermeds is specified the original BAM file containing read alignments to the transcriptome will be placed in this directory.

star_rsem/<SAMPLE>.stat/

.cnt, .model, .theta: RSEM counts and statistics for each sample.

star_rsem/log/

.log: STAR alignment report containing the mapping results summary.

RSEM is a software package for estimating gene and isoform expression levels from RNA-seq data. It has been widely touted as one of the most accurate quantification tools for RNA-seq analysis. RSEM wraps other popular tools to map the reads to the genome (i.e. STAR, Bowtie2, HISAT2; STAR is used in this pipeline) which are then subsequently filtered relative to a transcriptome before quantifying at the gene- and isoform-level. Other advantages of using RSEM are that it performs both the alignment and quantification in a single package and its ability to effectively use ambiguously-mapping reads.

You can choose to align and quantify your data with RSEM by providing the --aligner star_rsem parameter.

How do you advise creating a --transcript_length_matrix file from the RSEM data? Or should we use the TPMs in this case?

Thank you!

Command used and terminal output

No response

Relevant files

No response

System information

No response

olgabot commented 3 months ago

Whelp, I solved my own problem by actually reading the tximport documentation for RSEM. In case it helps someone else, here's the code snippet to get the lengths from the RSEM output from nf-core/rnaseq:

library(tximport)
files = Sys.glob('star_rsem/*.genes.results')
names = c()
for (filename in files) { names = c(names, strsplit(strsplit(filename, '/')[[1]][2], '.genes')[[1]][1]) }
names(files) = names
txi.rsem <- tximport(files, type = "rsem", txIn = FALSE, txOut = FALSE)
head(txi.rsem$length)
write.table(txi.rsem$length, 'rsem_gene_lengths.tsv', sep='\t')

WackerO commented 2 months ago

Hey Olga, glad you solved the issue! Closing this then 🙂

nf-core / differentialabundance