nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License
923 stars 708 forks source link

Possible redundant rowname assignment in tximport.r #1365

Closed pmoris closed 2 months ago

pmoris commented 2 months ago

I was running the tximport.r code manually while trying to wrap my head around how it works and how it relates to the different types of salmon quant files and gene_count files generated by the pipeline (see https://nfcore.slack.com/archives/CE8SSJV3N/p1724863202019919?thread_ts=1686156473.934039&cid=CE8SSJV3N). While doing so, I stumbled upon the following line of code that seems to be doing nothing?

https://github.com/nf-core/rnaseq/blob/b89fac32650aacc86fcda9ee77e00612a1d77066/bin/tximport.r#L114

As far as I can tell, gene_info is derived from transcript_info which contains the columns "gene_id", "gene_name", but not tx. So, at least when I was trying to reproduce the steps manually, the assignment did nothing (NULL was being assigned to the row names, changing nothing).

Related, the line just above it

https://github.com/nf-core/rnaseq/blob/b89fac32650aacc86fcda9ee77e00612a1d77066/bin/tximport.r#L113

also appears to be redundant. I believe gene_info just ends up holding the contents of transcript_info$gene. Or are there cases where the genes resulting from summarizeToGene can differ from those in transcript_info (which are ultimately derived from the tx2gene file)?

Caveat to all of the above: since I'm actively trying to understand the code and tximport methodology, I might be completely off the mark here ;)

If anyone can confirm, I'd be happy to make a PR for this.

pinin4fjords commented 2 months ago

Note that all of this logic is now moved to an nf-core module (see the dev branch). But the lines you highlight are still there, so I'll explain:

gene_info <- transcript_info\$gene[match(rownames(gi[[1]]), transcript_info\$gene[["gene_id"]]),]

This fetches the gene-level metadata table from the transcript_info list object, and also sorts it to match the gene order in the gi object. The gene metadata table is a uniquified subset of the gene-wise columns from the input metadata table.

So that line is not redundant.

But yes, this line probably is, maybe a relic of prior development. It's not doing any harm, but feel free to PR on nf-core/modules, and once it's in we can bump the module here.

pmoris commented 2 months ago

Cheers, thanks for the explainer!