nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

SortMeRNA database version #1354

Open wzheng0520 opened 4 weeks ago

wzheng0520 commented 4 weeks ago

Description of feature

Hi,

Thanks for providing this wonderful pipeline!

After digging into SortMeRNA and this pipeline, I am curious about the default SortMeRNA database version we are currently using.

According to https://github.com/nf-core/rnaseq/blob/master/assets/rrna-db-defaults.txt, the pipeline points to the old SortMeRNA rRNA databases (SILVA 119). However, starting with SortMeRNA version 4.3.4, newer databases based on SILVA 138 are available, and these allow commercial use. As I understand it, to use these newer rRNA databases we would need to download them first and then supply them to SortMeRNA. Is there any plan to update the default databases in the RNA-seq pipeline so that the newer SILVA release can be used?

Sincerely,
Winnie Zheng

MatthiasZepper commented 3 weeks ago

As far as SortMeRNA itself is concerned, the current module version is 4.3.6, but 4.3.7 is out. So it would be very welcome if you updated the module to the latest version. The pipeline will then most likely pick it up when the next release is due.

"the rRNA database pointed into SortMeRNA old rRNA database version (SILVA 119). However, starting from version 4.3.4 in sortmerna, they started to allow to use newer SILVA database version (SILVA 138), which could allow commercial using."

Sorry, I can't follow here. We are indeed pointing to the references for version 4.3.4 in the pipeline, but those files do not seem to have been updated in the last five years. You can also always supply the ribo_database_manifest parameter to specify your own.
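To illustrate the ribo_database_manifest route mentioned above, here is a minimal Python sketch that writes such a manifest. It assumes the manifest is a plain text file listing one FASTA path or URL per line, which is how the default rrna-db-defaults.txt appears to be laid out. The helper name `write_ribo_manifest` and the file names in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Iterable


def write_ribo_manifest(fasta_paths: Iterable[str], manifest_path: str) -> Path:
    """Write one FASTA path or URL per line to a manifest file.

    Assumption: the pipeline expects one reference per line, mirroring
    the layout of assets/rrna-db-defaults.txt.
    """
    # Drop blank entries and surrounding whitespace.
    entries = [p.strip() for p in fasta_paths if p.strip()]
    if not entries:
        raise ValueError("at least one FASTA path is required")
    manifest = Path(manifest_path)
    manifest.write_text("\n".join(entries) + "\n")
    return manifest
```

Calling `write_ribo_manifest(["/refs/custom_rrna.fasta"], "my-rrna-db.txt")` (hypothetical paths) would produce a file that could then be passed as `--ribo_database_manifest my-rrna-db.txt`, presumably together with `--remove_ribo_rna`.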

Did you consider a reference from another source? And how would commercial use matter - because of some restrictions on the SILVA database?

wzheng0520 commented 3 weeks ago

Hi Matthias,

Thanks for your quick reply!

I wanted to let you know that as of SILVA database version 138 or newer, there are no longer any licensing restrictions on commercial use, according to SILVA's licensing information.

Furthermore, SortMeRNA has updated its database builds based on SILVA 138. Although these are not included with the usual rRNA databases on GitHub, SortMeRNA has provided a download link to the newer SILVA-based databases in response to SortMeRNA issue #282.

The new rRNA content databases include:

smr_v4.3_default_db.fasta
smr_v4.3_fast_db.fasta
smr_v4.3_sensitive_db_rfam_seeds.fasta
smr_v4.3_sensitive_db.fasta

These updates should be useful for your current work.

MatthiasZepper commented 3 weeks ago

Ah, I see! They now distribute the references as an extra asset in selected releases instead of committing them to the main repo. Thanks for pointing this out!

For release 3.16 of the pipeline, we should indeed look into this. However, since the references are compressed into an archive, we can't just update the paths; we would need to implement a download and extraction step in the pipeline (or submit the uncompressed versions to our nf-core test data repo, or mirror them on AWS). I will add this to the roadmap.
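As a rough illustration of what the extraction half of such a step could look like outside the pipeline, here is a minimal Python sketch. It assumes the release asset has already been downloaded as a tar archive containing the FASTA files. The function name `extract_fasta_references` is hypothetical, and this is not how the pipeline implements it.

```python
import tarfile
from pathlib import Path


def extract_fasta_references(archive_path: str, dest_dir: str) -> list[Path]:
    """Extract only FASTA files from a tar archive into dest_dir.

    Files are written flat (directory components inside the archive are
    dropped), which also avoids writing outside dest_dir.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, none).
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            if not member.name.endswith((".fasta", ".fa")):
                continue
            source = archive.extractfile(member)
            if source is None:
                continue
            target = dest / Path(member.name).name
            with source:
                target.write_bytes(source.read())
            extracted.append(target)
    return sorted(extracted)
```

The returned paths could then be written one per line into a manifest file and supplied through ribo_database_manifest.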