nf-core / rnaseq

RNA sequencing analysis pipeline using STAR, RSEM, HISAT2 or Salmon with gene/isoform counts and extensive quality control.
https://nf-co.re/rnaseq
MIT License

Update SortMeRNA to use SilvaDB 138 (for commercial use) #570

Closed nh13 closed 3 years ago

nh13 commented 3 years ago

SilvaDB release 138 is now available for commercial use! See: https://www.arb-silva.de/silva-license-information/

drpatelh commented 3 years ago

Hi @nh13! Hope you are well! SortMeRNA is one of those tools for which I would like to plead ignorance because I have never used it 😅 How can we accommodate this information into the pipeline? I am aware of issues with run-times as highlighted here but that's off topic.

We do have a parameter that allows you to override the default databases you provide to the pipeline i.e. --ribo_database_manifest but I suspect that's off topic too?
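For anyone wanting to try that override with their own SILVA 138 download, here is a minimal sketch of building such a manifest. It assumes the manifest is a plain text file with one FASTA path per line; the function name and the FASTA file names are hypothetical placeholders, not files shipped with the pipeline.

```python
from pathlib import Path
import tempfile


def write_ribo_manifest(fasta_paths, manifest_path):
    """Write one FASTA path per line (assumed manifest layout)."""
    # Drop empty entries and surrounding whitespace so the file stays clean.
    lines = [str(p).strip() for p in fasta_paths if str(p).strip()]
    if not lines:
        raise ValueError("manifest needs at least one FASTA path")
    manifest = Path(manifest_path)
    manifest.write_text("\n".join(lines) + "\n")
    return manifest


# Hypothetical locations of locally downloaded SILVA 138 FASTA files.
EXAMPLE_FASTAS = [
    "/refs/silva138/silva138_ssu.fasta",
    "/refs/silva138/silva138_lsu.fasta",
]

if __name__ == "__main__":
    out = write_ribo_manifest(
        EXAMPLE_FASTAS, Path(tempfile.gettempdir()) / "rrna-db-silva138.txt"
    )
    print(out.read_text(), end="")
```

The resulting file would then be passed to the pipeline via --ribo_database_manifest; whether the current SortMeRNA version indexes SILVA 138 without problems is untested here.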

So based on my deductions I am assuming you mean we change the sentences here and here?

nh13 commented 3 years ago

@drpatelh

It'd be great if SortMeRNA could update them (see this issue), but for nf-core I'd expect the default databases to be usable commercially. Also, the SortMeRNA databases are very old (29/11/2014), but like you, I "neither have the time nor the inclination" to update them 😆 !

So why not just align to the full SilvaDB release 138, which allows for both commercial and non-commercial use by default? Isn't it also more comprehensive than the current set? Perhaps some RNA-Seq analysis experts could weigh in?

drpatelh commented 3 years ago

I am fairly well versed on the dark side of RNA-seq analysis but I fear this issue falls into the even darker realm of classify my DNA/RNA-type voodoo magic. @apeltzer what do we need to sacrifice here?

@drejom !! Been a while!

drpatelh commented 3 years ago

I just saw that you edited the issue @drejom 😂 Fate...hope you are well!

drejom commented 3 years ago

I am! Just a pandemic and an insurrection between drinks! Looking forward to a UK visit….one day!

drpatelh commented 3 years ago

Ping @d4straub @apeltzer. Any ideas how we can incorporate this information into the pipeline? I am planning on getting a release together over the next couple of weeks. Can include this if it's an easy fix. Thanks!

apeltzer commented 3 years ago

@d4straub is the person to ask - not too much experience on SortMeRNA / SILVA either, sorry :-(

d4straub commented 3 years ago

Updating to v4.3.1 would improve runtime, see https://github.com/biocore/sortmerna/releases/tag/v4.3.1. The SILVA database might also have been updated to v138 in v4.3.1, since it was mentioned earlier for 4.2 that the "next release" would come with SILVA v138. I will investigate this next week.

drpatelh commented 3 years ago

So I made a concerted effort to use the latest Biocontainer, thinking I could just swap out the container and put my feet up because everything else in the process would just work. No no... A couple of hours later, after hitting segmentation faults and various issues where downstream processes in the pipeline failed because corrupt FastQ files were being generated, I gave up to do something else. I also tried to get it to generate uncompressed FastQ files, which I could zip after the process, using the --zip-out parameter. The inline help comments are here but the value evaluation takes completely different types of parameters as defined here. I tried all of those values but had no success. I may be missing something stupendously obvious, but it appears that bumping the version is going to be more hassle than it's worth. It would be great if someone else could confirm!

The module file is here

nh13 commented 3 years ago

It may be a better solution to just use bowtie/bwa/etc to align to the rRNA sequences directly and remove those that have any valid mappings. SortMeRNA is still quite slow.
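A rough sketch of the filtering half of that idea, assuming the reads have already been aligned to an rRNA reference with some aligner and the result is available as SAM text. It is single-end only and keeps everything in memory; the function names and the toy records are made up for illustration.

```python
import io


def mapped_read_names(sam_lines):
    """Collect names of reads with a valid mapping (SAM flag bit 0x4 unset)."""
    names = set()
    for line in sam_lines:
        # Skip SAM header lines and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue
        if not int(fields[1]) & 0x4:
            names.add(fields[0])
    return names


def filter_fastq(fastq_lines, drop_names):
    """Yield 4-line FASTQ records whose read name is not in drop_names."""
    it = iter(fastq_lines)
    for header in it:
        if not header.strip():
            continue
        record = [header, next(it, ""), next(it, ""), next(it, "")]
        # Read name is the first token after the leading '@'.
        name = header[1:].split()[0]
        if name not in drop_names:
            yield "".join(record)


if __name__ == "__main__":
    # Toy data: read1 maps to an rRNA reference, read2 is unmapped (flag 4).
    sam = (
        "@HD\tVN:1.6\n"
        "read1\t0\trRNA_18S\t10\t42\t5M\t*\t0\t0\tACGTA\tIIIII\n"
        "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTT\tIIIII\n"
    )
    fastq = "@read1\nACGTA\n+\nIIIII\n@read2\nTTTTT\n+\nIIIII\n"
    rrna_reads = mapped_read_names(io.StringIO(sam))
    print("".join(filter_fastq(io.StringIO(fastq), rrna_reads)), end="")
```

Paired-end data would need both mates dropped together, and a real run would stream from the aligner rather than hold the name set in memory.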

drpatelh commented 3 years ago

Yup. The newer releases were supposed to address this but it appears that we are now just seeing a different set of issues 😅

A metagenomics-classifier-type approach using Kraken2 would be quite cool too: it would bypass the mapping and generate filtered FastQ files directly. It may not be as sensitive as mapping if done loosely, but I think it would do the trick.

I used to run RNA-SeQC for the longest time to get rRNA estimates as a QC metric and then to deal with the counts appropriately downstream if required, before the differential analysis. This pipeline also generates a feature biotypes plot with this info in the MultiQC report. Personally, I think that is the best way and bypasses the need to do any FastQ filtering at all. It appears the links are broken on the RNA-SeQC website too - not doing very well. Time to shut the lid!
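To illustrate the rRNA-estimate-as-QC-metric idea above, here is a small sketch that turns per-biotype read counts into an rRNA fraction. The counts are invented, and the biotype labels ("rRNA", "Mt_rRNA") are an assumption that depends on the annotation in use; this is not how the pipeline or RNA-SeQC computes it.

```python
def rrna_fraction(biotype_counts, rrna_biotypes=("rRNA", "Mt_rRNA")):
    """Return the share of assigned reads that fall in rRNA biotypes."""
    total = sum(biotype_counts.values())
    if total == 0:
        return 0.0
    rrna = sum(biotype_counts.get(b, 0) for b in rrna_biotypes)
    return rrna / total


if __name__ == "__main__":
    # Invented counts for one sample, keyed by gene biotype.
    counts = {"protein_coding": 900, "lncRNA": 0, "rRNA": 80, "Mt_rRNA": 20}
    print(f"rRNA fraction: {rrna_fraction(counts):.1%}")
```

A sample with a high fraction can then be flagged, or its rRNA gene counts excluded, before the differential analysis instead of filtering the FastQ files.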

Have a good evening!

d4straub commented 3 years ago

> It may be a better solution to just use bowtie/bwa/etc to align to the rRNA sequences directly and remove those that have any valid mappings. SortMeRNA is still quite slow.

This might work more or less for an isolate but not for environmental samples (i.e. a mixture of organisms with previously unknown rRNA sequences); here SortMeRNA has advantages. That was my intention when adding SortMeRNA: to make this pipeline fit for metatranscriptomics.

Your tests @drpatelh suggest that it might be better to just stay with version 4.2.0 (despite being slow, but at least not breaking the pipeline, correct?) and attempt to just change the database to SILVA 138 to allow commercial use. Would that sound fine to you?

drpatelh commented 3 years ago

> Your tests @drpatelh suggest that it might be better to just stay with version 4.2.0 (despite being slow, but at least not breaking the pipeline, correct?)

I think this may be the path of least resistance given that the latest release still seems quite buggy and most people aren't using this option when running the pipeline. It would be great if you have some time to confirm this is the case. Bumping the version in the SortMeRNA module and running nextflow run nf-core/rnaseq ..... -r dev should reproduce the errors. Don't worry if you don't have time.

Yup, if we can't update the software version maybe it is worth updating the SILVA databases which I assume are independent and won't break anything with the current tool version in the pipeline (or make it even sloooooooower)?

drpatelh commented 3 years ago

The latest version of SortMeRNA (v4.3.4) is now working smoothly via a simple update of the existing nf-core/module. It now also supports native compression of output files which is nice. I believe the databases have also been updated as of >4.2.0 as mentioned here so will close this issue!