SortMeRNA Timeouts - Githubissues

nickp60 commented 9 months ago

We have a small proportion of the samples failing the SortMeRNA step due to cluster timeouts. This step should run pretty fast, but in certain cases which we have not fully identified, it takes much longer. Here is what we know.

This appears to happen in very few samples (8 of ~2k).
It can be resolved by increasing the number of shards, which decreases the size of the files provided to SortMeRNA.
It can be resolved by converting the FASTQs of problematic samples to FASTA; it is unclear whether this is due to something in how SortMeRNA handles sequence quality, or due to the reduced input file size.
Logging in SortMeRNA hangs when jobs are submitted with bsub
- but logging works fine if the job is submitted with a single thread
- and works fine when bsubbing an interactive job and running the same command
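
For anyone who wants to try the two workarounds above together (more shards, and FASTA instead of FASTQ), here is a minimal sketch. It assumes plain 4-line FASTQ records; the function names `read_fastq` and `to_fasta_shards` are made up for illustration and are not part of the pipeline or of SortMeRNA.

```python
import itertools
from typing import Iterable, Iterator, List, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) from 4-line FASTQ records, dropping qualities."""
    it = iter(lines)
    while True:
        record = list(itertools.islice(it, 4))
        if len(record) < 4:
            break  # end of input (or a truncated trailing record)
        header = record[0].rstrip("\n")
        seq = record[1].rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"unexpected FASTQ header: {header!r}")
        yield header[1:], seq


def to_fasta_shards(lines: Iterable[str], n_shards: int) -> List[List[str]]:
    """Convert FASTQ lines to FASTA records, dealt round-robin into n_shards."""
    shards: List[List[str]] = [[] for _ in range(n_shards)]
    for i, (read_id, seq) in enumerate(read_fastq(lines)):
        shards[i % n_shards].append(f">{read_id}\n{seq}\n")
    return shards
```

Because reads are dealt out by position, running this on R1 and R2 files with the same shard count should keep mates in the same shard index, provided both files list reads in the same order.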

nickp60 commented 9 months ago

The most similar issue on their repo is this one: https://github.com/sortmerna/sortmerna/issues/368

nickp60 commented 9 months ago

I have some (n=1) evidence that manually setting LSF_TMPDIR prior to bsubbing a job decreases the runtime by about 25%.
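
A rough sketch of what "setting LSF_TMPDIR prior to bsubbing" could look like when submission is scripted. The helper name `bsub_with_tmpdir`, the scratch path, and the job script name are all hypothetical; the only things taken from the observation above are the `LSF_TMPDIR` variable and the `bsub` command.

```python
import os
from typing import Dict, List, Optional, Tuple


def bsub_with_tmpdir(
    command: List[str],
    tmpdir: str,
    base_env: Optional[Dict[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv and environment for submitting `command` through bsub.

    LSF_TMPDIR is set in the submitting environment, mirroring a manual
    `export LSF_TMPDIR=...` before running bsub.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["LSF_TMPDIR"] = tmpdir
    return ["bsub", *command], env


# Hypothetical usage (paths and script name are placeholders):
#   import subprocess
#   argv, env = bsub_with_tmpdir(["bash", "run_sortmerna.sh"], "/scratch/tmp")
#   subprocess.run(argv, env=env, check=True)
```

With n=1 this is only worth trying as an experiment; the sketch does not verify that the directory exists or that the cluster honors the variable.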

nickp60 commented 9 months ago

Pre-indexing the reference seems to restore (at least partial) logging to multicore bsubbed jobs. It fixes the hanging observed in their issue 364, but the logging doesn't appear in real time after that. It still hangs, just at a later step.

miraep8 commented 8 months ago

Hello!

Just posting here some results from comparing sortmerna's performance with two alternatives, bbduk and RiboDetector, with the idea of potentially replacing sortmerna with another tool because of these issues.

Comparison of various tools for rRNA identification and sorting:

2.6.24

Tools under consideration:

Tested tools:

SortMeRNA
- - Somewhat sporadically maintained, with some potential issues that we have run into.
- + Despite recent issues, it has been around for a while and is a common suite of tools that other packages compare themselves to.
- ~ It is better able to handle mismatches. If the main goal is removing most rRNA samples before proceeding with analysis, this may not be needed and the speed costs may not be worth it. It is more useful for applications where you want to separate the rRNA and don't want to risk missing any.
RiboDetector
- - A newer tool that doesn't seem to be widely used yet (which could be more a reflection of time than merit).
- + Open source and still maintained.
- + Can be used out of the box without need to download a db.
bbduk
- + Part of the popular/well maintained BBTools suite.
- + Recommended tool on a lot of the biotools forums.

Other tools (I didn't test these after seeing the performance of bbduk):

Infernal
- + Has been around for a while and still is maintained and updated.
- - Folks comment that it is a bit slow, and it might be overkill for our purposes.
barrnap
- - Doesn't seem to be actively maintained.

Metrics for comparison:

Time: just using the shell's built-in time function to estimate how long each execution takes.
Consistency with one another: currently just plotting the % of each test file that is called as rRNA/not rRNA. Since I have kept the files, we could also test on a read-by-read basis and see how much agreement there is.
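
The two metrics, plus the read-by-read agreement idea, could be sketched roughly as below. This assumes each tool's rRNA calls have already been reduced to a set of read IDs; the function names are illustrative, and the Jaccard index is just one reasonable choice of agreement measure, not something the tests above used.

```python
import subprocess
import sys
import time
from typing import List, Set


def time_command(argv: List[str]) -> float:
    """Return the wall-clock seconds taken by one command."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def percent_rrna(rrna_ids: Set[str], total_reads: int) -> float:
    """Percentage of reads in a sample that a tool called as rRNA."""
    return 100.0 * len(rrna_ids) / total_reads


def agreement(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of two tools' rRNA calls: shared calls / all calls."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0
```

For example, `agreement(sortmerna_ids, bbduk_ids)` close to 1 would mean the two tools flag nearly the same reads, which is a stronger statement than their overall percentages being similar.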

Tests:

These pilot tests were run on just three samples:
- one of Tyler's samples from project 3 that was running into issues with SortMeRNA,
- one of Madhu's metatranscriptomic samples, which has a lot of rRNA,
- one of Pamela's metagenomic samples that didn't run into any issues with SortMeRNA and shouldn't have an overabundance of rRNA.
Setup:
- each tool has its own custom singularity container used to run all the tests.
- for the tools which require a reference (sortmerna and bbduk) I used a collection of ribosomal kmers that Brian Bushnell put together a few years ago (linked here). This was just for preliminary analysis. I am also in the process of creating an updated db we could use, built from the SILVA db entries (for both the LSU and SSU) for the most common microbiota phyla.

Results:

Time:

No doubt the bbduk pipeline is much faster than the other options!

Tool	Sample 1 (seconds)	Sample 2 (seconds)	Sample 3 (seconds)
RiboDetector	2528	3831	7443
bbduk	24	44	157
sortmerna	1690	2570	5042
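
To put a number on "much faster", the ratios can be worked out directly from the table above. The snippet only restates those timings; the dictionary layout and the helper name are illustrative.

```python
from typing import Dict, List

# Wall-clock seconds per sample, copied from the timing table.
seconds: Dict[str, List[int]] = {
    "RiboDetector": [2528, 3831, 7443],
    "bbduk": [24, 44, 157],
    "sortmerna": [1690, 2570, 5042],
}


def slowdown_vs_bbduk(tool: str) -> List[float]:
    """How many times longer `tool` took than bbduk, per sample."""
    return [round(t / b, 1) for t, b in zip(seconds[tool], seconds["bbduk"])]


sortmerna_ratio = slowdown_vs_bbduk("sortmerna")        # [70.4, 58.4, 32.1]
ribodetector_ratio = slowdown_vs_bbduk("RiboDetector")  # [105.3, 87.1, 47.4]
```

So on these three samples bbduk was roughly 32x to 70x faster than sortmerna and roughly 47x to 105x faster than RiboDetector.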

Percent rRNA:

Tool	Sample 1 (% rRNA)	Sample 2 (% rRNA)	Sample 3 (% rRNA)
RiboDetector	50.31	43.2	0.37
bbduk	55.91	55.10	0.62
sortmerna	53.43	55.94	0.51
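
A quick way to see how closely the tools agree is to compute, per sample, the spread between the highest and lowest call and the sortmerna-minus-bbduk difference. The values are copied from the table above; the variable names are illustrative.

```python
from typing import Dict, List

# Percent of reads called rRNA per sample, copied from the table.
percent: Dict[str, List[float]] = {
    "RiboDetector": [50.31, 43.2, 0.37],
    "bbduk": [55.91, 55.10, 0.62],
    "sortmerna": [53.43, 55.94, 0.51],
}

# Spread across all three tools for each sample (max minus min), in points.
spread = [round(max(col) - min(col), 2) for col in zip(*percent.values())]

# Pairwise difference for the two tools that shared a reference file.
sortmerna_minus_bbduk = [
    round(s - b, 2) for s, b in zip(percent["sortmerna"], percent["bbduk"])
]
# spread                -> [5.6, 12.74, 0.25]
# sortmerna_minus_bbduk -> [-2.48, 0.84, -0.11]
```

The largest disagreement is on sample 2, where RiboDetector's call is about 12.7 percentage points below the other two; sortmerna and bbduk stay within about 2.5 points of each other on all three samples.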

Conclusions:

Personally, for the mtx pipeline I plan to move forward with the bbduk option (though first I want to be sure that this approach is not accidentally calling mRNAs as rRNAs). The speed, combined with the fact that it seems to call approximately the same percentage of reads as the other options, makes sense for that use case. I would expect the false positive rate (i.e. the rate at which mRNAs are accidentally called as rRNAs) to be rather low with this approach.

I suspect that the bbduk tool might work well for the other pipelines as well. I should note that the RiboDetector approach could definitely be improved upon! For example, if we made use of the GPU acceleration it could get much faster, and if we needed a potentially more conservative tool it could be a good option (or of course we could use bbduk with a more conservative reference file as well...).

Note that sortmerna and bbduk used the same reference file, so they are the easiest to compare one to one. It is interesting to me that sortmerna calls slightly fewer reads as rRNA in two of the three samples; that is almost the opposite of what I would have naively expected, though perhaps this can be tuned with different values of e (the E-value threshold).

And yes, also noting here that, as Nick pointed out, the tests may indicate that one explanation for sortmerna's failure could simply be that it takes too long on samples with a high percentage of rRNA. This underscores that it likely wouldn't work as well for the metatranscriptomics samples, but might still work ok for metagenomics samples. Just noting as part of the food for thought!

Cheers!

nickp60 commented 6 months ago

Closing this issue for now; the long-running datasets in question appear to be 16S data mislabeled as shotgun metagenomes.

vdblab / vdblab-shotgun

SortMeRNA Timeouts #75
