vdblab / vdblab-shotgun

Shotgun metagenomic sequencing processing pipeline
MIT License

SortMeRNA Timeouts #75

Closed nickp60 closed 5 months ago

nickp60 commented 8 months ago

We have a small proportion of the samples failing the SortMeRNA step due to cluster timeouts. This step should run pretty fast, but in certain cases which we have not fully identified, it takes much longer. Here is what we know.

nickp60 commented 8 months ago

The most similar issue on their repo is this one: https://github.com/sortmerna/sortmerna/issues/368

nickp60 commented 8 months ago

I have some (n=1) evidence that manually setting LSF_TMPDIR prior to bsubbing a job decreases the runtime by about 25%.
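For anyone wanting to repeat that test, here is a minimal sketch of one way to set LSF_TMPDIR in the submitting environment before calling bsub. The script name `run_sortmerna.sh`, the scratch path, and the core and wall-time values are hypothetical placeholders, not the pipeline's actual settings.

```python
import os


def build_bsub_job(script, tmpdir, cores=8, wall_minutes=120):
    """Return (argv, env) for a bsub submission with LSF_TMPDIR set.

    Nothing is submitted here; the caller decides whether to run it.
    """
    env = dict(os.environ)          # copy so the parent environment is untouched
    env["LSF_TMPDIR"] = tmpdir      # the variable being tested in this thread
    argv = [
        "bsub",
        "-n", str(cores),           # number of job slots
        "-W", str(wall_minutes),    # run-time limit in minutes
        script,                     # hypothetical wrapper script
    ]
    return argv, env


# Hypothetical scratch directory and wrapper script, for illustration only.
argv, env = build_bsub_job("./run_sortmerna.sh", tmpdir="/scratch/sortmerna_tmp")
print(" ".join(argv))
print("LSF_TMPDIR =", env["LSF_TMPDIR"])
# To actually submit: subprocess.run(argv, env=env, check=True)
```

Submitting the same sample with and without the variable set, several times each, would show whether the 25% difference holds beyond n=1.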

nickp60 commented 8 months ago

Pre-indexing the reference seems to restore (at least partial) logging to multicore bsubbed jobs. It fixes the hanging observed in their issue 364, but after that the logging doesn't appear in real time. The job still hangs, just at a later step.

miraep8 commented 7 months ago

Hello!

Just posting some results from comparing sortmerna's performance with two alternatives, bbduk and RiboDetector, with the idea of potentially replacing sortmerna with another tool because of these issues.

Comparison of various tools for rRNA identification and sorting:

2.6.24

Tools under consideration:

Tested tools:

Other tools (I didn't test these after seeing the performance of bbduk):

Metrics for comparison:

Tests:

Results:

Time:

No doubt the bbduk pipeline is much faster than the other options!

| Tool | Sample 1 (seconds) | Sample 2 (seconds) | Sample 3 (seconds) |
| --- | --- | --- | --- |
| RiboDetector | 2528 | 3831 | 7443 |
| bbduk | 24 | 44 | 157 |
| sortmerna | 1690 | 2570 | 5042 |
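To put a number on "much faster", the runtimes above can be turned into per-sample speedup factors. This is just arithmetic on the reported values; the helper name `speedup` is made up for this sketch.

```python
# Wall-clock runtimes in seconds, copied from the timing results above.
runtimes = {
    "RiboDetector": [2528, 3831, 7443],
    "bbduk": [24, 44, 157],
    "sortmerna": [1690, 2570, 5042],
}


def speedup(slow, fast):
    """Per-sample ratio of the slower tool's runtime to the faster tool's."""
    return [round(s / f, 1) for s, f in zip(runtimes[slow], runtimes[fast])]


print("sortmerna / bbduk:   ", speedup("sortmerna", "bbduk"))
print("RiboDetector / bbduk:", speedup("RiboDetector", "bbduk"))
```

On these three samples bbduk comes out roughly 30 to 70 times faster than sortmerna, with the smallest ratio on the longest-running sample.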

Percent rRNA:

| Tool | Sample 1 (% rRNA) | Sample 2 (% rRNA) | Sample 3 (% rRNA) |
| --- | --- | --- | --- |
| RiboDetector | 50.31 | 43.2 | 0.37 |
| bbduk | 55.91 | 55.10 | 0.62 |
| sortmerna | 53.43 | 55.94 | 0.51 |
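A quick way to see how far the tools disagree is to take each tool's percentage minus sortmerna's, in percentage points. The sketch below only uses the values reported above; `diff_vs_sortmerna` is an illustrative name.

```python
# Percent of reads called as rRNA, copied from the results above.
pct_rrna = {
    "RiboDetector": [50.31, 43.2, 0.37],
    "bbduk": [55.91, 55.10, 0.62],
    "sortmerna": [53.43, 55.94, 0.51],
}


def diff_vs_sortmerna(tool):
    """Percentage-point difference per sample (tool minus sortmerna)."""
    return [round(t - s, 2) for t, s in zip(pct_rrna[tool], pct_rrna["sortmerna"])]


print("bbduk - sortmerna:       ", diff_vs_sortmerna("bbduk"))
print("RiboDetector - sortmerna:", diff_vs_sortmerna("RiboDetector"))
```

A positive value means the tool called more reads as rRNA than sortmerna did on that sample. With only three samples this is a rough comparison rather than a benchmark.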

Conclusions:

Personally, for the mtx pipeline I plan to move forward with the bbduk option (though first I want to be sure that this approach is not accidentally calling mRNAs as rRNAs). Its speed, combined with the fact that it calls approximately the same percentage of reads as rRNA as the other options, makes sense for that use case. I would expect the false positive rate (i.e. the rate at which mRNAs are accidentally called as rRNAs) to be rather low with this approach.

I suspect that the bbduk tool might work well for the other pipelines as well. I should note that the RiboDetector approach could definitely be improved upon! For example, if we made use of GPU acceleration it could get much faster, and if we needed a potentially more conservative tool it could be a good option (or of course we could use bbduk with a more conservative reference file as well...).

Note that sortmerna and bbduk used the same reference file, so they are the easiest to compare one to one. It is interesting to me that sortmerna calls slightly fewer reads as rRNA than bbduk in Samples 1 and 3 (Sample 2 goes the other way by under one percentage point). That is almost the opposite of what I would have naively expected, though perhaps this can be tuned with different values of e.


And yes, also noting here that, as Nick pointed out, the tests may indicate that one explanation for sortmerna's failures is simply that it takes too long on samples with a high percentage of rRNA. That would mean it likely wouldn't work as well for the metatranscriptomics samples, but might still work ok for metagenomics samples. Just noting as part of the food for thought!


Cheers!

nickp60 commented 5 months ago

Closing this issue for now; the long-running datasets in question appear to be 16S data mislabeled as shotgun metagenomes.