Run time with 7,000 viral genomes

cerebis commented 2 years ago

Hello, I have been attempting to analyse a block of metagenomically derived putative viral genomes. These have been curated to a certain extent, so the confidence in the set is reasonably good.

That said, I gave this job to a physical machine with 32 cores and 250 GB of memory. The job has currently accumulated 90 hours of wall time and 125 hours of CPU time. Since I specified 32 CPUs, I was hoping for more effective use of the available concurrency -- i.e. maybe a CPU time / wall time ratio of ~20.
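For reference, the figures quoted above can be turned into an effective-parallelism number with a few lines of Python. This is only a back-of-the-envelope sketch; the function name `effective_parallelism` is made up for illustration and is not part of Cenote-Taker 2.

```python
def effective_parallelism(cpu_hours, wall_hours):
    """Average number of cores kept busy over the run (CPU time / wall time)."""
    return cpu_hours / wall_hours


# Figures reported in this thread: 125 h CPU time, 90 h wall time, 32 cores requested.
ratio = effective_parallelism(125, 90)
utilisation = ratio / 32

print(f"effective parallelism: {ratio:.2f}x")      # effective parallelism: 1.39x
print(f"share of 32 cores used: {utilisation:.1%}")  # share of 32 cores used: 4.3%
```

A ratio of about 1.4 against the hoped-for ~20 suggests that most of the elapsed time was spent in steps that run on a single core.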

The pipeline has been processing the largest directory, named no_end_contigs_with_viral_domain, and most of that time seems to have been spent on steps involving Phanotate ORF prediction. I suppose it will follow this with annotation? It has already completed the same with Prodigal I think -- you're thorough!

Anyway, I had a look at the main script cenote-taker2.1.3.sh and I would like to ask if you have ever considered converting these steps to Nextflow? You would achieve reliable parallelisation that could be further tuned per step, depending on the IO, memory and CPU bounds of the related programs. You would also be able to keep a lot of the complex bash-fu as it is currently. The pipeline would also be easily moved onto HPCs through different queuing systems.
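To make the idea concrete: the core of what a workflow manager offers here is running one command per input, several at a time. The Python sketch below shows only that fan-out pattern. It is not Nextflow and not how cenote-taker2.1.3.sh is written; `run_per_input` and its parameters are hypothetical names, and the command to run is left entirely to the caller.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_per_input(command_prefix, inputs, max_workers=4):
    """Run `command_prefix + [item]` once per item, at most max_workers at a time.

    Threads are sufficient because the real work happens in the child processes.
    Returns a list of (item, return code, stdout) tuples in input order.
    """
    def run_one(item):
        result = subprocess.run(
            list(command_prefix) + [str(item)],
            capture_output=True,
            text=True,
        )
        return item, result.returncode, result.stdout

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `inputs`.
        return list(pool.map(run_one, inputs))
```

A real workflow manager adds much more on top of this (resuming failed runs, per-step resource limits, cluster submission), which is where the per-step tuning mentioned above would come in.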

I would offer you my time to do so, but I am pretty limited at present.

mtisza1 commented 2 years ago

Hi and thanks for your interest in the tool!

So it seems like you're probably running annotation mode -am True on 7000 putative virus contigs. This is great, but perhaps too much to handle for your CPUs and walltime allocation.

I was curious why the PHANOTATE step was taking so long, so I've been running some tests. The issue turns out to be a "for loop" right next to the PHANOTATE/transeq step (it is there to format the fasta files correctly). It works, but it is very inefficient, and it runs into time issues with huge virus datasets. I have usually split my Cenote-Taker 2 runs by SRA run/sample, so they generally haven't exceeded a few hundred virus contigs, and this for loop hasn't caused any problems for me.

I've replaced this for loop with a bioawk one liner which reduces the time for this step dramatically. I'm going to test this a little more and push the changes tomorrow afternoon (Eastern USA time).

Also yes, ORF functional annotation will follow the PHANOTATE/Prodigal step.

When I did most of the coding for this tool, I wasn't really aware of Nextflow or Snakemake. If I have time for a major major update in the future (Cenote-Taker 3??), I'll certainly use one of these to manage/parallelize the steps. Currently my priorities are elsewhere, but I'm committed to maintaining Cenote-Taker 2 as a functional tool.

I'm sorry that you've put so many hours of your machine into it, but my guess is that this run will not finish (and I don't have a "continue option" at the moment). When I update this troublesome for loop, I would probably suggest running 1000 genomes and see how long that takes, then partition the rest as necessary. Annotation, in general, is very computationally intensive, and, while I've done my best to parallelize this tool, (as we've seen) it's still not perfect in that regard. I will say that runtime decreases with more CPUs, but it seems you are already running all your available resources.
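One way to carve out a 1000-genome benchmark and partition the rest is to split the input FASTA into fixed-size batches before running the tool. The following is a minimal sketch in plain Python, assuming one genome per FASTA record; the function name `split_fasta`, the `batch` prefix and the output file naming are all made up for illustration and are not part of Cenote-Taker 2.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, batch_size=1000, prefix="batch"):
    """Split a FASTA file into files holding at most batch_size records each.

    Returns the list of written paths, in order.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    out = None
    n_records = 0
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # A header line starts a new record; open a new file
                    # whenever the current batch is full.
                    if n_records % batch_size == 0:
                        if out is not None:
                            out.close()
                        path = out_dir / f"{prefix}_{len(written) + 1:03d}.fasta"
                        out = open(path, "w")
                        written.append(path)
                    n_records += 1
                # Anything before the first header is skipped.
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

For example, `split_fasta("my_contigs.fasta", "batches")` (with a hypothetical input file name) would write `batches/batch_001.fasta`, `batches/batch_002.fasta`, and so on. The first batch can serve as the timing benchmark, and the remaining ones can be run separately once the per-batch runtime is known.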

Best,

Mike

mtisza1 commented 2 years ago

Update:

I've fixed the piece of code that was causing some undue delay. I do hope it will allow you to run all of your virus contigs without too much pain. I still suggest that you start with 1000 or so genomes as a time benchmark. (My guess is that it will still take about 10 hours with 32 CPUs).

To update the script:

cd Cenote-Taker2
git pull

Let me know if you run into any additional issues!

Regards,

Mike

mtisza1 / Cenote-Taker2

Run time with 7,000 viral genomes #14