jtamames / SqueezeMeta

A complete pipeline for metagenomic analysis
GNU General Public License v3.0

CD-HIT taking too long when running in merged mode. Is it as expected? #615

Closed by auroralabastida 1 year ago

auroralabastida commented 1 year ago

Hello there! I am running the assembly of 24 samples in merged mode. I started with 6.7 - 11.2 gigabases and 44 - 74 million PE (2x150) reads per sample. I am running on a cloud instance with a 3.6 GHz processor, 32 GB of RAM and 8 cores.

The assembly of each file took 4 - 5 hours and produced assemblies with avg. contig size from 457 to 665 bp and N50 from 441 to 677 bp. A total of 23 million contigs were obtained for all of the samples.
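For readers unfamiliar with the N50 figure quoted above, here is a minimal sketch of how it is usually computed from a list of contig lengths. The function name and the example lengths are made up for illustration and are not taken from this run.

```python
def n50(lengths):
    """Return the N50 of a set of contig lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [1000, 800, 600, 400, 200]
example_n50 = n50(example_lengths)
example_mean = sum(example_lengths) / len(example_lengths)
print(example_n50, example_mean)
```

Values of a few hundred bp, as reported here, mean that most of the assembled bases sit in very short contigs.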

CD-HIT-EST started running after the last assembly with parameters:

/home/ubuntu/miniconda3/envs/SqueezeMeta/SqueezeMeta/bin/cd-hit-est -i /home/ubuntu/merged/temp/mergedassemblies.merged.fasta -o /home/ubuntu/merged/temp/mergedassemblies.merged.99.fasta -T 8 -M 0 -c 0.99 -d 100 -aS 0.9 > /dev/null 2>&1
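As a reading aid, the sketch below rebuilds the same argument list in Python and annotates each option according to the CD-HIT documentation. The helper name `build_cdhit_est_command` is hypothetical; the actual command is issued by the pipeline itself, so this only restates what it is doing.

```python
import shlex


def build_cdhit_est_command(binary, merged_fasta, output_fasta, threads=8):
    """Return the cd-hit-est argument list quoted in the post (hypothetical helper)."""
    return [
        binary,
        "-i", merged_fasta,   # input FASTA: contigs from all per-sample assemblies
        "-o", output_fasta,   # output FASTA: one representative per cluster
        "-T", str(threads),   # number of threads
        "-M", "0",            # memory limit in MB; 0 means unlimited
        "-c", "0.99",         # sequence identity threshold (99%)
        "-d", "100",          # length of the description kept in the .clstr file
        "-aS", "0.9",         # alignment must cover 90% of the shorter sequence
    ]


cmd = build_cdhit_est_command(
    "/home/ubuntu/miniconda3/envs/SqueezeMeta/SqueezeMeta/bin/cd-hit-est",
    "/home/ubuntu/merged/temp/mergedassemblies.merged.fasta",
    "/home/ubuntu/merged/temp/mergedassemblies.merged.99.fasta",
)
print(shlex.join(cmd))
```

Note that the original command sends both stdout and stderr to /dev/null, so no progress messages are visible while it runs.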

CD-HIT-EST has already been running for 17 hours.

Is CD-HIT behaving as expected? How much longer can I expect the process to take?

Thank you very much! Aurora

jtamames commented 1 year ago

Hello. Oh yes, it is perfectly normal. Take into account that you are running a huge analysis that will probably yield around 100-200 million genes, enough to keep you busy for the rest of your career. That will take some time. In fact, that is not the slowest step in the process: after it, minimus2 will try to merge the assemblies, and that will take a lot of time. Besides, your computer is not very powerful. It is not unreasonable to expect the analysis to take several weeks to finish. Seqmerge mode is faster than merged mode, but not by much. Best, J

auroralabastida commented 1 year ago

Hello Javier! Thanks for your quick reply.

I could change to a more powerful instance. My limitation here is the cost of the AWS On-Demand service, which is calculated according to time of usage. The price per hour increases more or less proportionately to the number of cores and RAM requested.

Would any of the following steps have their run time reduced in proportion to the increase in RAM or cores?

  1. Assembly and assembly merging
     a. Megahit
     b. CD-HIT-EST
     c. minimus2
  2. Prodigal
  3. Diamond
  4. Hmmer
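The cost side of this question can be sketched with a simple Amdahl-style model: if the hourly price grows in proportion to the core count, a bigger instance only breaks even when a step's run time shrinks in the same proportion. Everything below is a hypothetical illustration. The function names, the price, the baseline hours and the parallel fraction are assumptions, not measurements from SqueezeMeta or AWS prices, and the model ignores RAM entirely.

```python
def runtime_hours(base_hours, base_cores, cores, parallel_fraction):
    """Estimate run time on `cores`, given a run measured on `base_cores`.

    parallel_fraction is the (assumed) share of the baseline run time
    that scales with the number of cores; the rest stays constant.
    """
    serial = base_hours * (1 - parallel_fraction)
    parallel = base_hours * parallel_fraction * base_cores / cores
    return serial + parallel


def instance_cost(hours, cores, price_per_core_hour):
    """Cost when the hourly price is proportional to the core count."""
    return hours * cores * price_per_core_hour


# Hypothetical numbers, for illustration only.
BASE_HOURS = 100.0   # assumed run time of one step on the 8-core instance
PRICE = 0.05         # assumed price per core-hour

for fraction in (1.0, 0.5):
    for cores in (8, 32):
        hours = runtime_hours(BASE_HOURS, 8, cores, fraction)
        cost = instance_cost(hours, cores, PRICE)
        print(f"parallel={fraction:.1f} cores={cores:2d} "
              f"hours={hours:6.1f} cost={cost:7.2f}")
```

Under these assumptions, a perfectly parallel step costs the same on either instance and just finishes sooner, while any serial portion makes the larger instance more expensive per run.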

I have seen that step 4 could require a lot of RAM. What is the minimum RAM that you would recommend? I can ask for up to 512 GB.

Some of my questions may be more appropriate for a seer, so any estimate that you could give me would be of great help.

Finally, is it viable to restart the process from the minimus2 step?

Thank you again! Aurora

jtamames commented 1 year ago

Hello again. For switching to seqmerge, take a look at issue #420. Assembly and merging will of course benefit from having more RAM. Indeed, it is unlikely that minimus2 can work with just 32 GB. Hmmer requires abundant RAM as well, but not as much as the assembly (and remember you can skip that step using the --nopfam flag). Another approach is to divide your samples into groups, analyze these separately, and then combine the results using combineSQM in SQMtools. Hope you can manage to get it done. Best, J
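The grouping approach mentioned above could be prepared with a small script like the following sketch. It assumes the samples file is tab-separated with the sample name in the first column and one row per read file, which is how I understand the SqueezeMeta samples file to be laid out; the function name, sample names and file names are hypothetical.

```python
def split_samples(lines, group_size):
    """Split samples-file rows into groups of `group_size` samples.

    All rows of one sample (e.g. its pair1 and pair2 files) stay
    in the same group. Assumes the sample name is the first
    tab-separated column.
    """
    order = []   # sample names in order of first appearance
    rows = {}    # sample name -> list of its rows
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        sample = line.split("\t")[0]
        if sample not in rows:
            rows[sample] = []
            order.append(sample)
        rows[sample].append(line)

    groups = []
    for start in range(0, len(order), group_size):
        chunk = order[start:start + group_size]
        groups.append([row for name in chunk for row in rows[name]])
    return groups


# Hypothetical 24-sample file with paired-end reads.
example_lines = []
for i in range(1, 25):
    example_lines.append(f"S{i:02d}\tS{i:02d}_R1.fastq.gz\tpair1")
    example_lines.append(f"S{i:02d}\tS{i:02d}_R2.fastq.gz\tpair2")

example_groups = split_samples(example_lines, group_size=8)
for number, group in enumerate(example_groups, start=1):
    print(f"group {number}: {len(group)} rows")
```

Each group would then be written to its own samples file and run as a separate project before the results are combined.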

auroralabastida commented 1 year ago

Thanks Javier. Taking into account that minimus2 might scale poorly, I will try the coassembly. In your experience, can I expect '-contiglen 500' to reduce the run time of the assembly (especially the cleaning rounds) or of the DIAMOND annotation? Have a nice day! Aurora

jtamames commented 1 year ago

Yes, of course. Setting -c will decrease all times from step 1 on. You could even set -c to a higher value (1000 or even more), given that you will still have tons of data. That will reduce the representation of rare species, so take that into account. But I think you will need much more RAM to attempt the coassembly. Unfortunately, it is difficult to know how much. Best, J
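To get a feeling for how much data a given -c / -contiglen value would discard, one could tabulate the contigs and bases that survive each cutoff in an existing assembly. The sketch below is only an illustration: the function names and the toy FASTA are made up, and it assumes contigs at least as long as the cutoff are kept.

```python
def fasta_lengths(lines):
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize_cutoffs(lengths, cutoffs):
    """Map each cutoff to (contigs kept, bases kept) for length >= cutoff."""
    summary = {}
    for cutoff in cutoffs:
        kept = [length for length in lengths if length >= cutoff]
        summary[cutoff] = (len(kept), sum(kept))
    return summary


# Toy FASTA with hypothetical contigs, for illustration only.
toy_fasta = []
for name, size in [("c1", 250), ("c2", 300), ("c3", 450), ("c4", 520),
                   ("c5", 700), ("c6", 1200), ("c7", 3000)]:
    toy_fasta.append(f">{name}")
    toy_fasta.append("A" * size)

toy_summary = summarize_cutoffs(fasta_lengths(toy_fasta), [200, 500, 1000])
for cutoff, (contigs, bases) in toy_summary.items():
    print(f"cutoff {cutoff}: {contigs} contigs, {bases} bases kept")
```

With N50 values near 500 bp, a cutoff of 500 or 1000 would remove a large share of the contigs, which is where both the time savings and the loss of rare species come from.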

auroralabastida commented 1 year ago

Thanks again! I will continue with the co-assembly with 512 GB of RAM, which was enough to finish the k21-kmer co-assembly in approximately 19 hours in a previous test. Thanks a lot for your help!