nf-core / mag

Assembly and binning of metagenomes
https://nf-co.re/mag
MIT License

Add option to split SPAdes read correction into separate process or enable SPAdes checkpoints #440

Open alexhbnr opened 1 year ago

alexhbnr commented 1 year ago

Description of feature

When running metaSPAdes as part of nf-core/mag, SPAdes first performs read correction and then the actual assembly steps. With the sensible default resource settings of nf-core/mag, SPAdes might run out of memory for large samples with a lot of sequencing data. When the step is restarted, SPAdes starts from scratch and performs the read correction again, even if it completed successfully in the previous attempt.

The read correction step is rather time-consuming and can take more than 15 hours for samples with more than 100 million reads. However, it often has slightly lower memory requirements than the actual assembly steps. Repeating the read correction each time SPAdes fails due to insufficient memory in the assembly step seems to me a waste of resources and computing time. The same is true of simply running all samples with high memory requirements by default.

There are two possible solutions to avoid this dilemma:

  1. SPAdes can restart from checkpoints, i.e. the last completed step, and would therefore not re-run read correction if this step finished successfully in a previous attempt. However, with my limited knowledge of Nextflow, I assume this might be tricky given that a new temporary folder is created for each process.
  2. The SPAdes process is split into SPADES_READCORRECTION and SPADES_ASSEMBLY. SPADES_ASSEMBLY would still start from the files produced by SPADES_READCORRECTION, but this would avoid re-running the read correction if the assembly step fails.
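
To make the two options more concrete, here is a rough sketch of the SPAdes command lines each one would boil down to, written as small Python helpers that only build the argument lists. It relies on the `spades.py` flags `--meta`, `--only-error-correction`, `--only-assembler` and `--restart-from`, as I understand them from the SPAdes manual. The function names, the file paths and the "base memory times attempt number" rule are assumptions for illustration and are not taken from nf-core/mag.

```python
from typing import List


def spades_base(outdir: str, threads: int, memory_gb: int) -> List[str]:
    """Arguments shared by all calls: metaSPAdes mode, threads, memory limit in GB."""
    return ["spades.py", "--meta", "-t", str(threads), "-m", str(memory_gb), "-o", outdir]


def read_correction_cmd(r1: str, r2: str, outdir: str, threads: int, memory_gb: int) -> List[str]:
    """Option 2, first process: run only the read error correction."""
    return spades_base(outdir, threads, memory_gb) + ["--only-error-correction", "-1", r1, "-2", r2]


def assembly_cmd(cor_r1: str, cor_r2: str, outdir: str, threads: int, memory_gb: int) -> List[str]:
    """Option 2, second process: assemble already corrected reads, skipping correction.

    cor_r1 / cor_r2 are the corrected read files written by the first process
    (their exact names are decided by SPAdes, so they are passed in here).
    """
    return spades_base(outdir, threads, memory_gb) + ["--only-assembler", "-1", cor_r1, "-2", cor_r2]


def restart_cmd(outdir: str, threads: int, memory_gb: int) -> List[str]:
    """Option 1: resume from the last checkpoint in an existing output folder.

    This only helps if the output folder of the failed attempt is still
    available, which is the tricky part with a fresh work directory per attempt.
    """
    return ["spades.py", "--restart-from", "last", "-t", str(threads), "-m", str(memory_gb), "-o", outdir]


def memory_for_attempt(base_gb: int, attempt: int) -> int:
    """Hypothetical retry rule: scale the memory request with the attempt number."""
    return base_gb * attempt


if __name__ == "__main__":
    # Hypothetical sample: correction runs once, only the assembly is retried.
    print(" ".join(read_correction_cmd("s1_R1.fastq.gz", "s1_R2.fastq.gz", "s1_ec", 8, 64)))
    for attempt in (1, 2):
        mem = memory_for_attempt(128, attempt)
        print(" ".join(assembly_cmd("s1_cor_R1.fastq.gz", "s1_cor_R2.fastq.gz", "s1_asm", 8, mem)))
```

With a split like this, a retry with more memory would repeat only the `--only-assembler` call, while the corrected reads from the first call are reused as regular process outputs.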

d4straub commented 1 year ago

To my knowledge, only option 2 works.