When running metaSPAdes as part of nf-core/mag, the first step is read correction, followed by the actual assembly steps. With the sensible default resource settings of nf-core/mag, SPAdes might run out of memory for large samples with a lot of sequencing data. Upon re-starting the step, SPAdes then starts from scratch and performs the read correction again, even if it was successful in the previous attempt.
The read correction step is rather time-consuming and can take more than 15 hours for samples with more than 100 million reads. However, it often has slightly lower memory requirements than the actual assembly steps. Restarting with read correction each time SPAdes fails due to insufficient memory in the assembly step seems to me a waste of resources and computing time. The same is true for simply running all samples with high memory requests by default.
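For reference, with a single SPAdes process the memory can only be raised for the step as a whole, e.g. via a custom config. A minimal sketch, assuming the process selector is `SPADES` and using arbitrary example values (neither is an nf-core/mag default):

```nextflow
// custom.config -- hypothetical values, passed with `-c custom.config`
process {
    withName: 'SPADES' {
        // grow the memory request on every retry; note that the expensive
        // read correction is still repeated on each attempt
        memory        = { 200.GB * task.attempt }
        errorStrategy = 'retry'
        maxRetries    = 2
    }
}
```

Either the request is set high enough for every sample up front (wasteful for most of them), or each retry repeats the read correction, which is exactly the dilemma described above.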
There are two possible solutions to avoid this dilemma:
1. SPAdes allows restarting from checkpoints, i.e. from the last completed step, and therefore would not re-run read correction if that step finished successfully in a previous attempt. However, despite my limited knowledge of Nextflow, I assume this might be tricky given that a new work directory is created for each task (see the first sketch after this list).
2. The SPAdes process is split into SPADES_READCORRECTION and SPADES_ASSEMBLY. SPADES_ASSEMBLY would still run off the files produced by SPADES_READCORRECTION, but it would avoid re-running read correction in case the assembly step fails (see the second sketch after this list).
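For the first option, SPAdes itself supports resuming a run with `spades.py --continue -o <outdir>`, which picks up from the last completed stage and ignores all other command-line options. The tricky part is indeed the work-directory isolation: a retried task runs in a fresh work dir, so the previous output directory would have to live somewhere persistent. A rough sketch of what that could look like, assuming a hypothetical `params.spades_scratch` shared path and the usual nf-core-style `meta` map:

```nextflow
process SPADES {
    errorStrategy 'retry'
    maxRetries 2
    memory { 200.GB * task.attempt }   // arbitrary example values

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("spades_out/contigs.fasta"), emit: contigs

    script:
    // persistent per-sample directory outside the task work dir (assumption)
    def outdir = "${params.spades_scratch}/${meta.id}"
    """
    if [ -d "${outdir}" ]; then
        # a previous attempt left checkpoints behind -> resume instead of restarting
        spades.py --continue -o "${outdir}"
    else
        spades.py --meta \\
            -1 ${reads[0]} -2 ${reads[1]} \\
            -t ${task.cpus} -m ${task.memory.toGiga()} \\
            -o "${outdir}"
    fi
    # stage the results back into the work dir so Nextflow can track and publish them
    cp -r "${outdir}" spades_out
    """
}
```

Writing outside the work directory side-steps Nextflow's task isolation and caching, which is why this approach feels fragile.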
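For the second option, SPAdes already provides the switches such a split would need: `--only-error-correction` and `--only-assembler`. A minimal sketch of the two processes, assuming paired-end input and glossing over the exact names of the corrected read files (the real nf-core/mag module will differ):

```nextflow
process SPADES_READCORRECTION {
    // comparatively low memory requirements, but long runtime

    input:
    tuple val(meta), path(reads)

    output:
    tuple val(meta), path("correction/corrected/*.fastq.gz"), emit: corrected_reads

    script:
    """
    spades.py --meta --only-error-correction \\
        -1 ${reads[0]} -2 ${reads[1]} \\
        -t ${task.cpus} -m ${task.memory.toGiga()} \\
        -o correction
    """
}

process SPADES_ASSEMBLY {
    // high memory requirements; only this process needs to be retried
    // (with more memory) when the assembly runs out of memory
    errorStrategy 'retry'
    maxRetries 2

    input:
    tuple val(meta), path(corrected)

    output:
    tuple val(meta), path("assembly/contigs.fasta"),   emit: contigs
    tuple val(meta), path("assembly/scaffolds.fasta"), emit: scaffolds

    script:
    // selecting the paired corrected files is simplified here; SPAdes adds
    // its own suffixes (and an unpaired file) in the corrected/ directory
    """
    spades.py --meta --only-assembler \\
        -1 ${corrected[0]} -2 ${corrected[1]} \\
        -t ${task.cpus} -m ${task.memory.toGiga()} \\
        -o assembly
    """
}
```

With this split, a `-resume` after a failed assembly would reuse the cached SPADES_READCORRECTION result and only re-execute SPADES_ASSEMBLY.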