soedinglab / plass

sensitive and precise assembly of short sequencing reads
https://plass.mmseqs.com
GNU General Public License v3.0

Reduce TMP disk usage #5

Closed genomewalker closed 5 years ago

genomewalker commented 5 years ago

Hi Martin, would it be possible for PLASS to have an option to remove the intermediate files (i.e. pref_, aln_, assembly_) of iterations that are no longer needed in the following steps? For some assemblies, the disk usage explodes and goes up to several terabytes. As a temporary solution, I added the following lines to assembler.sh to remove the files from previous steps:

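  # From the third iteration on (STEP >= 2), delete the intermediate files
  # written two iterations earlier; the files of STEP-1 are left in place.
  if [ "${STEP}" -ge 2 ]; then
    PSTEP="$((STEP-2))"
    # Each rm covers one file family: the bare name, name_* and name.*
    rm -f "${TMP_PATH}/pref_${PSTEP}"
    rm -f "${TMP_PATH}/pref_${PSTEP}"_*
    rm -f "${TMP_PATH}/pref_${PSTEP}".*
    rm -f "${TMP_PATH}/aln_${PSTEP}"
    rm -f "${TMP_PATH}/aln_${PSTEP}"_*
    rm -f "${TMP_PATH}/aln_${PSTEP}".*
    rm -f "${TMP_PATH}/assembly_${PSTEP}"
    rm -f "${TMP_PATH}/assembly_${PSTEP}"_*
    rm -f "${TMP_PATH}/assembly_${PSTEP}".*
  fi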

Many thanks, Antonio

milot-mirdita commented 5 years ago

Hi Antonio, the files from the previous steps are important in case something goes wrong: Plass can then be restarted and continue where it left off. One thing we are currently working on in MMseqs2, and which will therefore end up in Plass very soon, is compressed databases, which should also help with this issue. A more general solution will take some time, since we want to introduce this feature to all workflows (of MMseqs2 and Plass) at the same time. Best regards, Milot

genomewalker commented 5 years ago

Hi Milot, compressed databases sound awesome! I look forward to them! Compressed DBs will be very useful for the mapping step too; the prefiltering DBs are also huge :-)

For now I will take the risk of removing the files from previous steps. Our nodes have a limited scratch space of 2 TB, and several assemblies die for lack of space.

Many thanks! Antonio

jacodela commented 5 years ago

I've been having the same issue: a large but not massive dataset needs more than 4 TB of tmp space, regardless of whether I'm running a coassembly or assembling sample by sample. @genomewalker did your temporary fix work correctly?

genomewalker commented 5 years ago

Yes, it does solve the problem, but as @milot-mirdita pointed out, you will not be able to restart if something fails.
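
One way to keep both behaviours available is to put the workaround above behind an opt-in switch, so that the default still keeps everything needed for a restart. The following is only a sketch of that idea: the `REMOVE_OLD_STEPS` variable and both function names are hypothetical and are not part of Plass or assembler.sh; only the pref_, aln_ and assembly_ file prefixes are taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: delete the pref_/aln_/assembly_ files of one iteration.
# Arguments: $1 = temporary directory, $2 = iteration number to delete.
remove_step_files() {
    rs_tmp="$1"
    rs_step="$2"
    for rs_prefix in pref aln assembly; do
        # Covers e.g. pref_3, pref_3_* and pref_3.*, but not pref_30.
        # rm -f stays silent when a pattern matches nothing.
        rm -f "${rs_tmp}/${rs_prefix}_${rs_step}" \
              "${rs_tmp}/${rs_prefix}_${rs_step}"_* \
              "${rs_tmp}/${rs_prefix}_${rs_step}".*
    done
}

# Hypothetical opt-in switch (an assumption, not a Plass option): cleanup
# only runs when REMOVE_OLD_STEPS=1, so by default nothing is deleted.
# Arguments: $1 = temporary directory, $2 = current iteration number.
cleanup_if_enabled() {
    ce_tmp="$1"
    ce_step="$2"
    if [ "${REMOVE_OLD_STEPS:-0}" = "1" ] && [ "${ce_step}" -ge 2 ]; then
        # Same rule as the workaround: drop the files from two steps back.
        remove_step_files "${ce_tmp}" "$((ce_step - 2))"
    fi
}
```

Inside the iteration loop this would be called as `cleanup_if_enabled "${TMP_PATH}" "${STEP}"`. With the switch unset the files stay on disk and a restart remains possible; with `REMOVE_OLD_STEPS=1` the disk savings come at the cost described above.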

martin-steinegger commented 5 years ago

Plass now removes temporary files on the fly. This should reduce disk consumption by roughly a factor of 12.