nrlab-CRUK / TAP

Trim and Align Pipeline

Intermediate files are filling the disk #5

Open rich7409 opened 1 year ago

rich7409 commented 1 year ago

When TAP runs on large data sets it can fill the disk quota. The intermediate files can take up ten times as much disk space as the final results. Can they be cleaned up as we go?

rich7409 commented 1 year ago

I've done a bit of experimenting and added a small Groovy script, removeInput.groovy, that can be run after the main part of a task. It removes a symbolic link and, crucially, the target of that link. It should only be run when the main process has succeeded.
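The script itself is not shown in this thread. Purely as an illustration of the idea, here is a minimal sketch in Python (not the Groovy of removeInput.groovy). The function name remove_link_and_target is hypothetical, and the sketch assumes the input is a symbolic link pointing at a regular file.

```python
import os


def remove_link_and_target(link_path):
    """Delete a symbolic link and the file it points to.

    Hypothetical helper for illustration only; this is not the actual
    removeInput.groovy. Returns the list of paths that were removed.
    """
    removed = []
    if not os.path.islink(link_path):
        # Not a symlink: leave it alone rather than guess what to delete.
        return removed

    # Resolve the link (following any chain of links) before unlinking it.
    target = os.path.realpath(link_path)

    # Remove the target first (the large file), then the link itself.
    if os.path.isfile(target):
        os.remove(target)
        removed.append(target)
    os.remove(link_path)
    removed.append(link_path)
    return removed
```

Refusing to touch anything that is not a symbolic link is a deliberate choice in this sketch: it keeps the helper from deleting a real file that was passed in by mistake.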

We need to use it carefully, because deleting inputs will stop the pipeline from being resumed. For that reason it is controlled by the parameter EAGER_CLEANUP, which is off by default, leaving the intermediate files in place. If a large data set is causing disk issues, it can be set to true, and some tasks will then delete their inputs once they have succeeded.
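The gating described above can be sketched as follows. This is an illustration in Python rather than the pipeline's own code. Only the EAGER_CLEANUP name comes from the pipeline; parse_flag, should_clean_inputs, and the convention that an exit code of 0 means success are assumptions made for the example.

```python
def parse_flag(value):
    """Interpret a parameter value such as "true" or False as a boolean."""
    return str(value).strip().lower() in ("true", "1", "yes")


def should_clean_inputs(eager_cleanup, exit_code):
    """Decide whether a task may delete its inputs.

    eager_cleanup: the value of the EAGER_CLEANUP parameter (off by default).
    exit_code: exit status of the task's main command (0 assumed to mean success).
    Hypothetical helper; both conditions must hold before anything is deleted.
    """
    return parse_flag(eager_cleanup) and exit_code == 0
```

With the default of off, should_clean_inputs(False, 0) is false and nothing is deleted, so the run can still be resumed. A failed task keeps its inputs even when EAGER_CLEANUP is true.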

rich7409 commented 1 year ago

I've given this a bit of a test and it seems to work. Here are two runs on the same data, the first with the intermediate files left in place and the second with them removed.

nm168s011789 1011% du -hcs work/
21G     work/
21G     total

nm168s011789 1012% du -hcs work/
3.7G    work/
3.7G    total

The work directory goes from 21G down to 3.7G when the intermediate files are removed, a saving of roughly 82%.