nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

"cleanup" option does not remove staged files from S3 #5091

Open stevekm opened 3 months ago

stevekm commented 3 months ago

Bug report

When using Nextflow with the cleanup = true option, input files staged from S3 are left in the work dir.

Expected behavior and actual behavior

To automatically clean up the work directory after a successful pipeline run, I was hoping that the cleanup option described here would also remove the S3 input files that were staged during pipeline execution. This does not seem to be the case: the files remain in the work dir under a path such as work/stage-xyz.

You can reproduce this by running a pipeline with input files on S3 and including the option cleanup = true in your nextflow.config file. The contents of the task work dirs are removed, but the staged files remain.
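For reference, the config side of the reproduction described above can be sketched as follows. This is a minimal fragment, not the reporter's actual config; it assumes a local work directory and a pipeline whose inputs are given as s3:// paths, which is what causes Nextflow to stage them.

```groovy
// nextflow.config
// Remove the task work directories after a successful run.
// As reported above, this does not appear to touch the staged
// S3 inputs under work/stage-xyz.
cleanup = true
```

After a successful run with this setting, listing the work directory should show whether a stage-xyz directory is still present.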

Environment

Additional context

Not sure if this is intentional?

bentsherman commented 3 months ago

The cleanup only iterates through the task directories, which is why it doesn't delete those stage directories. In fact, I don't think the cleanup works at all on S3 (see #3645).

You can use nf-boost which has an experimental cleanup that is more efficient, but I haven't implemented cleanup for the stage directories.

If I recall correctly, each run has its own stage directory of the pattern work/stage-${sessionId}, so a simple solution would be to just delete that directory at the end. A more aggressive solution would be to delete individual subdirectories as soon as they aren't needed anymore, but I'm not sure how difficult that would be.

stevekm commented 3 months ago

Thanks. I was hoping for a solution that could be bundled inside nextflow.config so that it would run automatically. I will try out nf-boost as well, though I would still want some way to "un-stage" the S3 files at the end of the pipeline.

bentsherman commented 3 months ago

You might be able to do it with a workflow onComplete handler in the config file. Something like this:

// nextflow.config
workflow.onComplete = {
    // delete this run's stage directory (work/stage-<sessionId>)
    workDir.resolve("stage-${workflow.sessionId}").deleteDir()
}

See also: https://nextflow.io/docs/latest/metadata.html#decoupling-metadata