nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Wait until pipeline completes before publishing #5239

Open adamrtalbot opened 3 weeks ago

adamrtalbot commented 3 weeks ago

New feature

Currently, Nextflow publishes files as it proceeds through the pipeline. If the pipeline fails, it halts and leaves the files published so far in place. This creates messy, partial output structures that need to be cleaned up after the run.

Usage scenario

It may be desirable to wait until we know the pipeline has completed successfully before publishing. This ensures the full set of output files is produced and isolates running the pipeline from creating the output files.

This is partly possible with the new output syntax, because you can use channel operations to 'hold' the files until all of them are ready. For example, the following uses groupTuple to hold the files until they have all been emitted, essentially blocking publication until the upstream part of the pipeline has completed.

workflow {
    ...
    my_channel
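        // Give every item the same key so they all fall into one group
        .map { myFiles ->
            tuple( "1", myFiles )
        }
        // With no group size set, the group is emitted only once my_channel has closed
        .groupTuple()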
        .set { output_channel }

    publish:
    output_channel >> 'outputs'
}

output {
    directory 'results'
}

However, we may simply wish to force publishing to happen after the pipeline completes but before workflow.onComplete runs.

Suggest implementation

This could be an additional option to the output section:

output {
    when 'onComplete' // 'immediately' (default), 'onProcessComplete', 'whenPigsFly'
    mode 'copy'
    ...
}

Or some other way of holding the files back until the last moment.

bentsherman commented 3 weeks ago

The problem is that deferring all publishing to the end of the workflow can increase the total runtime significantly. The option you propose would punt the trade-off to the user, but I wonder if we can do better.

We have previously discussed simply deleting the output directory in the workflow.onError handler. I believe this behavior was added to the nf-core template. I think that's not so bad considering that it should be the minority path, i.e. most of your production runs should be succeeding or else you likely have bigger problems.
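A minimal sketch of that cleanup approach, assuming the output directory is the 'results' directory from the example above. The handler body is illustrative and is not taken from the nf-core template. Note that it removes the whole directory, including anything published there by earlier runs, and that it only fires when Nextflow gets the chance to run its error handler.

```groovy
// Sketch only: remove published outputs when the run fails.
// Assumption: everything is published under 'results'.
workflow.onError {
    def outdir = file('results')
    if( outdir.exists() ) {
        log.warn "Pipeline failed (${workflow.errorMessage}), deleting ${outdir}"
        // deleteDir() removes the directory and all of its contents
        outdir.deleteDir()
    }
}
```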

Another option I'm considering is to treat publish tasks like regular tasks and allow them to have an error strategy. There is already some basic retry logic, and the ignoreErrors option essentially functions as an "ignore" error strategy. I wonder if it would be useful to have something like a "finish" strategy for publishing as well.

adamrtalbot commented 3 weeks ago

> The problem is that deferring all publishing to the end of the workflow can increase the total runtime significantly. The option you propose would punt the trade-off to the user, but I wonder if we can do better.

I think this is OK. It's a choice the user makes: defer publishing to the end at the cost of runtime. If that's what they want, that's fine.

> We have previously discussed simply deleting the output directory in the workflow.onError handler. I believe this behavior was added to the nf-core template. I think that's not so bad considering that it should be the minority path, i.e. most of your production runs should be succeeding or else you likely have bigger problems.

I'm not a fan of this option because it doesn't work if the pipeline doesn't end gracefully or if anything prevents the files from being cleaned up. It just feels... inefficient? Creating files just to delete them later?

The main downside is that this directly conflicts with any cleanup methods that get implemented, although it wouldn't be hard to make the options mutually exclusive.

> Another option I'm considering is to treat publish tasks like regular tasks and allow them to have an error strategy. There is already some basic retry logic, and the ignoreErrors option essentially functions as an "ignore" error strategy. I wonder if it would be useful to have something like a "finish" strategy for publishing as well.

This seems like a good idea. It also keeps the syntax consistent, which is always a bonus.

Of course, this is low priority since with the new output DSL we can use channel operators to achieve this 🥳