nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0

Process prolog and epilog #540

Open pditommaso opened 6 years ago

pditommaso commented 6 years ago

The goal of this enhancement is to add two new process directives, namely prolog and epilog, that would make it possible to add a common script prefix and suffix to a process task.

These are different from the beforeScript and afterScript directives, which are designed mainly for custom task configuration and, above all, are executed in the task wrapper context, which can be a separate environment from the task execution one (in particular when using containers).

Instead, the prolog content should be prepended to the user script just after the shebang declaration, and the epilog should be appended to the user task script.
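
To illustrate the distinction, here is a rough sketch of how the existing directives and the proposed ones might sit side by side (prolog/epilog are hypothetical syntax based on this proposal, not an implemented feature, and the module name, paths and commands are made up):

process ALIGN {
    // existing directives: executed by the task wrapper, outside any container
    beforeScript 'module load samtools'
    afterScript  'echo "wrapper finished" >> wrapper.log'

    // proposed directives: injected into the task script itself, so they would
    // run in the same environment as the command (e.g. inside the container)
    prolog 'export PATH=/opt/extra/bin:$PATH'
    epilog 'samtools --version > versions.txt'

    input:
    path reads

    script:
    """
    samtools sort ${reads} -o sorted.bam
    """
}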

pditommaso commented 6 years ago

This can be a little harder than expected. The basic idea was to prepend/append the prolog and epilog snippets to the user script when defined. However, given that the prolog and epilog are expected to be BASH code, that would corrupt non-BASH user commands.

Nor can it be included in the command launcher, i.e. .command.run, because then it would not be executed when the user command is run in a container image.

ODiogoSilva commented 6 years ago

Nor can it be included in the command launcher, i.e. .command.run, because then it would not be executed when the user command is run in a container image.

Is this due to the issue of not adding the bin directory to the path when using a container image? Is there any particular reason to prevent both from being used in a process?

In my case, I need to execute some bash scripts (which are in the bin directory) in the work directory before (and after) some processes (I originally thought that the epilog and prolog were meant to allow the execution of such scripts, which wouldn't have an impact on the user commands/templates). This is regardless of whether the processes run in a container image or not. In fact, I made a small tweak in Nextflow to allow the usage of both bin and container images in the same process and it has been working fine. Maybe I'm missing some use cases where they would conflict somehow?

pditommaso commented 6 years ago

No, this happens because it's the .command.run that launches the container, therefore it cannot execute the prolog and the epilog.

One solution, which I would like to avoid, is to use an intermediate wrapper that contains the prolog, runs the command, and finally executes the epilog.

A better alternative could be to embed the user prolog and epilog in a couple of variables in the .command.run via heredoc declarations, then append them to the container execution command line, but there could be tricky side effects with special character expansion or the maximum length of the command line.
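
For illustration only, the heredoc idea might look roughly like this fragment of a generated .command.run (a sketch, not actual Nextflow-generated code; the variable names, image and commands are invented):

# capture the user-defined snippets in variables via quoted heredocs
nxf_prolog=$(cat <<'PROLOG'
export PATH=/opt/extra/bin:$PATH
PROLOG
)
nxf_epilog=$(cat <<'EPILOG'
samtools --version > versions.txt
EPILOG
)

# wrap the real task script with them on the container command line;
# special-character expansion and the command-line length limit are
# exactly the tricky parts mentioned above
docker run --rm -v "$PWD:$PWD" -w "$PWD" my/image \
    bash -c "$nxf_prolog; bash .command.sh; $nxf_epilog"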

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

ewels commented 4 years ago

Bump

abhi18av commented 3 years ago

With the usage of this functionality coming up in some nf-core workflows (see discussion here) as well as among independent users, perhaps this might be a useful feature to implement in NF.

Building upon the idea shared earlier in the thread, I'm adding a relevant doc link for reference: https://tldp.org/LDP/abs/html/here-docs.html

ewels commented 3 years ago

Regarding the nf-core tool version calls, see this related issue: https://github.com/nextflow-io/nextflow/issues/879

Thinking about it now, it might be good to have both prolog/epilog and versionCmd (presumably working in the same way). This is because in nf-core pipelines we will presumably set epilog for every single process, meaning that if an end user wants to run something custom then they will not be able to without overwriting that and losing the version information.

Also need to think about how staging files in and out will work. For example, with the above scenario, overwriting epilog would break everything as the processes will be expecting the version call output file as a channel output.

If prolog and epilog are not meant for use by the end user but only by the pipeline developer, then neither of these is particularly problematic. But in that case, maybe they don't add so much functionality either (as they can be part of the script block, as we currently do).
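
To make the override concern concrete, a hedged sketch (the epilog directive is still hypothetical, and the commands and filenames are made up): a pipeline-wide epilog capturing versions would be silently lost as soon as a user sets their own.

// hypothetical pipeline-level config: every task records its tool versions
process {
    epilog = 'my_tool --version > versions.txt'
}

// hypothetical user-level override for one process: the custom snippet replaces
// the version capture, so a `path 'versions.txt'` output declared by the module
// would no longer be produced
process {
    withName: 'FASTQC' {
        epilog = 'cleanup_scratch.sh'
    }
}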

Final point: I would personally spell it epilogue 😉

pditommaso commented 3 years ago

It's a programming lang not Shakespeare 😆

ewels commented 3 years ago

It's a programming lang not Shakespeare 😆

haha 😆 Ok if it's a well known programming term that's fine. I mis-typed it about 4 times in that past issue before googling it to check I wasn't going insane. I haven't come across it in code before. epilog is good though 👍🏻

abhi18av commented 3 years ago

Also need to think about how staging files in and out will work. For example, with the above scenario, overwriting epilog would break everything as the processes will be expecting the version call output file as a channel output.

Perhaps the issue regarding the version info file could be solved on the Tower side (Tower report specific channels have been discussed elsewhere), since it'd be essentially a file containing the version string and then this could be displayed/gathered into a report via Tower.

Thoughts?

NOTE: Just wanted to confirm that we all agree that the epilog and prolog aren't supposed to be used for heavy processing, i.e. they don't have the same use-case as script/shell/exec, as that might create confusing usage patterns.

ewels commented 3 years ago

Tower is nice, but we want the version numbers even when people are running without Tower 😉 I view it as a fairly essential output from the pipeline.

pditommaso commented 3 years ago

Yeah, we discussed this ages ago… 😬 https://github.com/nextflow-io/nextflow/issues/879

kemin711 commented 1 year ago

Maybe related to the prolog/epilog concept. I have been thinking about one case: fileA -> fileB; fileB -> fileC (fileC will be used by downstream processes); fileB --compress--> fileB.gz, which will not be used by any process, but I want to publish it for later usage. In order to make my workflow faster, I don't want to wait for the compression to finish before letting fileC be used by downstream processes. My question is: does the current workflow language take advantage of this? Does a downstream process have to wait for the complete completion of the task before the output can be used, or can the result be used as soon as the file is finished (this may not be possible because you have to wait for the completion of all commands in the script section)? It might be beneficial to let the process separate the script section into the current script plus another section, background_scripts.

bentsherman commented 1 year ago

In that case, you should define two separate processes, one to produce fileC and another to compress fileB. Any downstream processes can depend only on the first process, while the second process would run "in the background".
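
A minimal sketch of that two-process layout (process names, commands and the publish directory are illustrative only):

process MAKE_C {
    input:
    path fileB

    output:
    path 'fileC'

    script:
    """
    transform ${fileB} > fileC    # placeholder for the real fileB -> fileC step
    """
}

process COMPRESS_B {
    publishDir 'results/compressed'

    input:
    path fileB

    output:
    path "${fileB}.gz"

    script:
    """
    gzip -9 -c ${fileB} > ${fileB}.gz
    """
}

workflow {
    fileB_ch = Channel.fromPath(params.fileB)
    MAKE_C(fileB_ch)        // downstream steps consume MAKE_C.out
    COMPRESS_B(fileB_ch)    // runs independently; only publishes the archive
}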

kemin711 commented 1 year ago

That's how my workflow is designed. Only in special cases, where my files are large, are the processes done on scratch. To avoid copying files, I did the opposite of what we normally do and merged several processes into one. This is very rare, and it is only for performance. Essentially, we have large gzipped fastq files and do some operations with them. One way is to use the compressed files as input and decompress on the fly; this saves space. Another way is to inflate the .gz files first and then operate on them; this can speed things up, but at the cost of more storage usage. If we move the operation to /scratch, then the IO problem can be resolved. We want the main logic to process the inflated files; at the end of this main process, downstream processes can start working with the result files derived from these large files, while in the meantime we keep compressing the files at a high compression level and store the compressed files for future use (not in this pipeline). There might be a possibility to enable the script section to branch; currently the script section is an atomic operation.

rollf commented 3 months ago

I used the following approach for now:

process MY_PROCESS {
    ...
    script:
    def prolog = task.ext.prolog ?: ''
    ...
    """
    $prolog
    <actual script>
    """
}

and then in the configuration:

withName: 'MY_PROCESS' {
    ext.prolog = "export PATH=/some/other/location:\$PATH"
}

And then I use nf-core modules patch for nf-core modules as necessary. My use case: Run a custom docker image but use predefined nf-core modules. The custom docker image needed some setup to be run upfront before the nf-core-based module would work (hence the PATH adaptation in ext.prolog above).

I'm in favor of a Nextflow-based solution.

kemin711 commented 3 months ago

Thanks for the feedback.

ewels commented 3 months ago

@rollf note that the 24.04 release included a new directive, eval (see docs), which I think is very similar to the prolog suggested in this issue 🤔

That might be a cleaner solution for your use case?
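
For readers who land here, a minimal sketch of eval as I understand it from the 24.04 docs: it declares an extra process output whose value is the standard output of a command evaluated in the task environment after the script runs. The command below is just an example.

process sayHello {
    output:
    eval('bash --version')

    script:
    """
    echo Hello world!
    """
}

workflow {
    sayHello | view   // prints the captured version string
}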

rollf commented 3 months ago

@ewels Thanks for the hint. I do not understand the suggestion, though. eval would allow me to add further outputs (channels) to the process; however, my use case is that I want to arbitrarily modify the existing script that runs in the container. Possibly I'm completely wrong here, but I don't see the connection between the two. :shrug:

(As a side remark, eval seems to be missing in the overview here.)

ewels commented 3 months ago

Yeah you're right. I was thinking that eval adds snippets of code to the start of the process, outside of the script block - which is kind of what you want. But I hadn't really thought it through - you're right that it doesn't make sense here as you'd still need to edit the process. Apologies.

pditommaso commented 3 months ago

Added the missing reference to eval in the output overview. Thanks for reporting it. 97090673