nextflow-io / nextflow

A DSL for data-driven computational pipelines
http://nextflow.io
Apache License 2.0
2.68k stars, 621 forks

Allow processes to print statuses in real time to stdout #1608

Closed Zethson closed 3 years ago

Zethson commented 4 years ago

New feature

Nextflow should allow a process to print whatever it writes to stdout to the current stdout in real time. Currently, echo true only prints a process's stdout after the process has completed.

Usage scenario

There are several vital scenarios where this would come in handy.

  1. Machine learning: You absolutely need to monitor training in real time. Currently, ML integrated into Nextflow can run for days without any feedback about the training process. You never know how well your model is doing, nor how far along the training is. If the training goes poorly for whatever reason, you only find out after it's done some very expensive training.
  2. Print warnings when the data is bad or something unexpected is happening. Arbitrary example: Imagine you have 500 TB of sequencing data and are now running an nf-core pipeline that does quality checks and downstream processing on it. This will take a very long time and be expensive to run. If the quality checks could warn the user in real time that the data is awful, the pipeline could be stopped and a lot of time and money could be saved. Quality checks are fairly fast for NGS, but not necessarily for other domains such as proteomics.
  3. Many more, but I'm sure that you get the idea.

Suggest implementation

I expect that stdout would get very crowded if processes that run multiple times and in parallel all print their status messages. Hence, it would probably make sense to allow this only for processes that normally never split and do not run in parallel in Nextflow (such as machine learning on whole datasets, where the ML frameworks take care of distributing the batches to the GPUs). Maybe a new label would need to be introduced for this, e.g. echo-real-time. Alternatively, maybe only a user-defined number of parallel processes (say, up to 5) would be allowed to print their status messages. For example, a process gets the label echo-real-time 5, and the first five processes launched in parallel would each get their own line in stdout and echo their stdout in real time.
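To make the idea concrete, here is a rough sketch in plain Python of the behaviour I have in mind. It is not Nextflow code and says nothing about how Nextflow works internally. The function name `run_streaming` and the parameter `max_live` are made up for illustration, with `max_live` standing in for the number after the proposed echo-real-time label. The first `max_live` tasks echo each stdout line as it arrives, and the remaining tasks are held back until they finish, which mimics the current behaviour.

```python
import subprocess
import sys
import threading


def run_streaming(commands, max_live=5, out=sys.stdout):
    """Run all commands in parallel (hypothetical helper, not a Nextflow API).

    Tasks with index < max_live echo their stdout line by line as it
    arrives; the others have their output printed only after they exit.
    """
    lock = threading.Lock()  # keeps lines from different tasks from interleaving

    def worker(index, cmd):
        live = index < max_live
        held_back = []
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True, bufsize=1)
        for line in proc.stdout:  # yields each line as soon as the child emits it
            if live:
                with lock:
                    out.write(f"[task {index}] {line}")
                    out.flush()
            else:
                held_back.append(line)
        proc.wait()
        if not live:
            with lock:
                for line in held_back:
                    out.write(f"[task {index}] {line}")
                out.flush()

    threads = [
        threading.Thread(target=worker, args=(i, cmd))
        for i, cmd in enumerate(commands)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    # Two toy "training" jobs; only task 0 is echoed in real time.
    job = "import time\nfor i in range(3):\n    print('epoch', i)\n    time.sleep(0.2)"
    run_streaming([[sys.executable, "-u", "-c", job]] * 2, max_live=1)
```

One caveat with this approach: the child process has to flush its own stdout (hence `-u` for the Python jobs above), otherwise its lines still arrive late no matter what the parent does.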

This is the primary reason why we currently cannot use Nextflow for machine learning, which is a shame, since Nextflow is awesome for so many reasons and the machine learning community is huge.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

haedong31 commented 10 months ago

Isn't it related to: https://github.com/nextflow-io/nextflow/discussions/3421 ?