nextflow-io / nf-hack17

Nextflow hackathon 2017 projects

Project 9: Approaches to scaling out Nextflow. #9

Open tdudgeon opened 7 years ago

tdudgeon commented 7 years ago

Project

Approaches to scaling out Nextflow. Comparing running in Kubernetes and HPC environments.

Data:

A few examples of workflows can be found here, but no specific large datasets will be needed.

Computing resources:

I expect this will mostly be working out how to try things out on different environments, rather than trying to execute any particular HPC workflows, but if access to a small Kubernetes cluster was possible that would be useful. Same for HPC environments.

Project Lead:

Tim Dudgeon. Assistance from people familiar with using Nextflow in these environments would be useful.

pditommaso commented 7 years ago

This issue may be related: https://github.com/nextflow-io/nextflow/issues/446

tdudgeon commented 7 years ago

We handled this as part of #11, as AWS Batch is one of the suitable systems for running large jobs.

One thing discussed was that it would be useful to have a section in the docs that compares the different executors (differences, pros, cons etc.).

tdudgeon commented 7 years ago

Regarding Kubernetes, the docs for the executor are here.

It states that each Nextflow process is run as a pod and that the node running that pod must have the kubectl command line tool installed. This raises some questions:

  1. Why as a pod and not a Kubernetes job?
  2. Does NF provide a mechanism to select the node to run on using a node selector (assuming you have labels that can identify suitable nodes)? Also, which namespace should it run in? Another need might be to specify specialist capabilities, such as the node having GPU support.
  3. When running on OpenShift (Red Hat's distribution of Kubernetes), pods/jobs would need to be launched using the oc tool rather than kubectl.

It will be interesting to investigate how running on Kubernetes compares with other executors, e.g. how the Kubernetes scheduler and other aspects compare.
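For reference, points 2 above could be addressed at the manifest level. A minimal sketch of a per-task pod with a node selector, namespace, and GPU request (the names, labels, and image below are illustrative assumptions, not anything Nextflow actually generates):

```yaml
# Illustrative pod manifest, not Nextflow output: shows how a node selector,
# a namespace, and a GPU resource limit could steer a task onto suitable nodes.
apiVersion: v1
kind: Pod
metadata:
  name: nf-task-example        # hypothetical task pod name
  namespace: pipelines         # question 2: which namespace to run in
spec:
  restartPolicy: Never         # retries left to Nextflow's error strategy
  nodeSelector:
    workload: nextflow         # assumes nodes are labelled accordingly
  containers:
    - name: task
      image: busybox           # placeholder image
      command: ["sh", "-c", "echo hello"]
      resources:
        limits:
          nvidia.com/gpu: 1    # GPU support; requires the NVIDIA device plugin
```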

pditommaso commented 7 years ago

Hi Tim, regarding your points:

  1. When I first tried the Job specification, I noticed that when a job failed it was rescheduled for re-execution even if the restart policy was set to Never (see here). The discussion suggested that this was the intended behaviour (?) and frankly I don't know whether it has been fixed or still behaves like that. I changed to a pod because the error-handling strategy for failing tasks needs to be managed by NF.
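  For context, the behaviour in question involves two separate knobs: restartPolicy on the pod template (whether the kubelet restarts the container in place) and, if I recall correctly from Kubernetes 1.8 onwards, backoffLimit on the Job (how many times the Job controller recreates a failed pod). A sketch with both set to avoid retries (names and image are placeholders):

```yaml
# Illustrative Job spec: restartPolicy: Never alone does not stop the Job
# controller from creating replacement pods; backoffLimit caps that.
apiVersion: batch/v1
kind: Job
metadata:
  name: nf-task-job            # hypothetical name
spec:
  backoffLimit: 0              # do not recreate failed pods
  template:
    spec:
      restartPolicy: Never     # do not restart the container in place
      containers:
        - name: task
          image: busybox       # placeholder image
          command: ["sh", "-c", "exit 1"]
```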

  2. No, but we may want to add that feature. Currently this is delegated to the underlying cluster, which should implement a queue concept.

  3. An even better solution could be to interact directly with the Kubernetes REST API. What do you think? Or do you need to manage some specific OpenShift feature?
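  Roughly, the idea is that the same HTTP call works against both stock Kubernetes and OpenShift for core resources, removing the kubectl/oc dependency entirely. A sketch of the request involved (APISERVER and TOKEN are placeholders for a real API server URL and a service-account bearer token; here we only print the endpoint, since no cluster is available):

```shell
# Sketch of the REST call that could replace shelling out to kubectl/oc.
# With curl against a live cluster the request would be:
#
#   curl -X POST "$APISERVER/api/v1/namespaces/default/pods" \
#        -H "Authorization: Bearer $TOKEN" \
#        -H "Content-Type: application/json" \
#        -d @pod.json
#
# APISERVER is a placeholder, not a value from this thread.
APISERVER="https://kubernetes.default.svc"
echo "POST ${APISERVER}/api/v1/namespaces/default/pods"
```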