kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

investigate PMIx #12

Open rongou opened 6 years ago

rongou commented 6 years ago

The Open MPI people suggested running a PMIx server on each worker pod and using the PMIx API to launch the processes. We need to investigate whether that's a better approach.

SLURM has some useful information: https://slurm.schedmd.com/mpi_guide.html
PMIx home: https://pmix.org/

yncxcw commented 6 years ago

Hi, how can I get involved in this issue?

rongou commented 6 years ago

I really don't know much about PMIx. If you are interested, you can try to prototype a solution.

Right now we start the worker pods and have them sleep; the launcher then calls mpirun to launch the processes remotely. With PMIx, my understanding is that each worker pod would start a PMIx server, and then the launcher could start the processes using the PMIx API.

Probably need to dig a bit into Open MPI and/or SLURM code to figure this out.

yncxcw commented 6 years ago

I see. I can give this issue a try.

rhc54 commented 6 years ago

Just to help clarify a bit: PMIx is just a library - there is no PMIx server "daemon" to run. The way you use it is to have your local launcher daemon on each node dlopen (or link against) the PMIx library and initialize it as a "server" (instead of a "client" or "tool"). This provides access to all the PMIx APIs, including the ones dedicated to server-side operations (see the pmix_server.h header for a list of them).

You would use these to get a launch "blob" for configuring the network prior to starting procs on compute nodes, among various other operations. Your launcher would still start the processes itself.
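
To make that concrete, here is a bare-bones, illustrative sketch (not existing mpi-operator code) of a per-node launcher daemon initializing the PMIx library in server mode. A real daemon would implement the pmix_server_module_t callbacks it needs and register its namespace and local clients before forking the application processes:

```c
/* Illustrative sketch only: a per-node launcher daemon embedding the
 * PMIx server-side library. All host callbacks are left unimplemented. */
#include <stdio.h>
#include <pmix_server.h>

static pmix_server_module_t mymodule;   /* all callbacks NULL in this sketch */

int main(void)
{
    pmix_status_t rc;

    /* Initialize the library in "server" mode (see pmix_server.h). */
    rc = PMIx_server_init(&mymodule, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_server_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* ... a real daemon would now call PMIx_server_register_nspace and
     * PMIx_server_register_client, use PMIx_server_setup_fork to obtain
     * the environment for the application procs, fork/exec them, and
     * service PMIx requests until the job completes ... */

    PMIx_server_finalize();
    return 0;
}
```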

I'd be happy to advise/help get this running - we'd love to see it in Kubernetes!

rhc54 commented 6 years ago

Please do holler if/when I can be of help. I confess my ignorance of the Kubernetes internals, but am happy to advise (and hopefully be educated along the way). I'd like to see all the MPIs and workflow-based models that rely on interprocess messaging get supported.

gaocegege commented 3 years ago

/assign @zw0610

zw0610 commented 3 years ago

From my experience with Slurm built with PMIx, users need no 'launcher pod' for each submitted job. This seems like a clear benefit for mpi-operator.

I've been searching for a minimal working example/tutorial for using OpenPMIx without Slurm for a while but did not succeed. So @rhc54, would you mind providing us with such a tutorial/example for setting up a PMIx environment and launching an MPI task? It seems related to prte, but the whole workflow is not very clear to me.

rhc54 commented 3 years ago

Happy to try. I'll read up a bit on Kubernetes and Kubeflow so I can try to provide more concrete direction (pointers to material would be welcome!). Meantime, you might be able to gain some insights from the following:

I'll work on a wiki specifically aimed at Kubernetes as a couple of organizations have expressed interest in such an integration, especially with the PMIx support for app-directed optimized operations becoming more popular with the workflow community. Can't promise a completion date, but I'll do my best.

Meantime, please feel free to ask questions.

zw0610 commented 3 years ago

Thank you so much for your prompt help, Ralph.

I watched your presentation video and believe there are two scenarios for making PMIx work with Kubernetes:

  1. Each container is treated as an RM node, with its RM daemon running as the entry-point process. When a new MPI task is dispatched, new processes can be launched via the RM daemon.
  2. Wrapping the PMIx client so that Kubernetes (the kubelet) can take it as a container runtime.

While the second scenario looks more native to Kubernetes, the first one is much closer to the current design of this repo (mpi-operator) and should take less effort to achieve. So I prefer the first one as the short-term, smaller-scoped approach and will go through the material you offered.

As some users/developers have suggested moving mpi-operator from the Kubeflow community to the Kubernetes community, we can try the second option as a longer-term, broader-scoped project after we accumulate enough experience from the first attempt.

rhc54 commented 3 years ago

The negative to the first option is that you still have to completely instantiate a secondary RM - e.g., if that RM is Slurm, then one of the containers must include the slurmctld, and the other containers must include the required info for the slurmd daemons to connect back to that slurmctld. This means that the users who construct these containers must essentially be Slurm sys admins, or at least know how to install and set up Slurm.

Alternatively, someone (probably the sys admin for the target Kubernetes environment) could provide users with a "base" container that has Slurm setup in it. However, that places constraints on the user as (for instance) the sys admin is unlikely to provide a wide array of containers users can choose from based on various operating systems. The sys admin would also have to guess when configuring Slurm as to how a user plans to utilize the container - or else the user will have to learn enough about Slurm to at least modify the configuration as required for their use-case.

The objective of the second option is to eliminate the need for a secondary RM and allow the user's container to strictly focus on the application. As you note, it does require that the PMIx server be integrated into Kubernetes itself so that the applications in the containers can remain simple PMIx clients. However, I believe it would best support the growing interest in HPC workflow computing (i.e., breaking down the traditional MPI bulk-synchronous model into many independent tasks that coalesce into the answer) and hybrid (data analytics + deep learning + MPI) programming models. It is the only method that lets the user focus solely on their application instead of having to learn how to setup and manage an RM, and the only method that allows the container to be portable across systems (e.g., a Kubernetes-based cloud and a Slurm-based HPC cluster).
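
To illustrate what "simple PMIx clients" means in practice, the application process would need little more than the following sketch; it does not care whether the server behind it is prte, Slurm, or a Kubernetes-integrated PMIx server:

```c
/* Illustrative sketch of an application acting as a plain PMIx client. */
#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val = NULL;
    pmix_status_t rc;
    uint32_t job_size = 0;

    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Ask whatever server we connected to for the job size. */
    PMIX_PROC_CONSTRUCT(&wildcard);
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    if (PMIX_SUCCESS == PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val)) {
        job_size = val->data.uint32;
        PMIX_VALUE_RELEASE(val);
    }

    printf("rank %u of %u in namespace %s\n",
           (unsigned)myproc.rank, (unsigned)job_size, myproc.nspace);

    PMIx_Finalize(NULL, 0);
    return 0;
}
```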

Personally, I believe the second option is the better one and have focused my attention on it. However, I certainly understand it is more challenging and you may choose to pursue the first option in its place, at least for now. Let me know if/how I can help.

zw0610 commented 3 years ago

Let me update the progress so far. But first, sorry for the late update; I was working on the Python SDK for MPIJob feature.

I went through most of the material Ralph mentioned and got a basic understanding of PMIx. Following this article from PBS Pro, I've made a Docker image with both OpenPMIx and PRRTE installed. (Please note that 1) neither /opt/pmix/bin nor /opt/prrte/bin is included in PATH; 2) no entrypoint is specified for the image.)

After starting PRRTE with prte -d, I was able to use prun to launch processes within the same container (the one where prte is running). So far, I am blocked by two issues:

  1. prun -H <hostfile> -n 2 xxx fails when prun is executed in another container (on k8s). In short, I was not able to launch processes remotely via prun and prte.
  2. Even if I could launch processes remotely with prte and prun, that still would not tell me how to get rid of the launcher container, i.e., how to start a job without prun. Maybe we can follow the PMIx standard and have mpi-operator tell prte on each worker pod directly which processes should be launched (see the sketch at the end of this comment). Is that workable? Is it actually a good idea?

Anyway, let me fix the first issue first; I'll keep posting updates here.
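
For the second point, one direction I'm considering (an untested sketch, assuming each worker pod runs a prte daemon forming a DVM that the operator-side component can reach) is to have that component act as a PMIx tool and call PMIx_Spawn itself, which is essentially what prun does internally. The binary name ./mpi_hello is just a placeholder:

```c
/* Untested sketch: a launcher-less submitter connecting to a running prte
 * DVM as a PMIx "tool" and spawning the MPI processes via PMIx_Spawn. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pmix_tool.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;
    pmix_app_t app;
    pmix_nspace_t job;

    /* Connect to an existing PMIx server (e.g. prte); how the tool finds
     * the server's contact info is environment-specific. */
    if (PMIX_SUCCESS != (rc = PMIx_tool_init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_tool_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Describe the application: two copies of ./mpi_hello (placeholder). */
    PMIX_APP_CONSTRUCT(&app);
    app.cmd = strdup("./mpi_hello");
    app.argv = (char **)calloc(2, sizeof(char *));
    app.argv[0] = strdup("./mpi_hello");
    app.maxprocs = 2;

    rc = PMIx_Spawn(NULL, 0, &app, 1, job);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Spawn failed: %s\n", PMIx_Error_string(rc));
    } else {
        printf("spawned job in namespace %s\n", job);
    }

    PMIX_APP_DESTRUCT(&app);
    PMIx_tool_finalize();
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}
```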

ArangoGutierrez commented 2 years ago

/assign

thoraxe commented 1 year ago

There hasn't been much motion on this issue. Is there anything I can do to help?

rhc54 commented 1 year ago

I would love to see this completed, if possible. I learned that a colleague at IBM (@jjhursey) is going to describe a Kubernetes/PMIx effort this week - perhaps he could share how that might relate here?

alculquicondor commented 1 year ago

It can also be shared during a kubeflow training meeting. Please let me know if you plan to do so, as I'd like to attend.

rhc54 commented 1 year ago

Also, for those of you at SC this week - Jai Dayal (@jaidayal) of Samsung is going to give a very brief description of the work we are collaborating on to use PMIx to integrate dynamic applications with a dynamic scheduler at the PMIx BoF meeting. There will also be a couple of other talks about similar efforts. I would heartily recommend attending, if you can. If nothing else, it might be worth your while to make the connections so you can follow that work.

My expectation is that we will be releasing several updates next year focused on dynamic operations - preemption of running jobs, request for resource allocations/adjustments, etc. The BoF will provide an introduction to those efforts.

jjhursey commented 1 year ago

My talk at SC22 was presented at the CANOPIE-HPC workshop.

The organizers should be posting the slides at some point. I'm planning on giving a more PMIx-focused version of the talk during the PMIx Standard ASC meeting in January.

alculquicondor commented 1 year ago

Thanks, I'll take a look once the talks are available.

Shameless plug, in case you didn't know: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html

It would be great if the PMIx solution could be integrated with Kubeflow or (even better) with the Kubernetes Job API directly.

ahg-g commented 1 year ago

/cc

rhc54 commented 1 year ago

> Shameless plug, in case you didn't know: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html

Ah - no, I was unaware of this! Congrats to all involved. I'm retired and so wouldn't really be able to write the integration code, but I am happy to advise and/or contribute where possible if someone wishes to pursue this. Having a "native" way of starting parallel applications in a Kubeflow environment seems desirable, and extending that later to support dynamic integration with the scheduler itself would seem a win-win for all.