rongou opened this issue 6 years ago
Hi, how can I get involved with this issue?
I really don't know much about PMIx. If you are interested, you can try to prototype a solution.
Right now we start the worker pods and they just sleep; the launcher then calls mpirun to launch the processes remotely. With PMIx, my understanding is that each worker pod would start a PMIx server, and the launcher can then start the processes using the PMIx API.
Probably need to dig a bit into Open MPI and/or SLURM code to figure this out.
I see. I can give this issue a try.
Just to help clarify a bit: PMIx is just a library - there is no PMIx server "daemon" to run. The way you use it is to have your local launcher daemon on each node dlopen (or link against) the PMIx library and initialize it as a "server" (instead of a "client" or "tool"). This provides access to all the PMIx APIs, including the ones dedicated to server-side operations (see the pmix_server.h header for a list of them).
You would use these to get a launch "blob" for configuring the network prior to starting procs on compute nodes, and for various other operations. Your launcher would still start the processes itself.
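As a rough illustration of that server-side initialization, here is a minimal, untested sketch assuming PMIx v4-style headers. The callback module is left empty for brevity (a real launcher daemon would supply handlers for fence, spawn, and the other server operations), and the tmpdir path is only a placeholder:

```c
#include <stdio.h>
#include <pmix_server.h>

/* Callback module: a real launcher daemon would provide handlers
 * (client_connected, fence_nb, spawn, ...) here. Left empty for brevity. */
static pmix_server_module_t mymodule = {0};

int main(int argc, char **argv)
{
    pmix_status_t rc;
    pmix_info_t *info;

    /* Optional directives for the server, e.g. where to put its session files.
     * The tmpdir path here is purely illustrative. */
    PMIX_INFO_CREATE(info, 1);
    PMIX_INFO_LOAD(&info[0], PMIX_SERVER_TMPDIR, "/tmp/pmix-session", PMIX_STRING);

    /* Initialize this daemon as a PMIx *server* (not a client or tool). */
    rc = PMIx_server_init(&mymodule, info, 1);
    PMIX_INFO_FREE(info, 1);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_server_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* ... register the namespace and local clients, fork/exec the app procs,
     *     and hand each of them its PMIx connection info ... */

    PMIx_server_finalize();
    return 0;
}
```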
I'd be happy to advise/help get this running - we'd love to see it in Kubernetes!
Please do holler if/when I can be of help. I confess my ignorance of the Kubernetes internals, but am happy to advise (and hopefully be educated along the way). I'd like to see all the MPIs and workflow-based models that rely on interprocess messaging get supported.
/assign @zw0610
From my experience with Slurm built with PMIx, users need no 'launcher pod' for each submitted job. This seems like a clear benefit for mpi-operator.
I've been searching for a minimal workable example/tutorial on using openpmix without Slurm for a while, but have not succeeded. So @rhc54, would you mind providing us with such a tutorial/example for setting up a PMIx environment and launching an MPI task? It seems related to prte, but the whole workflow is not very clear to me.
Happy to try. I'll read up a bit on Kubernetes and Kubeflow so I can try to provide more concrete direction (pointers to material would be welcome!). Meantime, you might be able to gain some insights from the following:
I'll work on a wiki specifically aimed at Kubernetes as a couple of organizations have expressed interest in such an integration, especially with the PMIx support for app-directed optimized operations becoming more popular with the workflow community. Can't promise a completion date, but I'll do my best.
Meantime, please feel free to ask questions.
Thank you so much for your prompt help, Ralph.
I watched your presentation video and believe there are two scenarios for working with Kubernetes and PMIx:

1. Run a secondary resource manager (with PMIx support) inside the job's containers.
2. Integrate the PMIx server into Kubernetes itself so that the node agent (kubelet) can take it as a container runtime.

While the second scenario looks more native to Kubernetes, the first one is much more similar to the contemporary design of this repo (mpi-operator) and should take less effort to achieve. So I would prefer the first one as the short-term, smaller-scoped approach and will go through the material you offered.

As some users/developers have suggested moving mpi-operator from the Kubeflow community to the Kubernetes community, we can try the second option as a long-term, broader-scoped project after we accumulate enough experience from the first attempt.
The negative to the first option is that you still have to completely instantiate a secondary RM - e.g., if that RM is Slurm, then one of the containers must include the slurmctld, and the other containers must include the required info for the slurmd daemons to connect back to that slurmctld. This means that the users who construct these containers must essentially be Slurm sys admins, or at least know how to install and set up Slurm.
Alternatively, someone (probably the sys admin for the target Kubernetes environment) could provide users with a "base" container that has Slurm setup in it. However, that places constraints on the user as (for instance) the sys admin is unlikely to provide a wide array of containers users can choose from based on various operating systems. The sys admin would also have to guess when configuring Slurm as to how a user plans to utilize the container - or else the user will have to learn enough about Slurm to at least modify the configuration as required for their use-case.
The objective of the second option is to eliminate the need for a secondary RM and allow the user's container to strictly focus on the application. As you note, it does require that the PMIx server be integrated into Kubernetes itself so that the applications in the containers can remain simple PMIx clients. However, I believe it would best support the growing interest in HPC workflow computing (i.e., breaking down the traditional MPI bulk-synchronous model into many independent tasks that coalesce into the answer) and hybrid (data analytics + deep learning + MPI) programming models. It is the only method that lets the user focus solely on their application instead of having to learn how to setup and manage an RM, and the only method that allows the container to be portable across systems (e.g., a Kubernetes-based cloud and a Slurm-based HPC cluster).
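To illustrate what "simple PMIx clients" means in practice, here is a minimal, untested sketch of an application-side stub (assuming the PMIx v4 client headers); it connects to whatever PMIx server the surrounding runtime provides, queries the job size, and exits:

```c
#include <stdio.h>
#include <pmix.h>

int main(int argc, char **argv)
{
    pmix_proc_t myproc, wildcard;
    pmix_value_t *val;
    pmix_status_t rc;

    /* Connect to whatever PMIx server the runtime (Kubernetes-integrated or
     * otherwise) has started for us; the app does not care which it is. */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Ask the server for the size of our job. */
    PMIX_LOAD_PROCID(&wildcard, myproc.nspace, PMIX_RANK_WILDCARD);
    rc = PMIx_Get(&wildcard, PMIX_JOB_SIZE, NULL, 0, &val);
    if (PMIX_SUCCESS == rc) {
        printf("rank %u of %u procs in namespace %s\n",
               myproc.rank, val->data.uint32, myproc.nspace);
        PMIX_VALUE_RELEASE(val);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}
```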
Personally, I believe the second option is the better one and have focused my attention on it. However, I certainly understand it is more challenging and you may choose to pursue the first option in its place, at least for now. Let me know if/how I can help.
Let me update the progress so far. But first, sorry for the late update, as I was working on the python-sdk-for-mpijob feature.
I went through most of the material mentioned by Ralph and got a quite basic understanding of pmix. Following this article from PBS Pro, I've made a Docker image with both openpmix and prrte installed.
(Please note that 1) neither /opt/pmix/bin nor /opt/prrte/bin is included in PATH; 2) no entrypoint is specified for the image.)
After starting prrte with prte -d, I was able to use prun to launch a process within the same container (where prte is running). So far, I am blocked by two issues:

1. prun -H <hostfile> -n 2 xxx failed when prun is executed in another container (on k8s). In short, I was not able to launch processes via prun and prte remotely.
2. Even with prte and prun working remotely, that still does not give me a clear idea of how to get rid of the launcher container, which means starting a job without prun. Maybe we can follow the PMI standard and let mpi-operator tell prte on each worker pod directly what processes should be launched (see the sketch below). Is that workable? Is that a really good idea?

But anyway, let me fix the first issue and I'll keep updating here.
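On the second point, one possible shape for "telling prte directly what to launch" would be a small PMIx tool that connects to the running prte daemon and calls PMIx_Spawn instead of shelling out to prun. A rough, untested sketch (assuming PMIx v4 tool support, a prte DVM already running, and hostname as a placeholder application; real use would likely need extra info directives to pick the right server and map processes across pods):

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pmix_tool.h>

int main(int argc, char **argv)
{
    pmix_proc_t myproc;
    pmix_status_t rc;
    pmix_app_t app;
    pmix_nspace_t nspace;

    /* Connect to a running PMIx server (e.g., the prte daemon on this node). */
    rc = PMIx_tool_init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_tool_init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Describe the application to launch; "hostname" is just a placeholder. */
    PMIX_APP_CONSTRUCT(&app);
    app.cmd = strdup("hostname");
    app.argv = (char **)calloc(2, sizeof(char *));
    app.argv[0] = strdup("hostname");
    app.maxprocs = 2;

    /* Ask the runtime to spawn the procs; no prun involved. */
    rc = PMIx_Spawn(NULL, 0, &app, 1, nspace);
    if (PMIX_SUCCESS == rc) {
        printf("spawned job in namespace %s\n", nspace);
    } else {
        fprintf(stderr, "PMIx_Spawn failed: %s\n", PMIx_Error_string(rc));
    }

    PMIX_APP_DESTRUCT(&app);
    PMIx_tool_finalize();
    return 0;
}
```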
/assign
There hasn't been much motion on this issue. Is there anything I can do to help?
I would love to see this completed, if possible. I learned that a colleague at IBM (@jjhursey) is going to describe a Kubernetes/PMIx effort this week - perhaps he could share how that might relate here?
It can also be shared during a Kubeflow training meeting. Please let me know if you plan to do so, as I'd like to attend.
Also, for those of you at SC this week - Jai Dayal (@jaidayal) of Samsung is going to give a very brief description of the work we are collaborating on to use PMIx to integrate dynamic applications to a dynamic scheduler at the PMIx BoF meeting. There will also be a couple of other talks about similar efforts. I would heartily recommend attending, if you can. If nothing else, it might be worth your while to make the connections so as to follow those works.
My expectation is that we will be releasing several updates next year focused on dynamic operations - preemption of running jobs, request for resource allocations/adjustments, etc. The BoF will provide an introduction to those efforts.
My talk at SC22 was presented at the CANOPIE-HPC workshop. The organizers should be posting the slides at some point. I'm planning on giving a more PMIx-focused version of the talk during the PMIx Standard ASC meeting in Jan (link).
Thanks, I'll take a look once the talks are available.
Shameless plug, in case you didn't know: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html
It would be great if the PMIx solution could be integrated with Kubeflow or (even better) with the Kubernetes Job API directly.
/cc
Shameless plug, in case you didn't know: https://opensource.googleblog.com/2022/10/kubeflow-applies-to-become-a-cncf-incubating-project.html
Ah - no I was unaware of this! Congrats to all involved. I'm retired and so wouldn't really be able to write the integration code, but I am happy to advise and/or contribute where possible if someone wishes to pursue this. Having a "native" way of starting parallel applications in a Kubeflow environment would seem desirable, and extending that later to support dynamic integration to the scheduler itself would seem a win-win for all.
The Open MPI people suggested running a PMIx server on each worker pod and using the PMIx API to launch the processes. We need to investigate whether that's a better approach.
SLURM has some useful information: https://slurm.schedmd.com/mpi_guide.html
PMIx home: https://pmix.org/