Open danielvegamyhre opened 1 year ago
Hey @vsoch. Have you got a working example with MPI and JobSet? I know you are doing some benchmarks for K8s/HPC so was just curious if you got MPI working with jobSet.
Hey @vsoch. Have you got a working example with MPI and JobSet? I know you are doing some benchmarks for K8s/HPC so was just curious if you got MPI working with jobSet.
I have several! I have a prototype of the Flux Operator with JobSet (but it's not easy to inspect) and just this weekend I put together OSU benchmarks and (a private app) called Netmark, both which are MPI based. Here is the little description / registry page, and tutorials to each (for Python and command line):
The objects are created by the operator, so they aren't explicitly there. Do you want me to throw one together (preferences?) and I can send you all the YAML for it?
I'd love an example with YAML.
okey doke! Here is a quick one! This is an older variant of what the operator makes but it should work. Note that:
Let me know if you have other question, happy to help! Also send along any metrics you like to run for kubernetes (IO/storage/performance/network etc.) I'm adding them to this operator for easy deployment! osu-benchmarks.yaml.txt
And here is a complete updated config with the logging updates I made to the metrics operator:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
name: metricset-sample
namespace: default
spec:
failurePolicy: {}
network:
enableDNSHostnames: false
subdomain: ms
replicatedJobs:
- name: l
replicas: 1
template:
metadata:
name: metricset-sample
namespace: default
spec:
activeDeadlineSeconds: 31500000
backoffLimit: 100
completionMode: Indexed
completions: 1
parallelism: 1
template:
metadata:
labels:
app.kubernetes.io/name: metricset-sample
cluster-name: metricset-sample
metricset-name: metricset-sample
namespace: default
name: metricset-sample
namespace: default
spec:
containers:
- command:
- /bin/bash
- /metrics_operator/osu-launcher.sh
image: ghcr.io/converged-computing/metric-osu-benchmark:latest
imagePullPolicy: IfNotPresent
name: launcher
resources: {}
stdin: true
tty: true
volumeMounts:
- mountPath: /metrics_operator/
name: metricset-sample
readOnly: true
workingDir: /opt/osu-benchmark/build.openmpi/libexec/osu-micro-benchmarks/mpi/one-sided
restartPolicy: OnFailure
setHostnameAsFQDN: true
shareProcessNamespace: false
subdomain: ms
volumes:
- configMap:
items:
- key: osu-launcher
mode: 511
path: osu-launcher.sh
- key: osu-worker
mode: 511
path: osu-worker.sh
name: metricset-sample
name: metricset-sample
- name: w
replicas: 1
template:
metadata:
name: metricset-sample
namespace: default
spec:
activeDeadlineSeconds: 31500000
backoffLimit: 100
completionMode: Indexed
completions: 1
parallelism: 1
template:
metadata:
labels:
app.kubernetes.io/name: metricset-sample
cluster-name: metricset-sample
metricset-name: metricset-sample
namespace: default
name: metricset-sample
namespace: default
spec:
containers:
- command:
- /bin/bash
- /metrics_operator/osu-worker.sh
image: ghcr.io/converged-computing/metric-osu-benchmark:latest
imagePullPolicy: IfNotPresent
name: workers
resources: {}
stdin: true
tty: true
volumeMounts:
- mountPath: /metrics_operator/
name: metricset-sample
readOnly: true
workingDir: /opt/osu-benchmark/build.openmpi/libexec/osu-micro-benchmarks/mpi/one-sided
restartPolicy: OnFailure
setHostnameAsFQDN: true
shareProcessNamespace: false
subdomain: ms
volumes:
- configMap:
items:
- key: osu-launcher
mode: 511
path: osu-launcher.sh
- key: osu-worker
mode: 511
path: osu-worker.sh
name: metricset-sample
name: metricset-sample
successPolicy:
operator: All
targetReplicatedJobs:
- l
suspend: false
---
apiVersion: v1
kind: ConfigMap
metadata:
name: metricset-sample
namespace: default
data:
osu-launcher: |
#!/bin/bash
# Who are we?
whoami
# Start ssh daemon
/usr/sbin/sshd -D &
# Allow network to ready
echo "Sleeping for 10 seconds waiting for network..."
sleep 10
# Write the hosts file
launcher=$(getent hosts metricset-sample-l-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
worker=$(getent hosts metricset-sample-w-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
echo "${launcher}" >> ./hostfile.txt
echo "${worker}" >> ./hostfile.txt
# Show metadata for run
echo "METADATA START {\"pods\":2,\"completions\":2,\"metricName\":\"network-osu-benchmark\",\"metricDescription\":\"point to point MPI benchmarks\",\"metricType\":\"standalone\",\"metricOptions\":{\"completions\":0,\"rate\":10},\"metricListOptions\":{\"commands\":[\"osu_acc_latency\",\"osu_fop_latency\",\"osu_get_acc_latency\",\"osu_get_latency\",\"osu_put_latency\"]}}
METADATA END"
sleep 5
echo METRICS OPERATOR COLLECTION START
echo METRICS OPERATOR TIMEPOINT
echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_acc_latency"
mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_acc_latency
echo METRICS OPERATOR TIMEPOINT
echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_fop_latency"
mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_fop_latency
echo METRICS OPERATOR TIMEPOINT
echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_acc_latency"
mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_acc_latency
echo METRICS OPERATOR TIMEPOINT
echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_latency"
mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_latency
echo METRICS OPERATOR TIMEPOINT
echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_put_latency"
mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_put_latency
echo METRICS OPERATOR COLLECTION END
osu-worker: |-
#!/bin/bash
# Who are we?
whoami
# Start ssh daemon
/usr/sbin/sshd -D &
# Allow network to ready
echo "Sleeping for 10 seconds waiting for network..."
sleep 10
# Write the hosts file
launcher=$(getent hosts metricset-sample-l-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
worker=$(getent hosts metricset-sample-w-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
echo "${launcher}" >> ./hostfile.txt
echo "${worker}" >> ./hostfile.txt
# Show metadata for run
echo "METADATA START {\"pods\":2,\"completions\":2,\"metricName\":\"network-osu-benchmark\",\"metricDescription\":\"point to point MPI benchmarks\",\"metricType\":\"standalone\",\"metricOptions\":{\"completions\":0,\"rate\":10},\"metricListOptions\":{\"commands\":[\"osu_acc_latency\",\"osu_fop_latency\",\"osu_get_acc_latency\",\"osu_get_latency\",\"osu_put_latency\"]}}
METADATA END"
sleep infinity
---
apiVersion: v1
kind: Service
metadata:
name: ms
namespace: default
spec:
clusterIP: None
internalTrafficPolicy: Cluster
ipFamilies:
- IPv4
ipFamilyPolicy: SingleStack
selector:
metricset-name: metricset-sample
sessionAffinity: None
type: ClusterIP
TLDR it makes it easier to parse the log into a meaningful output, e.g., just parsed and made these plots:
I was able to play around with your example and it worked! I do think some thoughts will be needed on how to make this more general for different number of ranks.
Awesome! Actually that's what the metrics operator does - that particular dump of YAML above is generated from:
apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
labels:
app.kubernetes.io/name: metricset
app.kubernetes.io/instance: metricset-sample
name: metricset-sample
spec:
# OSU benchmark is point to point and MUST be run with 2 pods
pods: 2
metrics:
- name: network-osu-benchmark
But also, the OSU benchmarks do not run on more than two ranks. They will throw up on you an error message if you try. If you are interested in that, we have a tool called netmark that makes nice matrices for point to point latency across an entire cluster and are planning to make it public (hopefully) in the next few months! https://github.com/converged-computing/google-batch/blob/main/netmark/img/netmark-experiment-size-100-heatmap.pdf
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
lifecycle/stale
is appliedlifecycle/stale
was applied, lifecycle/rotten
is appliedlifecycle/rotten
was applied, the issue is closedYou can:
/remove-lifecycle stale
/close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
@danielvegamyhre to give some high level comments - for this to work from JobSet, you'd need to add the same logic that we do for the metrics-operator above, or the MPI operator. At the simplest level, you need to have a main pod in the JobSet acting as a launcher (the one running the command) and then the rest are workers. They need to be configured to accept ssh. I don't think ssh is the best way to bootstrap (e.g., flux uses zeromq which we've shown to be faster) but it's a first shot. The design I'd advocate for here isn't something hard coded into JobSet, but rather an easy ability to define some pattern that can map a request to run a command to this design, and install the needed ssh stuff that is missing.
We definitely have models of this working with JobSet - several in fact, and I'm interested to hear your thoughts on such a general approach that would be provided natively (if I understand the gist of this issue correctly).
@danielvegamyhre to give some high level comments - for this to work from JobSet, you'd need to add the same logic that we do for the metrics-operator above, or the MPI operator. At the simplest level, you need to have a main pod in the JobSet acting as a launcher (the one running the command) and then the rest are workers. They need to be configured to accept ssh. I don't think ssh is the best way to bootstrap (e.g., flux uses zeromq which we've shown to be faster) but it's a first shot. The design I'd advocate for here isn't something hard coded into JobSet, but rather an easy ability to define some pattern that can map a request to run a command to this design, and install the needed ssh stuff that is missing.
We definitely have models of this working with JobSet - several in fact, and I'm interested to hear your thoughts on such a general approach that would be provided natively (if I understand the gist of this issue correctly).
Thanks for your thoughts on this! Yeah I prototyped MPI workloads using k8s primitives last year when we were originally researching the requirements for various ML and HPC workload types, to gather the requirements for JobSet and validate the API design.
To prototype MPI using k8s primitives I used an indexed Job, headless service, Secret for ssh key, and a configMaps for hostfile values. However, coming up with a design that can support all this and keeps JobSet generic/flexible and not too bespoke, is pretty challenging. So we prioritized support for other features that were being more commonly requested by users.
Now we are planning to work on support for these more complex workload configurations this year. I discussed this with @ahg-g recently and one idea he had was to support "extensions," where the user can specify an extension in their JobSet spec, and the webhooks + controllers will have points where they call a hook for whatever extensions the JobSet has, and the hook will perform mutation logic (env var configurations, k8s object creation, or whatever else) that needs to be done for this workload type. This way we can avoid intermingling the core JobSet controller logic with logic that is specific to particular workload types.
That's awesome! That last paragraph is literally the design of the metrics-operator - the idea was to use JobSet like legos, adding on whatever additional volumes / commands/ features are needed (these were called addons and the initial metrics themselves were JobSet that could be customized. And it worked pretty well, although I probably went a bit too deep in terms of wanting to play with an interface design. Please ping me if there is something fun to collaborate on.
Actually I just realized @ahg-g was literally sitting on the stage for my talk, so he saw (some of the high level) description about the need for easy to deploy common patterns for jobs. Really excited that you are going to work on this, and please include me if you are able.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues.
This bot triages un-triaged issues according to the following rules:
lifecycle/stale
is appliedlifecycle/stale
was applied, lifecycle/rotten
is appliedlifecycle/rotten
was applied, the issue is closedYou can:
/remove-lifecycle stale
/close
Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
/remove-lifecycle stale
UPDATE: see https://github.com/kubernetes-sigs/jobset/issues/146#issuecomment-1912813909 for latest update to the goals of this issue
What would you like to be added: Official, tested support for MPI jobs. Some things we need to decide:
These questions can also be addressed more broadly in the design for configuration defaulting for certain workload types (PyTorch job, TensorFlow job, MPI job, etc.)
Why is this needed: MPI is a popular parallel computing paradigm with implementations such as OpenMPI, IntelMPI that is commonly used for HPC jobs, and can be modeled as JobSet (this was prototyped and confirmed during the JobSet research and prototyping phase).