Support for MPI jobs (JobSet extensions)

danielvegamyhre commented 1 year ago

UPDATE: see https://github.com/kubernetes-sigs/jobset/issues/146#issuecomment-1912813909 for latest update to the goals of this issue

What would you like to be added: Official, tested support for MPI jobs. Some things we need to decide:

Will the user or JobSet controller be responsible for creating the ssh config (Secrets)?
Will the user or JobSet controller be responsible for setting up the environment variables configuring slots per host, etc?

These questions can also be addressed more broadly in the design for configuration defaulting for certain workload types (PyTorch job, TensorFlow job, MPI job, etc.)

Why is this needed: MPI is a popular parallel computing paradigm with implementations such as OpenMPI, IntelMPI that is commonly used for HPC jobs, and can be modeled as JobSet (this was prototyped and confirmed during the JobSet research and prototyping phase).

kannon92 commented 11 months ago

Hey @vsoch. Have you got a working example with MPI and JobSet? I know you are doing some benchmarks for K8s/HPC so was just curious if you got MPI working with jobSet.

vsoch commented 11 months ago

Hey @vsoch. Have you got a working example with MPI and JobSet? I know you are doing some benchmarks for K8s/HPC so was just curious if you got MPI working with jobSet.

I have several! I have a prototype of the Flux Operator with JobSet (but it's not easy to inspect) and just this weekend I put together OSU benchmarks and (a private app) called Netmark, both which are MPI based. Here is the little description / registry page, and tutorials to each (for Python and command line):

The objects are created by the operator, so they aren't explicitly there. Do you want me to throw one together (preferences?) and I can send you all the YAML for it?

kannon92 commented 11 months ago

I'd love an example with YAML.

vsoch commented 11 months ago

okey doke! Here is a quick one! This is an older variant of what the operator makes but it should work. Note that:

the headless service is needed for the network
the hostnames are derived (ip addresses) for the job that is given in a text file list to mpirun
the communication happens via ssh
config maps provide the entrypoints for worker/launcher
you will probably want to add resource limits/requests so there is one worker or launcher per node
you'll also want to adjust the tasks based on your resources
OSU benchmarks are point to point and intended to be run on two nodes (one launcher, one worker) this is different than many/most MPI apps. My operator checks that the number of pods is exactly 2 but there is no limit here.

Let me know if you have other question, happy to help! Also send along any metrics you like to run for kubernetes (IO/storage/performance/network etc.) I'm adding them to this operator for easy deployment! osu-benchmarks.yaml.txt

vsoch commented 11 months ago

And here is a complete updated config with the logging updates I made to the metrics operator:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: metricset-sample
  namespace: default
spec:
  failurePolicy: {}
  network:
    enableDNSHostnames: false
    subdomain: ms
  replicatedJobs:
  - name: l
    replicas: 1
    template:
      metadata:
        name: metricset-sample
        namespace: default
      spec:
        activeDeadlineSeconds: 31500000
        backoffLimit: 100
        completionMode: Indexed
        completions: 1
        parallelism: 1
        template:
          metadata:
            labels:
              app.kubernetes.io/name: metricset-sample
              cluster-name: metricset-sample
              metricset-name: metricset-sample
              namespace: default
            name: metricset-sample
            namespace: default
          spec:
            containers:
            - command:
              - /bin/bash
              - /metrics_operator/osu-launcher.sh
              image: ghcr.io/converged-computing/metric-osu-benchmark:latest
              imagePullPolicy: IfNotPresent
              name: launcher
              resources: {}
              stdin: true
              tty: true
              volumeMounts:
              - mountPath: /metrics_operator/
                name: metricset-sample
                readOnly: true
              workingDir: /opt/osu-benchmark/build.openmpi/libexec/osu-micro-benchmarks/mpi/one-sided
            restartPolicy: OnFailure
            setHostnameAsFQDN: true
            shareProcessNamespace: false
            subdomain: ms
            volumes:
            - configMap:
                items:
                - key: osu-launcher
                  mode: 511
                  path: osu-launcher.sh
                - key: osu-worker
                  mode: 511
                  path: osu-worker.sh
                name: metricset-sample
              name: metricset-sample
  - name: w
    replicas: 1
    template:
      metadata:
        name: metricset-sample
        namespace: default
      spec:
        activeDeadlineSeconds: 31500000
        backoffLimit: 100
        completionMode: Indexed
        completions: 1
        parallelism: 1
        template:
          metadata:
            labels:
              app.kubernetes.io/name: metricset-sample
              cluster-name: metricset-sample
              metricset-name: metricset-sample
              namespace: default
            name: metricset-sample
            namespace: default
          spec:
            containers:
            - command:
              - /bin/bash
              - /metrics_operator/osu-worker.sh
              image: ghcr.io/converged-computing/metric-osu-benchmark:latest
              imagePullPolicy: IfNotPresent
              name: workers
              resources: {}
              stdin: true
              tty: true
              volumeMounts:
              - mountPath: /metrics_operator/
                name: metricset-sample
                readOnly: true
              workingDir: /opt/osu-benchmark/build.openmpi/libexec/osu-micro-benchmarks/mpi/one-sided
            restartPolicy: OnFailure
            setHostnameAsFQDN: true
            shareProcessNamespace: false
            subdomain: ms
            volumes:
            - configMap:
                items:
                - key: osu-launcher
                  mode: 511
                  path: osu-launcher.sh
                - key: osu-worker
                  mode: 511
                  path: osu-worker.sh
                name: metricset-sample
              name: metricset-sample
  successPolicy:
    operator: All
    targetReplicatedJobs:
    - l
  suspend: false
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricset-sample
  namespace: default
data:
  osu-launcher: |
    #!/bin/bash
    # Who are we?
    whoami
    # Start ssh daemon
    /usr/sbin/sshd -D &

    # Allow network to ready
    echo "Sleeping for 10 seconds waiting for network..."
    sleep 10

    # Write the hosts file
    launcher=$(getent hosts metricset-sample-l-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
    worker=$(getent hosts metricset-sample-w-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
    echo "${launcher}" >> ./hostfile.txt
    echo "${worker}" >> ./hostfile.txt

    # Show metadata for run
    echo "METADATA START {\"pods\":2,\"completions\":2,\"metricName\":\"network-osu-benchmark\",\"metricDescription\":\"point to point MPI benchmarks\",\"metricType\":\"standalone\",\"metricOptions\":{\"completions\":0,\"rate\":10},\"metricListOptions\":{\"commands\":[\"osu_acc_latency\",\"osu_fop_latency\",\"osu_get_acc_latency\",\"osu_get_latency\",\"osu_put_latency\"]}}
    METADATA END"

    sleep 5
    echo METRICS OPERATOR COLLECTION START
    echo METRICS OPERATOR TIMEPOINT
    echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_acc_latency"
    mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_acc_latency
    echo METRICS OPERATOR TIMEPOINT
    echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_fop_latency"
    mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_fop_latency
    echo METRICS OPERATOR TIMEPOINT
    echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_acc_latency"
    mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_acc_latency
    echo METRICS OPERATOR TIMEPOINT
    echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_latency"
    mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_get_latency
    echo METRICS OPERATOR TIMEPOINT
    echo "mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_put_latency"
    mpirun --hostfile ./hostfile.txt --allow-run-as-root -np 2 ./osu_put_latency
    echo METRICS OPERATOR COLLECTION END
  osu-worker: |-
    #!/bin/bash
    # Who are we?
    whoami
    # Start ssh daemon
    /usr/sbin/sshd -D &

    # Allow network to ready
    echo "Sleeping for 10 seconds waiting for network..."
    sleep 10

    # Write the hosts file
    launcher=$(getent hosts metricset-sample-l-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
    worker=$(getent hosts metricset-sample-w-0-0.ms.default.svc.cluster.local | awk '{ print $1 }')
    echo "${launcher}" >> ./hostfile.txt
    echo "${worker}" >> ./hostfile.txt

    # Show metadata for run
    echo "METADATA START {\"pods\":2,\"completions\":2,\"metricName\":\"network-osu-benchmark\",\"metricDescription\":\"point to point MPI benchmarks\",\"metricType\":\"standalone\",\"metricOptions\":{\"completions\":0,\"rate\":10},\"metricListOptions\":{\"commands\":[\"osu_acc_latency\",\"osu_fop_latency\",\"osu_get_acc_latency\",\"osu_get_latency\",\"osu_put_latency\"]}}
    METADATA END"

    sleep infinity
---
apiVersion: v1
kind: Service
metadata:
  name: ms
  namespace: default
spec:
  clusterIP: None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  selector:
    metricset-name: metricset-sample
  sessionAffinity: None
  type: ClusterIP

TLDR it makes it easier to parse the log into a meaningful output, e.g., just parsed and made these plots:

https://github.com/converged-computing/operator-experiments/tree/main/google/service-timing/run14/c2-node-60-240#analyze

kannon92 commented 10 months ago

I was able to play around with your example and it worked! I do think some thoughts will be needed on how to make this more general for different number of ranks.

vsoch commented 10 months ago

Awesome! Actually that's what the metrics operator does - that particular dump of YAML above is generated from:

apiVersion: flux-framework.org/v1alpha1
kind: MetricSet
metadata:
  labels:
    app.kubernetes.io/name: metricset
    app.kubernetes.io/instance: metricset-sample
  name: metricset-sample
spec:
  # OSU benchmark is point to point and MUST be run with 2 pods
  pods: 2
  metrics:
   - name: network-osu-benchmark

But also, the OSU benchmarks do not run on more than two ranks. They will throw up on you an error message if you try. If you are interested in that, we have a tool called netmark that makes nice matrices for point to point latency across an entire cluster and are planning to make it public (hopefully) in the next few months! https://github.com/converged-computing/google-batch/blob/main/netmark/img/netmark-experiment-size-100-heatmap.pdf

k8s-triage-robot commented 5 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ahg-g commented 5 months ago

/remove-lifecycle stale

vsoch commented 5 months ago

@danielvegamyhre to give some high level comments - for this to work from JobSet, you'd need to add the same logic that we do for the metrics-operator above, or the MPI operator. At the simplest level, you need to have a main pod in the JobSet acting as a launcher (the one running the command) and then the rest are workers. They need to be configured to accept ssh. I don't think ssh is the best way to bootstrap (e.g., flux uses zeromq which we've shown to be faster) but it's a first shot. The design I'd advocate for here isn't something hard coded into JobSet, but rather an easy ability to define some pattern that can map a request to run a command to this design, and install the needed ssh stuff that is missing.

We definitely have models of this working with JobSet - several in fact, and I'm interested to hear your thoughts on such a general approach that would be provided natively (if I understand the gist of this issue correctly).

danielvegamyhre commented 5 months ago

@danielvegamyhre to give some high level comments - for this to work from JobSet, you'd need to add the same logic that we do for the metrics-operator above, or the MPI operator. At the simplest level, you need to have a main pod in the JobSet acting as a launcher (the one running the command) and then the rest are workers. They need to be configured to accept ssh. I don't think ssh is the best way to bootstrap (e.g., flux uses zeromq which we've shown to be faster) but it's a first shot. The design I'd advocate for here isn't something hard coded into JobSet, but rather an easy ability to define some pattern that can map a request to run a command to this design, and install the needed ssh stuff that is missing.

We definitely have models of this working with JobSet - several in fact, and I'm interested to hear your thoughts on such a general approach that would be provided natively (if I understand the gist of this issue correctly).

Thanks for your thoughts on this! Yeah I prototyped MPI workloads using k8s primitives last year when we were originally researching the requirements for various ML and HPC workload types, to gather the requirements for JobSet and validate the API design.

To prototype MPI using k8s primitives I used an indexed Job, headless service, Secret for ssh key, and a configMaps for hostfile values. However, coming up with a design that can support all this and keeps JobSet generic/flexible and not too bespoke, is pretty challenging. So we prioritized support for other features that were being more commonly requested by users.

Now we are planning to work on support for these more complex workload configurations this year. I discussed this with @ahg-g recently and one idea he had was to support "extensions," where the user can specify an extension in their JobSet spec, and the webhooks + controllers will have points where they call a hook for whatever extensions the JobSet has, and the hook will perform mutation logic (env var configurations, k8s object creation, or whatever else) that needs to be done for this workload type. This way we can avoid intermingling the core JobSet controller logic with logic that is specific to particular workload types.

vsoch commented 5 months ago

That's awesome! That last paragraph is literally the design of the metrics-operator - the idea was to use JobSet like legos, adding on whatever additional volumes / commands/ features are needed (these were called addons and the initial metrics themselves were JobSet that could be customized. And it worked pretty well, although I probably went a bit too deep in terms of wanting to play with an interface design. Please ping me if there is something fun to collaborate on.

vsoch commented 5 months ago

Actually I just realized @ahg-g was literally sitting on the stage for my talk, so he saw (some of the high level) description about the need for easy to deploy common patterns for jobs. Really excited that you are going to work on this, and please include me if you are able.

k8s-triage-robot commented 1 month ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue as fresh with /remove-lifecycle stale
Close this issue with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

ahg-g commented 1 month ago

/remove-lifecycle stale

kubernetes-sigs / jobset

Support for MPI jobs (JobSet extensions) #146