kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Issue connecting to nodes that are not within the same cluster #658

yxusnapchat commented 1 week ago

Hi team, I have an example based on the latest NVIDIA image nvcr.io/nvidia/tensorflow:24.07-tf2-py3, but I run the MPI job across different nodes. However, it complains that the launcher could not identify the worker. Is it supported to have the launcher and workers running on separate nodes?

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: xxx
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: xxx
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            # resources:
            #   limits:
            #     nvidia.com/gpu: 1  # Request 1 GPU
            #   requests:
            #     nvidia.com/gpu: 1  # Optionally set requests equal to limits
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - --allow-run-as-root
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - /nvidia-examples/movielens-1m-keras-with-horovod.py
            - --mode=train
            - --model_dir="./model_dir" 
            - --export_dir="./export_dir"
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: xxx
            name: mpi-worker
            # env:
            #   - name: TF_USE_LEGACY_KERAS
            #     value: "1" 
            resources:
              limits:
                nvidia.com/gpu: 1  # Request 1 GPU
              requests:
                nvidia.com/gpu: 1  # Optionally set requests equal to limits

Also, I am curious where in the code the workers get started. Thanks!

alculquicondor commented 6 days ago

Yes, it is supported.

You can find examples here: https://github.com/kubeflow/mpi-operator/tree/master/examples/v2beta1
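
On the second question: the workers do not run the training script directly; the launcher's mpirun starts the ranks on the workers (over SSH in the v2 operator, using the hostfile and keys the operator mounts into the pods). Below is a minimal sketch roughly along the lines of the pi example in that directory, assuming OpenMPI; the image names and the program path are placeholders, not something specific to your setup:

apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: pi-sketch                      # illustrative name
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi  # placeholder; needs mpirun and an ssh client
            name: mpi-launcher
            command:
            - mpirun
            args:
            - -np
            - "2"
            - /home/mpiuser/pi         # placeholder program
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/mpi-pi  # placeholder; needs an ssh server
            name: mpi-worker
            # The worker pods only run an SSH daemon; the actual MPI ranks are
            # launched on them by the launcher's mpirun over SSH.
            command:
            - /usr/sbin/sshd
            args:
            - -De

The launcher and workers do not need to share a node; the launcher only needs to reach the workers' SSH port, which it does through the Service the operator creates for the job.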

alculquicondor commented 6 days ago

Please share more details about the error messages in both the launcher and worker pods.
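
With the default pod naming that would be something like kubectl logs xxx-launcher and kubectl logs xxx-worker-0 / xxx-worker-1 (the exact names depend on your job name), plus kubectl describe pod output for any pod that is stuck outside the Running state.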