asahalyft opened this issue 3 years ago
It is weird, I will have a look. Thanks for the report!
cc @Jeffwan @PatrickXYS
Could you also share logs from worker pods that kept running?
sure @terrytangyuan
The source code referenced in my test YAML posted above is from the examples directory of the Horovod project itself: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_keras_mnist.py
P.S. I observe the same behavior (worker pods keep running) with the examples provided in the mpi-operator repo as well: https://github.com/kubeflow/mpi-operator/blob/master/examples/v1/tensorflow-benchmarks.yaml
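To reproduce with that example, applying it directly should be enough (the raw URL below is just the raw.githubusercontent.com form of the blob link above):

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/examples/v1/tensorflow-benchmarks.yaml
kubectl get pods -w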
Logs from the launcher pod, which completed:
(base) asaha-mbp151:exploration asaha$ kubectl logs tf2-keras-mnist-mpi-gpu-launcher-8jlf6 -n asaha
+ POD_NAME=tf2-keras-mnist-mpi-gpu-worker-0
+ [ t = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts tf2-keras-mnist-mpi-gpu-worker-0:/etc/hosts_of_nodes
+ POD_NAME=tf2-keras-mnist-mpi-gpu-worker-1
+ [ t = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts tf2-keras-mnist-mpi-gpu-worker-1:/etc/hosts_of_nodes
+ /opt/kube/kubectl exec tf2-keras-mnist-mpi-gpu-worker-0 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts && PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2374565888" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tf[1:2]-keras-mnist-mpi-gpu-launcher-8jlf6,tf[1:2]-keras-mnist-mpi-gpu-worker-0,tf[1:2]-keras-mnist-mpi-gpu-worker-1@0(3)" -mca orte_hnp_uri "2374565888.0;tcp://192.168.33.8:50987" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "2374565888.0;tcp://192.168.33.8:50987" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ /opt/kube/kubectl exec tf2-keras-mnist-mpi-gpu-worker-1 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts && PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2374565888" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tf[1:2]-keras-mnist-mpi-gpu-launcher-8jlf6,tf[1:2]-keras-mnist-mpi-gpu-worker-0,tf[1:2]-keras-mnist-mpi-gpu-worker-1@0(3)" -mca orte_hnp_uri "2374565888.0;tcp://192.168.33.8:50987" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "2374565888.0;tcp://192.168.33.8:50987" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
2020-11-26 09:43:10.768117: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.768258: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.768281: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-11-26 09:43:10.821745: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.821897: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.821922: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-11-26 09:43:11.648769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-26 09:43:11.649522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-26 09:43:11.671662: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.672400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:11.672453: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:11.674625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:11.676512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:11.676898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:11.679141: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:11.680413: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:11.683938: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.684966: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:11.685145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.685772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1b.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:11.685822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:11.686063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.686820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
2020-11-26 09:43:11.687864: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:11.689874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:11.690250: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:11.692345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:11.693626: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:11.698189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:11.698290: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.700139: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.701903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
11493376/11490434 [==============================] - 0s 0us/step
2020-11-26 09:43:12.455964: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-26 09:43:12.481294: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300010000 Hz
2020-11-26 09:43:12.483565: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53037b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.483618: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-26 09:43:12.497975: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-26 09:43:12.504248: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2020-11-26 09:43:12.504602: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4992550 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.504629: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-26 09:43:12.562221: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.563112: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x49e2c20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.563142: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-11-26 09:43:12.563341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.564093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:12.564160: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.564226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:12.564261: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:12.564300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:12.564335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:12.564374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:12.564406: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:12.564486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.565253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.566020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-11-26 09:43:12.566080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.577038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.578912: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53012a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.578941: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-11-26 09:43:12.579103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.580847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1b.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:12.580897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.580936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:12.580963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:12.580979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:12.580993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:12.581009: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:12.581024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:12.581079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.582886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.584583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-11-26 09:43:12.584636: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.634981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-26 09:43:12.635035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-11-26 09:43:12.635048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-11-26 09:43:12.635306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.636123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.636874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10798 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2020-11-26 09:43:12.652102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-26 09:43:12.652145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-11-26 09:43:12.652154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-11-26 09:43:12.652446: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.654295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.656028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10798 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
Epoch 1/24
2020-11-26 09:43:14.498415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:14.527299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:14.687915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:14.713911: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.242629). Check your callbacks.
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.242464). Check your callbacks.
250/250 [==============================] - 9s 34ms/step - loss: 0.2967 - accuracy: 0.8145
Epoch 2/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0954 - accuracy: 0.9701
Epoch 3/24
248/250 [============================>.] - ETA: 0s - loss: 0.0719 - accuracy: 0.9772
Epoch 3: finished gradual learning rate warmup to 0.002.
Epoch 3: finished gradual learning rate warmup to 0.002.
250/250 [==============================] - 8s 31ms/step - loss: 0.0725 - accuracy: 0.9773
Epoch 4/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0601 - accuracy: 0.9813
Epoch 5/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0499 - accuracy: 0.9848
Epoch 6/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0406 - accuracy: 0.9858
Epoch 7/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0413 - accuracy: 0.9866
Epoch 8/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0346 - accuracy: 0.9888
Epoch 9/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0326 - accuracy: 0.9897
Epoch 10/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0277 - accuracy: 0.9909
Epoch 11/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0248 - accuracy: 0.9908
Epoch 12/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0232 - accuracy: 0.9916
Epoch 13/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0212 - accuracy: 0.9925
Epoch 14/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0197 - accuracy: 0.9936
Epoch 15/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0215 - accuracy: 0.9930
Epoch 16/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0191 - accuracy: 0.9931
Epoch 17/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0206 - accuracy: 0.9933
Epoch 18/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0172 - accuracy: 0.9947
Epoch 19/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0173 - accuracy: 0.9950
Epoch 20/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0146 - accuracy: 0.9953
Epoch 21/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0136 - accuracy: 0.9948
Epoch 22/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0161 - accuracy: 0.9944
Epoch 23/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0161 - accuracy: 0.9945
Epoch 24/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0167 - accuracy: 0.9950
(base) asaha-mbp151:exploration asaha$
Logs from the worker pods, which still keep running: I do not see any log output from them.
(base) asaha-mbp151:exploration asaha$ kubectl get pods -n asaha
NAME READY STATUS RESTARTS AGE
tf2-keras-mnist-mpi-gpu-launcher-8jlf6 0/1 Completed 0 6m5s
tf2-keras-mnist-mpi-gpu-worker-0 1/1 Running 0 6m5s
tf2-keras-mnist-mpi-gpu-worker-1 1/1 Running 0 6m5s
(base) asaha-mbp151:exploration asaha$ kubectl logs tf2-keras-mnist-mpi-gpu-worker-0 -n asaha
(base) asaha-mbp151:exploration asaha$
@asahalyft The worker pods' status is expected; the MPIJob's status should be synced with the launcher pod. I think the main problem is caused by the sync error.
@carmark @terrytangyuan @gaocegege As I read through the controller code, I understood from https://github.com/kubeflow/mpi-operator/blob/75f424a802dafb3662bc5c76b8f3c3cb60127fac/pkg/controllers/v1/mpi_job_controller.go#L471 that this is where the syncing logic is supposed to kill the worker pods once the MPIJob has completed. However, that is not happening.
I have been able to reproduce this same error with the same YAML not only on AWS EKS but also on an on-prem K8s 1.14 cluster. Are you able to reproduce the error on your side?
It would be really helpful if you all could 👀 and help in resolving the issue.
Also, for context/completeness, I am installing only the mpi operator component of Kubeflow and not the entire Kubeflow installation.
I applied the mpi controller as follows:
kustomize build manifests-v1.1.0-branch/mpi-job/mpi-operator/overlays/application/ > lyft-mpi-operator.yaml
kubectl apply -f lyft-mpi-operator.yaml
This is the resultant lyft-mpi-operator.yaml that got generated from kustomize step.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpijobs.kubeflow.org
spec:
group: kubeflow.org
names:
kind: MPIJob
plural: mpijobs
shortNames:
- mj
- mpij
singular: mpijob
scope: Namespaced
versions:
- name: v1alpha1
schema:
openAPIV3Schema:
properties:
spec:
description: Only one of gpus, processingUnits, or replicas should be
specified
oneOf:
- properties:
gpus:
description: Valid values are 1, 2, 4, or any multiple of 8
oneOf:
- enum:
- 1
- 2
- 4
type: integer
- minimum: 8
multipleOf: 8
type: integer
title: Total number of GPUs
gpusPerNode:
description: Defaults to the number of GPUs per worker
minimum: 1
title: The maximum number of GPUs available per node
type: integer
slotsPerWorker:
description: Defaults to the number of processing units per worker
minimum: 1
title: The number of slots per worker used in hostfile
type: integer
required:
- gpus
- properties:
processingResourceType:
description: Defaults to 'nvidia.com/gpu'
enum:
- nvidia.com/gpu
- cpu
title: The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'
type: string
processingUnits:
description: Valid values are 1, 2, 4, or any multiple of 8
oneOf:
- enum:
- 1
- 2
- 4
type: integer
- minimum: 8
multipleOf: 8
type: integer
title: Total number of processing units
processingUnitsPerNode:
description: Defaults to the number of processing units per worker
minimum: 1
title: The maximum number of processing units available per node
type: integer
slotsPerWorker:
description: Defaults to the number of processing units per worker
minimum: 1
title: The number of slots per worker used in hostfile
type: integer
required:
- processingUnits
- properties:
processingResourceType:
description: Defaults to 'nvidia.com/gpu'
enum:
- nvidia.com/gpu
- cpu
title: The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'
type: string
replicas:
description: The processing resource limit should be specified for
each replica
minimum: 1
title: Total number of replicas
type: integer
slotsPerWorker:
description: Defaults to the number of processing units per worker
minimum: 1
title: The number of slots per worker used in hostfile
type: integer
required:
- replicas
title: The MPIJob spec
served: false
storage: false
- name: v1alpha2
schema:
openAPIV3Schema:
properties:
spec:
properties:
mpiReplicaSpecs:
properties:
Launcher:
properties:
replicas:
maximum: 1
minimum: 1
type: integer
Worker:
properties:
replicas:
minimum: 1
type: integer
slotsPerWorker:
minimum: 1
type: integer
served: true
storage: false
- name: v1
schema:
openAPIV3Schema:
properties:
spec:
properties:
mpiReplicaSpecs:
properties:
Launcher:
properties:
replicas:
maximum: 1
minimum: 1
type: integer
Worker:
properties:
replicas:
minimum: 1
type: integer
slotsPerWorker:
minimum: 1
type: integer
served: true
storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
namespace: kubeflow
---
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-mpijobs-admin: "true"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"
name: kubeflow-mpijobs-admin
rules: []
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-mpijobs-admin: "true"
name: kubeflow-mpijobs-edit
rules:
- apiGroups:
- kubeflow.org
resources:
- mpijobs
- mpijobs/status
verbs:
- get
- list
- watch
- create
- delete
- deletecollection
- patch
- update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
name: kubeflow-mpijobs-view
rules:
- apiGroups:
- kubeflow.org
resources:
- mpijobs
- mpijobs/status
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
rules:
- apiGroups:
- ""
resources:
- configmaps
- serviceaccounts
verbs:
- create
- list
- watch
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- pods/exec
verbs:
- create
- apiGroups:
- ""
resources:
- endpoints
verbs:
- create
- get
- update
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- rbac.authorization.k8s.io
resources:
- roles
- rolebindings
verbs:
- create
- list
- watch
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- create
- list
- update
- watch
- apiGroups:
- apps
resources:
- statefulsets
verbs:
- create
- list
- update
- watch
- apiGroups:
- batch
resources:
- jobs
verbs:
- create
- list
- update
- watch
- apiGroups:
- apiextensions.k8s.io
resources:
- customresourcedefinitions
verbs:
- create
- get
- apiGroups:
- kubeflow.org
resources:
- mpijobs
- mpijobs/finalizers
- mpijobs/status
verbs:
- '*'
- apiGroups:
- scheduling.incubator.k8s.io
- scheduling.sigs.dev
resources:
- queues
- podgroups
verbs:
- '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: mpi-operator
subjects:
- kind: ServiceAccount
name: mpi-operator
namespace: kubeflow
---
apiVersion: v1
data:
kubectl-delivery-image: docker.io/mpioperator/kubectl-delivery:latest
lock-namespace: kubeflow
kind: ConfigMap
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator-config
namespace: kubeflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
namespace: kubeflow
spec:
replicas: 1
selector:
matchLabels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
spec:
containers:
- args:
- -alsologtostderr
- --lock-namespace
- kubeflow
- --kubectl-delivery-image
- docker.io/mpioperator/kubectl-delivery:latest
image: docker.io/mpioperator/mpi-operator:latest
imagePullPolicy: Always
name: mpi-operator
serviceAccountName: mpi-operator
---
apiVersion: app.k8s.io/v1beta1
kind: Application
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
name: mpi-operator
spec:
componentKinds:
- group: apps
kind: Deployment
- group: core
kind: ServiceAccount
- group: kubeflow.org
kind: MPIJob
descriptor:
description: Mpi-operator allows users to create and manage the "MPIJob" custom
resource.
keywords:
- mpijob
- mpi-operator
links:
- description: About
url: https://github.com/kubeflow/mpi-operator
maintainers:
- email: rong.ou@gmail.com
name: Rong Ou
- email: terrytangyuan@gmail.com
name: Yuan Tang
- email: stp.abhi@gmail.com
name: Abhilash Pallerlamudi
owners:
- email: rong.ou@gmail.com
name: Rong Ou
- email: terrytangyuan@gmail.com
name: Yuan Tang
type: mpi-operator
version: v1
selector:
matchLabels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/instance: mpi-operator
app.kubernetes.io/managed-by: kfctl
app.kubernetes.io/name: mpi-operator
app.kubernetes.io/part-of: kubeflow
app.kubernetes.io/version: v1.0
@terrytangyuan Do you know which commit mpioperator/mpi-operator:latest is based on?
It should be based on the latest commit https://github.com/kubeflow/mpi-operator/commit/75f424a802dafb3662bc5c76b8f3c3cb60127fac
@carmark @terrytangyuan @gaocegege Following up on this thread to check if you could 👀 and help in investigating the issue.
I met the same issue.
@asahalyft I think the main point of the error logs is "the server could not find the requested resource (put mpijobs.kubeflow.org ...)". It means the CRD manifest lacks the status subresource. After I added the subresource to the CRD and reran the job, the issue was resolved for me:
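Roughly, a sketch of the change (against the apiextensions.k8s.io/v1beta1 CRD from the manifest above; the stanza enables the /status endpoint that the failing PUT targets):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: mpijobs.kubeflow.org
spec:
  group: kubeflow.org
  # ...names, scope, and versions exactly as in the generated manifest above...
  subresources:
    status: {}   # serve .../mpijobs/<name>/status so status updates succeed

You can confirm the subresource is served with: kubectl get crd mpijobs.kubeflow.org -o yaml | grep -A1 subresources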
@asahalyft Could you please try the suggestion of @qifengz ?
Also, for context/completeness, I am installing only the mpi operator component of Kubeflow and not the entire Kubeflow installation.
Update the CRD and ClusterRole to v0.2.3, then try again.
Thanks @qifengz .
@carmark @chongchuanbing Weird. I have moved away from the Kubeflow kustomize route and am applying the operator YAML directly now. The feedback on this issue has been extremely slow, so I had to downgrade to mpi-operator v1alpha2 and applied https://github.com/kubeflow/mpi-operator/tree/master/deploy/v1alpha2 to make some progress on my end. The v1 operator from https://github.com/kubeflow/mpi-operator/tree/master/deploy/v1 still did not work for me. Both CRD defs, from v1 and v1alpha2, have the section:

subresources:
  status: {}

Are there any clear benefits of using v1 over v1alpha2?
What's the error with v1? I ran it OK.
@asahalyft I did test it locally with v1, installed via the YAML file, and it works as expected.
Yes, you can switch to v1alpha2, the longer version.
I'm encountering this issue; is there a quick fix or workaround for it?
Please provide more information: the version you are running, the API version you are using, etc.
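For example, something along these lines would show it (generic commands; adjust the namespace and deployment name to your install):

kubectl api-versions | grep kubeflow
kubectl version
kubectl -n mpi-operator get deploy mpi-operator -o jsonpath='{.spec.template.spec.containers[0].image}'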
Let me know what else to query...
kubeflow.org/v1
kubeflow.org/v1alpha1
kubeflow.org/v1alpha2
kubeflow.org/v1beta1

Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:03:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
As for my job manifest:
apiVersion: kubeflow.org/v1beta1
kind: MPIJob
metadata: ...
Can you confirm which version of the mpi-operator you are running? I suppose you can have a look at the deployment image.
Via k get all -n mpi-operator -oyaml > mpi-operator-k-get-all.yaml: the operator image is .../docker.io/mpi-operator:v0.2.3
Are you getting the same kind of error in the logs?
error syncing 'asaha/tf2-keras-mnist-mpi-gpu': the server could not find the requested resource (put mpijobs.kubeflow.org tf2-keras-mnist-mpi-gpu)
If so, I believe the manifest here: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml shouldn't have the problem mentioned here https://github.com/kubeflow/mpi-operator/issues/297#issuecomment-758612853
Couldn't find that line. Below is a chunk of the mpi pod log.
I0923 17:33:45.775485 1 server.go:88] Using cluster scoped operator
I0923 17:33:45.775535 1 server.go:94] [API Version: v1alpha2 Version: v0.2.2 Git SHA: aa96794299fa336f2132fb15fe98cc4b7f1d2599 Built: 2020-05-19 15:51:49 Go Version: go1.13.6 Go OS/Arch: linux/amd64]
I0923 17:33:45.775550 1 server.go:97] Server options: &{Kubeconfig: MasterURL: KubectlDeliveryImage:dtr.thefacebook.com/docker.io/kubectl-delivery:v0.2.3 Threadiness:2 MonitoringPort:0 PrintVersion:false GangSchedulingName: Namespace: LockNamespace:mpi-operator}
W0923 17:33:45.775614 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0923 17:33:45.862667 1 leaderelection.go:235] attempting to acquire leader lease mpi-operator/mpi-operator...
I0923 17:33:45.862685 1 server.go:204] Start listening to 8080 for health check
I0923 17:33:45.879171 1 leaderelection.go:245] successfully acquired lease mpi-operator/mpi-operator
I0923 17:33:45.879360 1 server.go:242] Leading started
I0923 17:33:45.879321 1 event.go:258] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"mpi-operator", Name:"mpi-operator", UID:"259796c3-a574-4026-abcd-fb5b72ea789a", APIVersion:"v1", ResourceVersion:"116386418", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' mpi-operator-6d7f5ff98c-v9dz6_69fcf4a1-a96e-45d3-b00c-f783074a903a became leader
I0923 17:33:45.879731 1 mpi_job_controller.go:219] Setting up event handlers
I0923 17:33:45.879830 1 mpi_job_controller.go:353] Starting MPIJob controller
I0923 17:33:45.879857 1 mpi_job_controller.go:356] Waiting for informer caches to sync
I0923 17:33:46.380088 1 mpi_job_controller.go:366] Starting workers
I0923 17:33:46.380110 1 mpi_job_controller.go:372] Started workers
I0923 17:33:46.380426 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"team", Name:"mpiexampledemo28", UID:"8e2a62c7-149b-44cd-a8e0-0a1362263ec2", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"116149843", FieldPath:""}): type: 'Normal' reason: 'MPIJobSucceeded' MPIJob team/mpiexampledemo28 successfully completed.
I0923 17:33:46.380515 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"team", Name:"tensorflow-benchmarks", UID:"3aab9e68-4915-4e54-8a2b-e438332967a8", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"115928376", FieldPath:""}): type: 'Warning' reason: 'MPIJobFailed' MPIJob team/tensorflow-benchmarks has failed
I0923 17:33:46.383235 1 mpi_job_controller.go:447] Finished syncing job "team/tensorflow-benchmarks" (3.059812ms)
E0923 17:33:46.383283 1 mpi_job_controller.go:434] error syncing 'team/tensorflow-benchmarks': mpijobs.kubeflow.org "tensorflow-benchmarks" not found
error syncing 'team/tensorflow-benchmarks': mpijobs.kubeflow.org "tensorflow-benchmarks" not found
Uhm.... And the job still exists?
Not sure about the one in the default namespace. Here's a more current one. Log from the MPI-Operator pod:
I1006 23:24:18.960746 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (70.336037ms)
E1006 23:24:18.960789 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:24:18.961804 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (987.295µs)
E1006 23:24:18.961843 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:24:19.601510 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (1.515862ms)
E1006 23:24:19.601580 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:24:40.083734 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (1.978215ms)
E1006 23:24:40.083784 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:25:36.164449 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"tensorflow-benchmarks", UID:"2d778959-dc12-44b2-9181-6757964ffa55", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"123245452", FieldPath:""}): type: 'Warning' reason: 'MPIJobFailed' MPIJob default/tensorflow-benchmarks has failed
I1006 23:25:36.169236 1 mpi_job_controller.go:447] Finished syncing job "default/tensorflow-benchmarks" (4.971223ms)
E1006 23:25:36.169283 1 mpi_job_controller.go:434] error syncing 'default/tensorflow-benchmarks': mpijobs.kubeflow.org "tensorflow-benchmarks" not found
I1006 23:27:23.924141 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"team", Name:"mpiexampledemo44", UID:"979808dc-f477-426e-992d-75a43ea9cc2c", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"124469782", FieldPath:""}): type: 'Normal' reason: 'MPIJobSucceeded' MPIJob team/mpiexampledemo44 successfully completed.
I1006 23:27:23.926522 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (2.542902ms)
E1006 23:27:23.926596 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
=========================================================================================================================================
Other info:

kubectl get pods
NAME                              READY   STATUS      RESTARTS   AGE
...
mpiexampledemo44-launcher-9vbvv   0/1     Completed   0          17m
mpiexampledemo44-worker-0         1/1     Running     0          17m
mpiexampledemo44-worker-1         1/1     Running     0          17m
mpiexampledemo44-worker-2         1/1     Running     0          17m

kubectl get mpijobs
NAME               AGE
mpiexampledemo44   33m
So something external to the controller seems to be removing the mpijob, and perhaps it's doing that with cascade=false.
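For illustration, an orphaning delete would look like this (a hypothetical command against this cluster; kubectl 1.20 spells the flag --cascade=false, newer releases use --cascade=orphan) and would leave the launcher and worker pods behind exactly as shown above:

kubectl delete mpijob mpiexampledemo44 -n team --cascade=false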
Hi asahalyft, in my case the pods were cleaned up successfully with this. You can try setting:

runPolicy:
  cleanPodPolicy: Running

e.g.:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
name: tf2-keras-mnist-mpi-gpu
spec:
slotsPerWorker: 1
runPolicy:
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
containers:
- image: docker.io/horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu
name: keras-mnist-mpi-launcher
command:
- mpirun
args:
- -np
- "2"
- --allow-run-as-root
- -bind-to
- none
- -map-by
- slot
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- python
- /examples/tensorflow2_keras_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
Worker:
replicas: 2
template:
spec:
containers:
- image: docker.io/horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu
name: keras-mnist-mpi-worker
resources:
limits:
nvidia.com/gpu: 1
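(For what it's worth, my understanding of cleanPodPolicy: Running is that it tells the controller to delete only the pods that are still running once the job finishes, which matches the workers' state here.)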
I met the same issue, and the solution from @hongyonggan doesn't work for me.
I applied the training operator as follows:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
kubectl logs -f training-operator-5cfdcb7d9d-nvf57 -n kubeflow
time="2022-04-02T15:07:31Z" level=info msg="MPIJob default/tensorflow2-keras-mnist-elastic is created."
time="2022-04-02T15:07:31Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:31.826Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "SuccessfulCreatePod", "message": "Created worker pod: tensorflow2-keras-mnist-elastic-worker-0"}
2022-04-02T15:07:31.851Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "SuccessfulCreatePod", "message": "Created worker pod: tensorflow2-keras-mnist-elastic-worker-1"}
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "MPIJobRunning", "message": "launcher pod created success: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:31Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:31Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (93.538317ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:31Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:31Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:32Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (97.54632ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:32Z" level=warning msg="Reconcile MPIJob error Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"
2022-04-02T15:07:32.050Z ERROR controller-runtime.manager.controller.mpijob-controller Reconciler error {"name": "tensorflow2-keras-mnist-elastic", "namespace": "default", "error": "Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-04-02T15:07:32Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:32Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:32Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.057Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.057Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:32Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.310Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (14.157662ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
2022-04-02T15:07:34.829Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.842Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (9.422552ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.859Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.859Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.859Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.860Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.860Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
2022-04-02T15:07:34.860Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (8.803506ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=warning msg="Reconcile MPIJob error Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"
2022-04-02T15:07:34.869Z ERROR controller-runtime.manager.controller.mpijob-controller Reconciler error {"name": "tensorflow2-keras-mnist-elastic", "namespace": "default", "error": "Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.869Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.870Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.870Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.870Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:35Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:35Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.252Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.252Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
2022-04-02T15:07:35.252Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:35Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:37Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:37.644Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.644Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.644Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.645Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.645Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.645Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:37Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:37Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:37Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:38Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.643Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.643Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:38Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
2022-04-02T15:07:38.643Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
time="2022-04-02T15:07:38Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (15.886545ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:38Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:38.659Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.659Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
time="2022-04-02T15:07:38Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:13:49Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
time="2022-04-02T15:13:49Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=0, running=0, succeeded=1 , failed=0"
time="2022-04-02T15:13:49Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is successfully completed."
2022-04-02T15:13:49.503Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobSucceeded", "message": "MPIJob default/tensorflow2-keras-mnist-elastic successfully completed."}
2022-04-02T15:13:49.504Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobSucceeded", "message": "MPIJob default/tensorflow2-keras-mnist-elastic successfully completed."}
2022-04-02T15:13:49.504Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "JobSucceeded", "message": "MPIJob tensorflow2-keras-mnist-elastic is successfully completed."}
time="2022-04-02T15:13:49Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (30.703899ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:13:49Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
I met the same issue, and the solution from @hongyonggan doesn't work for me.
I applied the training operator as follows:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
kubectl logs -f training-operator-5cfdcb7d9d-nvf57 -n kubeflow
Resolved by updating the manifests with the latest image tag (8c4323194a09b3cabd36248cc172493006f71b75), since the issue is fixed by https://github.com/kubeflow/training-operator/pull/1550
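For anyone pinning the operator manually before the manifests catch up, a minimal sketch (the deployment name, container name, and image repository/tag format are assumptions; verify them against the manifests deployed in your cluster):

# Point the operator deployment at the fixed image tag
kubectl set image deployment/training-operator -n kubeflow training-operator=kubeflow/training-operator:8c4323194a09b3cabd36248cc172493006f71b75

# Confirm the new image rolled out
kubectl rollout status deployment/training-operator -n kubeflow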
I am testing out the TFJob and MPIJob operators from the kubeflow/manifests v1.1.0 branch on AWS EKS (Kubernetes 1.14). I am able to schedule TFJobs and MPIJobs successfully, and these jobs also complete fine; I have verified this by watching the Kubernetes events, which show the MPIJob completing successfully.
However, I observe that the worker pods keep running even after the launcher pod and the MPIJob have completed. I see the same behavior for TFJob.
I applied the following YAML for the MPIJob.
I collected the MPI controller logs, which clearly show that the MPIJob has indeed completed but that the controller is failing to sync that information.
Observe the pods in my namespace:
Observe the MPI controller logs in the kubeflow namespace:
Describe the MPIJob:
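The outputs for the three views above were not preserved here; for reference, commands along these lines reproduce them (the namespace, job name, and operator deployment name are placeholders/assumptions):

kubectl get pods -n <my-namespace>
kubectl logs deployment/mpi-operator -n kubeflow
kubectl describe mpijob <mpijob-name> -n <my-namespace>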