asahalyft opened this issue 3 years ago
It is weird, I will have a look. Thanks for the report!
cc @Jeffwan @PatrickXYS
Could you also share logs from worker pods that kept running?
sure @terrytangyuan
The source code referenced in my test YAML posted above is from the examples directory of the Horovod project itself: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_keras_mnist.py
P.S. I observe the same behavior (worker pods keep running) with the examples provided in the mpi-operator repo as well: https://github.com/kubeflow/mpi-operator/blob/master/examples/v1/tensorflow-benchmarks.yaml
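To reproduce with that example, applying it directly should be enough (the raw URL below is just the raw.githubusercontent.com form of the blob link above):

kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/examples/v1/tensorflow-benchmarks.yaml
kubectl get pods -w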
Logs from the launcher pod, which completed:
(base) asaha-mbp151:exploration asaha$ kubectl logs tf2-keras-mnist-mpi-gpu-launcher-8jlf6 -n asaha
+ POD_NAME=tf2-keras-mnist-mpi-gpu-worker-0
+ [ t = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts tf2-keras-mnist-mpi-gpu-worker-0:/etc/hosts_of_nodes
+ POD_NAME=tf2-keras-mnist-mpi-gpu-worker-1
+ [ t = - ]
+ shift
+ /opt/kube/kubectl cp /opt/kube/hosts tf2-keras-mnist-mpi-gpu-worker-1:/etc/hosts_of_nodes
+ /opt/kube/kubectl exec tf2-keras-mnist-mpi-gpu-worker-0 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts && PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2374565888" -mca ess_base_vpid 1 -mca ess_base_num_procs "3" -mca orte_node_regex "tf[1:2]-keras-mnist-mpi-gpu-launcher-8jlf6,tf[1:2]-keras-mnist-mpi-gpu-worker-0,tf[1:2]-keras-mnist-mpi-gpu-worker-1@0(3)" -mca orte_hnp_uri "2374565888.0;tcp://192.168.33.8:50987" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "2374565888.0;tcp://192.168.33.8:50987" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
+ /opt/kube/kubectl exec tf2-keras-mnist-mpi-gpu-worker-1 -- /bin/sh -c cat /etc/hosts_of_nodes >> /etc/hosts && PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2374565888" -mca ess_base_vpid 2 -mca ess_base_num_procs "3" -mca orte_node_regex "tf[1:2]-keras-mnist-mpi-gpu-launcher-8jlf6,tf[1:2]-keras-mnist-mpi-gpu-worker-0,tf[1:2]-keras-mnist-mpi-gpu-worker-1@0(3)" -mca orte_hnp_uri "2374565888.0;tcp://192.168.33.8:50987" -mca pml "ob1" -mca btl "^openib" -mca plm "rsh" --tree-spawn -mca orte_parent_uri "2374565888.0;tcp://192.168.33.8:50987" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca hwloc_base_binding_policy "none" -mca rmaps_base_mapping_policy "slot" -mca pmix "^s1,s2,cray,isolated"
2020-11-26 09:43:10.768117: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.768258: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.768281: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-11-26 09:43:10.821745: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.821897: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
2020-11-26 09:43:10.821922: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-11-26 09:43:11.648769: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-26 09:43:11.649522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-11-26 09:43:11.671662: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.672400: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:11.672453: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:11.674625: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:11.676512: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:11.676898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:11.679141: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:11.680413: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:11.683938: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.684966: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:11.685145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.685772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1b.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:11.685822: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:11.686063: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.686820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
2020-11-26 09:43:11.687864: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:11.689874: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:11.690250: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:11.692345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:11.693626: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:11.698189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:11.698290: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.700139: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:11.701903: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 0s 0us/step
11493376/11490434 [==============================] - 0s 0us/step
2020-11-26 09:43:12.455964: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-26 09:43:12.481294: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300010000 Hz
2020-11-26 09:43:12.483565: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53037b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.483618: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-26 09:43:12.497975: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-11-26 09:43:12.504248: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300060000 Hz
2020-11-26 09:43:12.504602: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4992550 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.504629: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-11-26 09:43:12.562221: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.563112: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x49e2c20 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.563142: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-11-26 09:43:12.563341: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.564093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:12.564160: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.564226: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:12.564261: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:12.564300: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:12.564335: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:12.564374: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:12.564406: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:12.564486: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.565253: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.566020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-11-26 09:43:12.566080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.577038: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.578912: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x53012a0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-11-26 09:43:12.578941: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-11-26 09:43:12.579103: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.580847: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties:
pciBusID: 0000:00:1b.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-11-26 09:43:12.580897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.580936: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:12.580963: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-11-26 09:43:12.580979: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-11-26 09:43:12.580993: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-11-26 09:43:12.581009: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-11-26 09:43:12.581024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:12.581079: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.582886: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.584583: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-11-26 09:43:12.584636: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-11-26 09:43:12.634981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-26 09:43:12.635035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-11-26 09:43:12.635048: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-11-26 09:43:12.635306: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.636123: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.636874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10798 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2020-11-26 09:43:12.652102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-26 09:43:12.652145: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] 0
2020-11-26 09:43:12.652154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0: N
2020-11-26 09:43:12.652446: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.654295: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-26 09:43:12.656028: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10798 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1b.0, compute capability: 3.7)
Epoch 1/24
2020-11-26 09:43:14.498415: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:14.527299: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-11-26 09:43:14.687915: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-11-26 09:43:14.713911: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.242629). Check your callbacks.
WARNING:tensorflow:Method (on_train_batch_end) is slow compared to the batch update (0.242464). Check your callbacks.
250/250 [==============================] - 9s 34ms/step - loss: 0.2967 - accuracy: 0.8145
Epoch 2/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0954 - accuracy: 0.9701
Epoch 3/24
248/250 [============================>.] - ETA: 0s - loss: 0.0719 - accuracy: 0.9772
Epoch 3: finished gradual learning rate warmup to 0.002.
Epoch 3: finished gradual learning rate warmup to 0.002.
250/250 [==============================] - 8s 31ms/step - loss: 0.0725 - accuracy: 0.9773
Epoch 4/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0601 - accuracy: 0.9813
Epoch 5/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0499 - accuracy: 0.9848
Epoch 6/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0406 - accuracy: 0.9858
Epoch 7/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0413 - accuracy: 0.9866
Epoch 8/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0346 - accuracy: 0.9888
Epoch 9/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0326 - accuracy: 0.9897
Epoch 10/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0277 - accuracy: 0.9909
Epoch 11/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0248 - accuracy: 0.9908
Epoch 12/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0232 - accuracy: 0.9916
Epoch 13/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0212 - accuracy: 0.9925
Epoch 14/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0197 - accuracy: 0.9936
Epoch 15/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0215 - accuracy: 0.9930
Epoch 16/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0191 - accuracy: 0.9931
Epoch 17/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0206 - accuracy: 0.9933
Epoch 18/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0172 - accuracy: 0.9947
Epoch 19/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0173 - accuracy: 0.9950
Epoch 20/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0146 - accuracy: 0.9953
Epoch 21/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0136 - accuracy: 0.9948
Epoch 22/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0161 - accuracy: 0.9944
Epoch 23/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0161 - accuracy: 0.9945
Epoch 24/24
250/250 [==============================] - 8s 31ms/step - loss: 0.0167 - accuracy: 0.9950
(base) asaha-mbp151:exploration asaha$
Logs from the worker pods, which still keep running: I do not see any log output from them.
(base) asaha-mbp151:exploration asaha$ kubectl get pods -n asaha
NAME READY STATUS RESTARTS AGE
tf2-keras-mnist-mpi-gpu-launcher-8jlf6 0/1 Completed 0 6m5s
tf2-keras-mnist-mpi-gpu-worker-0 1/1 Running 0 6m5s
tf2-keras-mnist-mpi-gpu-worker-1 1/1 Running 0 6m5s
(base) asaha-mbp151:exploration asaha$ kubectl logs tf2-keras-mnist-mpi-gpu-worker-0 -n asaha
(base) asaha-mbp151:exploration asaha$
@asahalyft The worker pods' status is expected; the MPIJob's status should be synced with the launcher pod. I think the main problem is caused by the sync error.
@carmark @terrytangyuan @gaocegege As I read through the controller code, I understood from https://github.com/kubeflow/mpi-operator/blob/75f424a802dafb3662bc5c76b8f3c3cb60127fac/pkg/controllers/v1/mpi_job_controller.go#L471 that this is where the syncing logic is supposed to kill the worker pods once the MPIJob has completed. However, that is not happening.
I have been able to reproduce this same error with the same YAML not only on AWS EKS but also on an on-prem K8s 1.14 cluster. Are you able to reproduce the error on your side?
It would be really helpful if you all could 👀 and help in resolving the issue.
Also, for context/completeness, I am installing only the mpi operator component of Kubeflow and not the entire Kubeflow installation.
I applied the mpi controller as follows:
kustomize build manifests-v1.1.0-branch/mpi-job/mpi-operator/overlays/application/ > lyft-mpi-operator.yaml
kubectl apply -f lyft-mpi-operator.yaml
This is the resultant lyft-mpi-operator.yaml that got generated from kustomize step.
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpijobs.kubeflow.org
spec:
group: kubeflow.org
names:
kind: MPIJob
plural: mpijobs
shortNames:
- mj
- mpij
singular: mpijob
scope: Namespaced
versions:
- name: v1alpha1
schema:
openAPIV3Schema:
properties:
spec:
description: Only one of gpus, processingUnits, or replicas should be
specified
oneOf:
- properties:
gpus:
description: Valid values are 1, 2, 4, or any multiple of 8
oneOf:
- enum:
- 1
- 2
- 4
type: integer
- minimum: 8
multipleOf: 8
type: integer
title: Total number of GPUs
gpusPerNode:
description: Defaults to the number of GPUs per worker
minimum: 1
title: The maximum number of GPUs available per node
type: integer
slotsPerWorker:
description: Defaults to the number of processing units per worker
minimum: 1
title: The number of slots per worker used in hostfile
type: integer
required:
- gpus
- properties:
processingResourceType:
description: Defaults to 'nvidia.com/gpu'
enum:
- nvidia.com/gpu
- cpu
title: The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'
type: string
processingUnits:
description: Valid values are 1, 2, 4, or any multiple of 8
oneOf:
- enum:
- 1
- 2
- 4
type: integer
- minimum: 8
multipleOf: 8
type: integer
title: Total number of processing units
processingUnitsPerNode:
description: Defaults to the number of processing units per worker
minimum: 1
title: The maximum number of processing units available per node
type: integer
slotsPerWorker:
description: Defaults to the number of processing units per worker
minimum: 1
title: The number of slots per worker used in hostfile
type: integer
required:
- processingUnits
- properties:
processingResourceType:
description: Defaults to 'nvidia.com/gpu'
enum:
- nvidia.com/gpu
- cpu
title: The processing resource type, e.g. 'nvidia.com/gpu' or 'cpu'
type: string
replicas:
description: The processing resource limit should be specified for
each replica
minimum: 1
title: Total number of replicas
type: integer
slotsPerWorker:
description: Defaults to the number of processing units per worker
minimum: 1
title: The number of slots per worker used in hostfile
type: integer
required:
- replicas
title: The MPIJob spec
served: false
storage: false
- name: v1alpha2
schema:
openAPIV3Schema:
properties:
spec:
properties:
mpiReplicaSpecs:
properties:
Launcher:
properties:
replicas:
maximum: 1
minimum: 1
type: integer
Worker:
properties:
replicas:
minimum: 1
type: integer
slotsPerWorker:
minimum: 1
type: integer
served: true
storage: false
- name: v1
schema:
openAPIV3Schema:
properties:
spec:
properties:
mpiReplicaSpecs:
properties:
Launcher:
properties:
replicas:
maximum: 1
minimum: 1
type: integer
Worker:
properties:
replicas:
minimum: 1
type: integer
slotsPerWorker:
minimum: 1
type: integer
served: true
storage: true
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
namespace: kubeflow
---
aggregationRule:
clusterRoleSelectors:
- matchLabels:
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-mpijobs-admin: "true"
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-admin: "true"
name: kubeflow-mpijobs-admin
rules: []
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-edit: "true"
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-mpijobs-admin: "true"
name: kubeflow-mpijobs-edit
rules:
- apiGroups:
- kubeflow.org
resources:
- mpijobs
- mpijobs/status
verbs:
- get
- list
- watch
- create
- delete
- deletecollection
- patch
- update
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
name: kubeflow-mpijobs-view
rules:
- apiGroups:
- kubeflow.org
resources:
- mpijobs
- mpijobs/status
verbs:
- get
- list
- watch
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
rules:
- apiGroups:
- ""
resources:
- configmaps
- serviceaccounts
verbs:
- create
- list
- watch
- apiGroups:
- ""
resources:
- pods
verbs:
- get
- list
- watch
- apiGroups:
- ""
resources:
- pods/exec
verbs:
- create
- apiGroups:
- ""
resources:
- endpoints
verbs:
- create
- get
- update
- apiGroups:
- ""
resources:
- events
verbs:
- create
- patch
- apiGroups:
- rbac.authorization.k8s.io
resources:
- roles
- rolebindings
verbs:
- create
- list
- watch
- apiGroups:
- policy
resources:
- poddisruptionbudgets
verbs:
- create
- list
- update
- watch
- apiGroups:
- apps
resources:
- statefulsets
verbs:
- create
- list
- update
- watch
- apiGroups:
- batch
resources:
- jobs
verbs:
- create
- list
- update
- watch
- apiGroups:
- apiextensions.k8s.io
resources:
- customresourcedefinitions
verbs:
- create
- get
- apiGroups:
- kubeflow.org
resources:
- mpijobs
- mpijobs/finalizers
- mpijobs/status
verbs:
- '*'
- apiGroups:
- scheduling.incubator.k8s.io
- scheduling.sigs.dev
resources:
- queues
- podgroups
verbs:
- '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: mpi-operator
subjects:
- kind: ServiceAccount
name: mpi-operator
namespace: kubeflow
---
apiVersion: v1
data:
kubectl-delivery-image: docker.io/mpioperator/kubectl-delivery:latest
lock-namespace: kubeflow
kind: ConfigMap
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator-config
namespace: kubeflow
---
apiVersion: apps/v1
kind: Deployment
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
name: mpi-operator
namespace: kubeflow
spec:
replicas: 1
selector:
matchLabels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
template:
metadata:
annotations:
sidecar.istio.io/inject: "false"
labels:
app: mpi-operator
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
kustomize.component: mpi-operator
spec:
containers:
- args:
- -alsologtostderr
- --lock-namespace
- kubeflow
- --kubectl-delivery-image
- docker.io/mpioperator/kubectl-delivery:latest
image: docker.io/mpioperator/mpi-operator:latest
imagePullPolicy: Always
name: mpi-operator
serviceAccountName: mpi-operator
---
apiVersion: app.k8s.io/v1beta1
kind: Application
metadata:
labels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/name: mpi-operator
name: mpi-operator
spec:
componentKinds:
- group: apps
kind: Deployment
- group: core
kind: ServiceAccount
- group: kubeflow.org
kind: MPIJob
descriptor:
description: Mpi-operator allows users to create and manage the "MPIJob" custom
resource.
keywords:
- mpijob
- mpi-operator
links:
- description: About
url: https://github.com/kubeflow/mpi-operator
maintainers:
- email: rong.ou@gmail.com
name: Rong Ou
- email: terrytangyuan@gmail.com
name: Yuan Tang
- email: stp.abhi@gmail.com
name: Abhilash Pallerlamudi
owners:
- email: rong.ou@gmail.com
name: Rong Ou
- email: terrytangyuan@gmail.com
name: Yuan Tang
type: mpi-operator
version: v1
selector:
matchLabels:
app.kubernetes.io/component: mpijob
app.kubernetes.io/instance: mpi-operator
app.kubernetes.io/managed-by: kfctl
app.kubernetes.io/name: mpi-operator
app.kubernetes.io/part-of: kubeflow
app.kubernetes.io/version: v1.0
@terrytangyuan Do you know which commit mpioperator/mpi-operator:latest is based on?
It should be based on the latest commit https://github.com/kubeflow/mpi-operator/commit/75f424a802dafb3662bc5c76b8f3c3cb60127fac
@carmark @terrytangyuan @gaocegege Following up on this thread to check if you could 👀 and help in investigating the issue.
I met the same issue.
@asahalyft I think the main point of the error logs is "the server could not find the requested resource (put mpijobs.kubeflow.org ...)". It means the CRD manifest lacks the status subresource. After I added the subresource to the CRD and reran the job, the issue was resolved for me:
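Roughly, a sketch of the change (against the apiextensions.k8s.io/v1beta1 CRD from the manifest above; the stanza enables the /status endpoint that the failing PUT targets):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: mpijobs.kubeflow.org
spec:
  group: kubeflow.org
  # ...names, scope, and versions exactly as in the generated manifest above...
  subresources:
    status: {}   # serve .../mpijobs/<name>/status so status updates succeed

You can confirm the subresource is served with: kubectl get crd mpijobs.kubeflow.org -o yaml | grep -A1 subresources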
@asahalyft Could you please try the suggestion of @qifengz ?
Also, for context/completeness, I am installing only the mpi operator component of Kubeflow and not the entire Kubeflow installation.
Update the CRD and ClusterRole to v0.2.3, then try again.
Thanks @qifengz .
@carmark @chongchuanbing Weird. I have moved away from the Kubeflow kustomize route and am applying the operator YAML directly now. The feedback on this issue has been extremely slow, so I had to downgrade to mpi-operator v1alpha2 and applied https://github.com/kubeflow/mpi-operator/tree/master/deploy/v1alpha2 to make some progress on my end. The v1 operator from https://github.com/kubeflow/mpi-operator/tree/master/deploy/v1 still did not work for me. Both CRD defs, from v1 and v1alpha2, have the section:

subresources:
  status: {}

Are there any clear benefits of using v1 over v1alpha2?
What's the error with v1? I ran it OK.
@asahalyft I did test it locally with v1, installed via the YAML file, and it works as expected.
Yes, you can switch to v1alpha2, the longer version.
I'm encountering this issue; is there a quick fix or workaround for it?
Please provide more information: the version you are running, the API version you are using, etc.
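For example, something along these lines would show it (generic commands; adjust the namespace and deployment name to your install):

kubectl api-versions | grep kubeflow
kubectl version
kubectl -n mpi-operator get deploy mpi-operator -o jsonpath='{.spec.template.spec.containers[0].image}'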
Let me know what else to query...
kubeflow.org/v1
kubeflow.org/v1alpha1
kubeflow.org/v1alpha2
kubeflow.org/v1beta1

Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.4", GitCommit:"e87da0bd6e03ec3fea7933c4b5263d151aafd07c", GitTreeState:"clean", BuildDate:"2021-02-18T16:03:00Z", GoVersion:"go1.15.8", Compiler:"gc", Platform:"linux/amd64"}
As for my job manifest:
apiVersion: kubeflow.org/v1beta1
kind: MPIJob
metadata: ...
Can you confirm which version of the mpi-operator you are running? I suppose you can have a look at the deployment image.
Via k get all -n mpi-operator -oyaml > mpi-operator-k-get-all.yaml: the operator image is .../docker.io/mpi-operator:v0.2.3
Are you getting the same kind of error in the logs?
error syncing 'asaha/tf2-keras-mnist-mpi-gpu': the server could not find the requested resource (put mpijobs.kubeflow.org tf2-keras-mnist-mpi-gpu)
If so, I believe the manifest here: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml shouldn't have the problem mentioned here https://github.com/kubeflow/mpi-operator/issues/297#issuecomment-758612853
Couldn't find that line. Below is a chunk of the mpi pod log.
I0923 17:33:45.775485 1 server.go:88] Using cluster scoped operator
I0923 17:33:45.775535 1 server.go:94] [API Version: v1alpha2 Version: v0.2.2 Git SHA: aa96794299fa336f2132fb15fe98cc4b7f1d2599 Built: 2020-05-19 15:51:49 Go Version: go1.13.6 Go OS/Arch: linux/amd64]
I0923 17:33:45.775550 1 server.go:97] Server options: &{Kubeconfig: MasterURL: KubectlDeliveryImage:dtr.thefacebook.com/docker.io/kubectl-delivery:v0.2.3 Threadiness:2 MonitoringPort:0 PrintVersion:false GangSchedulingName: Namespace: LockNamespace:mpi-operator}
W0923 17:33:45.775614 1 client_config.go:541] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
I0923 17:33:45.862667 1 leaderelection.go:235] attempting to acquire leader lease mpi-operator/mpi-operator...
I0923 17:33:45.862685 1 server.go:204] Start listening to 8080 for health check
I0923 17:33:45.879171 1 leaderelection.go:245] successfully acquired lease mpi-operator/mpi-operator
I0923 17:33:45.879360 1 server.go:242] Leading started
I0923 17:33:45.879321 1 event.go:258] Event(v1.ObjectReference{Kind:"Endpoints", Namespace:"mpi-operator", Name:"mpi-operator", UID:"259796c3-a574-4026-abcd-fb5b72ea789a", APIVersion:"v1", ResourceVersion:"116386418", FieldPath:""}): type: 'Normal' reason: 'LeaderElection' mpi-operator-6d7f5ff98c-v9dz6_69fcf4a1-a96e-45d3-b00c-f783074a903a became leader
I0923 17:33:45.879731 1 mpi_job_controller.go:219] Setting up event handlers
I0923 17:33:45.879830 1 mpi_job_controller.go:353] Starting MPIJob controller
I0923 17:33:45.879857 1 mpi_job_controller.go:356] Waiting for informer caches to sync
I0923 17:33:46.380088 1 mpi_job_controller.go:366] Starting workers
I0923 17:33:46.380110 1 mpi_job_controller.go:372] Started workers
I0923 17:33:46.380426 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"team", Name:"mpiexampledemo28", UID:"8e2a62c7-149b-44cd-a8e0-0a1362263ec2", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"116149843", FieldPath:""}): type: 'Normal' reason: 'MPIJobSucceeded' MPIJob team/mpiexampledemo28 successfully completed.
I0923 17:33:46.380515 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"team", Name:"tensorflow-benchmarks", UID:"3aab9e68-4915-4e54-8a2b-e438332967a8", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"115928376", FieldPath:""}): type: 'Warning' reason: 'MPIJobFailed' MPIJob team/tensorflow-benchmarks has failed
I0923 17:33:46.383235 1 mpi_job_controller.go:447] Finished syncing job "team/tensorflow-benchmarks" (3.059812ms)
E0923 17:33:46.383283 1 mpi_job_controller.go:434] error syncing 'team/tensorflow-benchmarks': mpijobs.kubeflow.org "tensorflow-benchmarks" not found
error syncing 'team/tensorflow-benchmarks': mpijobs.kubeflow.org "tensorflow-benchmarks" not found
Uhm.... And the job still exists?
Not sure about the one in the default namespace. Here's a more current one. Log from the MPI-Operator pod:
I1006 23:24:18.960746 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (70.336037ms)
E1006 23:24:18.960789 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:24:18.961804 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (987.295µs)
E1006 23:24:18.961843 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:24:19.601510 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (1.515862ms)
E1006 23:24:19.601580 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:24:40.083734 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (1.978215ms)
E1006 23:24:40.083784 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
I1006 23:25:36.164449 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"default", Name:"tensorflow-benchmarks", UID:"2d778959-dc12-44b2-9181-6757964ffa55", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"123245452", FieldPath:""}): type: 'Warning' reason: 'MPIJobFailed' MPIJob default/tensorflow-benchmarks has failed
I1006 23:25:36.169236 1 mpi_job_controller.go:447] Finished syncing job "default/tensorflow-benchmarks" (4.971223ms)
E1006 23:25:36.169283 1 mpi_job_controller.go:434] error syncing 'default/tensorflow-benchmarks': mpijobs.kubeflow.org "tensorflow-benchmarks" not found
I1006 23:27:23.924141 1 event.go:258] Event(v1.ObjectReference{Kind:"MPIJob", Namespace:"team", Name:"mpiexampledemo44", UID:"979808dc-f477-426e-992d-75a43ea9cc2c", APIVersion:"kubeflow.org/v1alpha2", ResourceVersion:"124469782", FieldPath:""}): type: 'Normal' reason: 'MPIJobSucceeded' MPIJob team/mpiexampledemo44 successfully completed.
I1006 23:27:23.926522 1 mpi_job_controller.go:447] Finished syncing job "team/mpiexampledemo44" (2.542902ms)
E1006 23:27:23.926596 1 mpi_job_controller.go:434] error syncing 'team/mpiexampledemo44': mpijobs.kubeflow.org "mpiexampledemo44" not found
=========================================================================================================================================
Other info:

kubectl get pods
NAME                              READY   STATUS      RESTARTS   AGE
...
mpiexampledemo44-launcher-9vbvv   0/1     Completed   0          17m
mpiexampledemo44-worker-0         1/1     Running     0          17m
mpiexampledemo44-worker-1         1/1     Running     0          17m
mpiexampledemo44-worker-2         1/1     Running     0          17m

kubectl get mpijobs
NAME               AGE
mpiexampledemo44   33m
So something external to the controller seems to be removing the mpijob, and perhaps it's doing that with cascade=false.
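For illustration, an orphaning delete would look like this (a hypothetical command against this cluster; kubectl 1.20 spells the flag --cascade=false, newer releases use --cascade=orphan) and would leave the launcher and worker pods behind exactly as shown above:

kubectl delete mpijob mpiexampledemo44 -n team --cascade=false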
Hi asahalyft, in my case the pods were cleaned up successfully with this. You can try setting:

runPolicy:
  cleanPodPolicy: Running

e.g.:
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
name: tf2-keras-mnist-mpi-gpu
spec:
slotsPerWorker: 1
runPolicy:
cleanPodPolicy: Running
mpiReplicaSpecs:
Launcher:
replicas: 1
template:
spec:
containers:
- image: docker.io/horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu
name: keras-mnist-mpi-launcher
command:
- mpirun
args:
- -np
- "2"
- --allow-run-as-root
- -bind-to
- none
- -map-by
- slot
- -x
- LD_LIBRARY_PATH
- -x
- PATH
- -mca
- pml
- ob1
- -mca
- btl
- ^openib
- python
- /examples/tensorflow2_keras_mnist.py
resources:
limits:
cpu: 1
memory: 2Gi
Worker:
replicas: 2
template:
spec:
containers:
- image: docker.io/horovod/horovod:0.19.3-tf2.1.0-torch-mxnet1.6.0-py3.6-gpu
name: keras-mnist-mpi-worker
resources:
limits:
nvidia.com/gpu: 1
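(For what it's worth, my understanding of cleanPodPolicy: Running is that it tells the controller to delete only the pods that are still running once the job finishes, which matches the workers' state here.)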
I met the same issue, and the solution from @hongyonggan doesn't work for me.
I applied the training operator as follows:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
kubectl logs -f training-operator-5cfdcb7d9d-nvf57 -n kubeflow
time="2022-04-02T15:07:31Z" level=info msg="MPIJob default/tensorflow2-keras-mnist-elastic is created."
time="2022-04-02T15:07:31Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:31.826Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "SuccessfulCreatePod", "message": "Created worker pod: tensorflow2-keras-mnist-elastic-worker-0"}
2022-04-02T15:07:31.851Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "SuccessfulCreatePod", "message": "Created worker pod: tensorflow2-keras-mnist-elastic-worker-1"}
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "MPIJobRunning", "message": "launcher pod created success: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
2022-04-02T15:07:31.857Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:31Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:31Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (93.538317ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:31Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:31.952Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010771"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:31Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:31Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:32Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (97.54632ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:32Z" level=warning msg="Reconcile MPIJob error Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"
2022-04-02T15:07:32.050Z ERROR controller-runtime.manager.controller.mpijob-controller Reconciler error {"name": "tensorflow2-keras-mnist-elastic", "namespace": "default", "error": "Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-04-02T15:07:32Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
2022-04-02T15:07:32.051Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:32Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:32Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.056Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.057Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:32.057Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:32Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:32Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.310Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.323Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010792"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (14.157662ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.350Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.828Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=1, succeeded=0 , failed=0"
2022-04-02T15:07:34.829Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.842Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.849Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (9.422552ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.859Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.859Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.859Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.860Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.860Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:34Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
2022-04-02T15:07:34.860Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010830"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:34Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (8.803506ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:34Z" level=warning msg="Reconcile MPIJob error Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"
2022-04-02T15:07:34.869Z ERROR controller-runtime.manager.controller.mpijob-controller Reconciler error {"name": "tensorflow2-keras-mnist-elastic", "namespace": "default", "error": "Operation cannot be fulfilled on mpijobs.kubeflow.org \"tensorflow2-keras-mnist-elastic\": the object has been modified; please apply your changes to the latest version and try again"}
time="2022-04-02T15:07:34Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:34.869Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.870Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.870Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:34.870Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:35Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:35Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.251Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.252Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:35.252Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:35Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
2022-04-02T15:07:35.252Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:35Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:37Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:37.644Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.644Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.644Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.645Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.645Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:37.645Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:37Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=0, succeeded=0 , failed=0"
time="2022-04-02T15:07:37Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:37Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:38Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
2022-04-02T15:07:38.642Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.643Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.643Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
time="2022-04-02T15:07:38Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
2022-04-02T15:07:38.643Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010835"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
time="2022-04-02T15:07:38Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (15.886545ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:07:38Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
2022-04-02T15:07:38.659Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.659Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "ServiceAccount is exist", "message": "ServiceAccount: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "LauncherRole is exist", "message": "LauncherRole: tensorflow2-keras-mnist-elastic-launcher"}
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=1, running=1, succeeded=0 , failed=0"
time="2022-04-02T15:07:38Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Worker expected=2, running=2, succeeded=0 , failed=0"
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "RoleBinding is exist", "message": "RoleBinding: tensorflow2-keras-mnist-elastic-launcher"}
2022-04-02T15:07:38.660Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobRunning", "message": "MPIJob default/tensorflow2-keras-mnist-elastic is running"}
time="2022-04-02T15:07:38Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is running." job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:13:49Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
time="2022-04-02T15:13:49Z" level=info msg="MPIJob=tensorflow2-keras-mnist-elastic, ReplicaType=Launcher expected=0, running=0, succeeded=1 , failed=0"
time="2022-04-02T15:13:49Z" level=info msg="MPIJob tensorflow2-keras-mnist-elastic is successfully completed."
2022-04-02T15:13:49.503Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobSucceeded", "message": "MPIJob default/tensorflow2-keras-mnist-elastic successfully completed."}
2022-04-02T15:13:49.504Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "MPIJobSucceeded", "message": "MPIJob default/tensorflow2-keras-mnist-elastic successfully completed."}
2022-04-02T15:13:49.504Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"MPIJob","namespace":"default","name":"tensorflow2-keras-mnist-elastic","uid":"a8c0ab21-54e3-40b4-afde-de00628c7fb6","apiVersion":"kubeflow.org/v1","resourceVersion":"5010846"}, "reason": "JobSucceeded", "message": "MPIJob tensorflow2-keras-mnist-elastic is successfully completed."}
time="2022-04-02T15:13:49Z" level=info msg="Finished updating MpiJobs Status \"tensorflow2-keras-mnist-elastic\" (30.703899ms)" job=default.tensorflow2-keras-mnist-elastic uid=a8c0ab21-54e3-40b4-afde-de00628c7fb6
time="2022-04-02T15:13:49Z" level=info msg="Reconciling for job tensorflow2-keras-mnist-elastic"
I met the same issue, and the solution from @hongyonggan doesn't work for me.
I applied the training operator as follows:
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone"
kubectl logs -f training-operator-5cfdcb7d9d-nvf57 -n kubeflow
Resolved by updating the manifests with the latest image tag (8c4323194a09b3cabd36248cc172493006f71b75), since the issue is fixed by https://github.com/kubeflow/training-operator/pull/1550
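For anyone pinning the operator manually before the manifests catch up, a minimal sketch (the deployment name, container name, and image repository/tag format are assumptions; verify them against the manifests deployed in your cluster):

# Point the operator deployment at the fixed image tag
kubectl set image deployment/training-operator -n kubeflow training-operator=kubeflow/training-operator:8c4323194a09b3cabd36248cc172493006f71b75

# Confirm the new image rolled out
kubectl rollout status deployment/training-operator -n kubeflow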
I am testing out the TFJob and MPIJob operators from the kubeflow/manifests v1.1.0 branch on AWS EKS (Kubernetes 1.14). I am able to schedule TFJobs and MPIJobs successfully, and these jobs also complete fine; I have verified this by watching the Kubernetes events, which show the MPIJob completing successfully.
However, I observe that the worker pods keep running even after the launcher pod and the MPIJob have completed. I see the same behavior for TFJob.
I applied the following YAML for the MPIJob.
I collected the MPI controller logs, which clearly show that the MPIJob has indeed completed but that the controller is failing to sync that information.
Observe the pods in my namespace:
Observe the MPI controller logs in the kubeflow namespace:
Describe the MPIJob:
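The outputs for the three views above were not preserved here; for reference, commands along these lines reproduce them (the namespace, job name, and operator deployment name are placeholders/assumptions):

kubectl get pods -n <my-namespace>
kubectl logs deployment/mpi-operator -n kubeflow
kubectl describe mpijob <mpijob-name> -n <my-namespace>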