intel / intel-extension-for-tensorflow

Intel® Extension for TensorFlow*

Distributed training with Kubeflow's MPIJob and Horovod #66

Closed tkatila closed 7 months ago

tkatila commented 8 months ago

This is more of a question than a bug report, and it is not specifically about ITEX.

I've been trying to use the ITEX XPU Docker image with Kubeflow's MPIJob on Kubernetes. An MPIJob sets up training across multiple containers (k8s Pods): a Launcher Pod coordinates the run and accesses Worker Pods to execute code on them. It is similar to local distributed training, except that instead of spawning new processes, execution is handled via remote shell.

The topology is such that the Launcher has no GPUs attached to it, and each Worker has one GPU. The basic idea is that one can scale the training by increasing the Worker count.

When I use mpirun to start the training from the Launcher, the execution succeeds and the GPU-attached Worker Pods are utilized:

mpirun -hosts worker1,worker2 -np 2 -ppn 1 python3 tensorflow2_keras_mnist.py

When I use horovodrun to start the training I get the following error:

# horovodrun -np 2 --hostfile /etc/mpi/hostfile python3 tensorflow2_keras_mnist.py
...
2024-03-21 08:53:19.292348: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-21 08:53:19.292557: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-21 08:53:20.006208: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
  what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)

When I tried the same scenario earlier (~Q2'23), the execution worked with both horovodrun and mpirun. Also, if I run everything in the same container, horovodrun works because it is able to access the GPUs.

My question: Is this a supported scenario for horovodrun?

The training that I'm running is this: https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/train_horovod/mnist
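
For context, the script follows the usual Horovod + ITEX pattern, roughly like this (a sketch based on the stock Horovod Keras MNIST example; the linked script may differ in details):

    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()

    # Pin each Horovod rank to a single local XPU, mirroring the usual GPU-pinning pattern.
    xpus = tf.config.experimental.list_physical_devices("XPU")
    if xpus:
        tf.config.experimental.set_visible_devices(xpus[hvd.local_rank()], "XPU")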

xiguiw commented 7 months ago

@tkatila

"horovodrun -np 2 --hostfile /etc/mpi/hostfile python3 tensorflow2_keras_mnist.py" Doe the command run on the Launcher Pod (which has no GPUs attached to it) or the Workers?

It seems that either ITEX or some other component checks whether there is a local GPU device.

In the document https://github.com/intel/intel-extension-for-tensorflow/tree/main/examples/train_horovod/mnist, the example seems to be written for local GPU devices.

Let me investigate whether there is such a check in horovodrun.

xiguiw commented 7 months ago

@tkatila

Could you run a couple of tests to confirm which component outputs that log?

  1. Check whether ITEX checks for the GPU device and prints that log: export ITEX_VERBOSE=1, then run your command to launch the training (a standalone check that isolates ITEX device discovery from Horovod is sketched after this list):

    horovodrun -np 2 --hostfile /etc/mpi/hostfile python3 tensorflow2_keras_mnist.py

  2. Run the training in an environment where the Launcher Pod has an XPU (with the training still distributed to the Worker Pods' GPUs)? Thanks!

    horovodrun -np 2 --hostfile /etc/mpi/hostfile python3 tensorflow2_keras_mnist.py
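
For reference, a minimal standalone check that separates ITEX/SYCL device discovery from Horovod could look like this (an illustrative sketch using the standard TensorFlow device-listing API, not the exact internal check):

    # Run on the Launcher Pod alone, without horovodrun or mpirun. If this already
    # aborts with PI_ERROR_DEVICE_NOT_FOUND, the failure happens during local
    # ITEX/SYCL initialization rather than in Horovod's launch logic.
    import tensorflow as tf
    print(tf.config.list_physical_devices("XPU"))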

tkatila commented 7 months ago
  1. ITEX_VERBOSE doesn't seem to add much to the logs. This is the full log:

    2024-03-25 07:03:22.408454: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    2024-03-25 07:03:22.410442: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
    2024-03-25 07:03:22.444453: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    2024-03-25 07:03:22.444489: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    2024-03-25 07:03:22.444517: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2024-03-25 07:03:22.451584: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
    2024-03-25 07:03:22.451835: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    (xpum-observer) testrunner@7cc25526b77a:~/tkatila$ kubectl logs tensorflow-mnist-launcher 
    Defaulted container "mpi" out of: mpi, kubectl-delivery (init)
    2024-03-25 07:03:22.408454: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
    2024-03-25 07:03:22.410442: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
    2024-03-25 07:03:22.444453: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
    2024-03-25 07:03:22.444489: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
    2024-03-25 07:03:22.444517: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
    2024-03-25 07:03:22.451584: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
    2024-03-25 07:03:22.451835: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
    To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
    2024-03-25 07:03:23.182621: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
    terminate called after throwing an instance of 'sycl::_V1::runtime_error'
    what():  No device of requested type available. Please check https://software.intel.com/content/www/us/en/develop/articles/intel-oneapi-dpcpp-system-requirements.html -1 (PI_ERROR_DEVICE_NOT_FOUND)
    Traceback (most recent call last):
      File "/usr/local/bin/horovodrun", line 8, in <module>
        sys.exit(run_commandline())
      File "/usr/local/lib/python3.10/dist-packages/horovod/runner/launch.py", line 837, in run_commandline
        _run(args)
      File "/usr/local/lib/python3.10/dist-packages/horovod/runner/launch.py", line 827, in _run
        return _run_static(args)
      File "/usr/local/lib/python3.10/dist-packages/horovod/runner/launch.py", line 685, in _run_static
        _launch_job(args, settings, nics, command)
      File "/usr/local/lib/python3.10/dist-packages/horovod/runner/launch.py", line 800, in _launch_job
        run_controller(args.use_gloo, gloo_run_fn,
      File "/usr/local/lib/python3.10/dist-packages/horovod/runner/launch.py", line 770, in run_controller
        if mpi_built(verbose=verbose):
      File "/usr/local/lib/python3.10/dist-packages/horovod/common/util.py", line 122, in wrapper
        retval = f(*args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/horovod/common/util.py", line 140, in mpi_built
        result = _check_extension_lambda(
      File "/usr/local/lib/python3.10/dist-packages/horovod/common/util.py", line 104, in _check_extension_lambda
        return queue.get_nowait()
      File "/usr/lib/python3.10/multiprocessing/queues.py", line 135, in get_nowait
        return self.get(False)
      File "/usr/lib/python3.10/multiprocessing/queues.py", line 116, in get
        raise Empty
    _queue.Empty
  2. If I add a GPU to the Launcher Pod, the scenario does work, sort of. In a 1x Launcher + 2x Worker scenario with one GPU per entity, the Launcher's and Worker1's GPUs are utilized. I can get Worker2's GPU utilized by setting -np 3 with Horovod, but that configuration is a bit odd since the job is configured with only two Workers.

xiguiw commented 7 months ago

@tkatila Thanks for the testing.

"2024-03-25 07:03:23.182621: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT terminate called after throwing an instance of 'sycl::_V1::runtime_error'"

  1. It looks like the error is triggered from TensorFlow.
  2. With a GPU added to the Launcher Pod it works. But the Launcher's GPU should not be necessary (in your case --hostfile does not include the Launcher Pod, only the Workers, right?).
  3. There is no reason Horovod should check for a GPU on the Launcher, especially since mpirun succeeds.

It looks like something in horovodrun's setup leads it to check for the (XPU) device on the Launcher. I'll check what happens in py_utils.cc.
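
From the traceback posted above, the check appears to come from horovodrun itself: before picking a controller it calls mpi_built(), which probes for MPI support by importing Horovod's framework extension in a helper process (horovod/common/util.py in the trace). A rough, simplified sketch of that probe (illustrative only, not the actual Horovod source):

    # Simplified illustration of what horovodrun's mpi_built() check does on the Launcher.
    # The helper process imports Horovod's TensorFlow extension; with the ITEX XPU wheel
    # installed and no usable local GPU, that import aborts inside the SYCL runtime, the
    # helper never posts a result, and the parent ends up raising _queue.Empty.
    import multiprocessing

    def probe(queue):
        import horovod.tensorflow as hvd  # loads TensorFlow + ITEX; the SYCL abort happens here
        queue.put(hvd.mpi_built())

    if __name__ == "__main__":
        queue = multiprocessing.Queue()
        proc = multiprocessing.Process(target=probe, args=(queue,))
        proc.start()
        proc.join()
        print(queue.get_nowait())  # raises queue.Empty when the probe process crashed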

tkatila commented 7 months ago

in your case --hostfile does not include the Launcher Pod, only the Workers, right?

Yes. The hostfile includes only workers.

xiguiw commented 7 months ago

@tkatila

I created similar cases without k8s, and horovodrun works well. So it seems the issue is some configuration error or compatibility problem between Horovod and k8s.

Here is my test:

  1. Set up Horovod on 3 platforms: machine A with an Arc GPU and an iGPU; machine B with a Flex GPU; machine C with SPR and no GPU.

  2. horovodrun on machine B:

    horovodrun -np 1 -H machine_A:1 python tensorflow2_keras_mnist.py

    The training runs on machine A successfully.

  3. horovodrun on machine C. This is similar to your case, where there is no GPU on the launcher. The training runs on machine A successfully, too.

I list the detailed environment and logs here for your reference.

environment

intel-optimization-for-horovod 0.28.1.3                 pypi_0    pypi
intel-extension-for-tensorflow 2.14.0.2                 pypi_0    pypi
intel-extension-for-tensorflow-lib 2.14.0.2.2               pypi_0    pypi
tensorflow                2.14.1                   pypi_0    pypi
tensorflow-addons         0.23.0                   pypi_0    pypi
tensorflow-datasets       4.9.3                    pypi_0    pypi
tensorflow-estimator      2.14.0                   pypi_0    pypi
tensorflow-io-gcs-filesystem 0.36.0                   pypi_0    pypi
tensorflow-metadata       1.14.0                   pypi_0    pypi
tensorflow-model-optimization 0.8.0                    pypi_0    pypi

horovodrun on the launcher without a GPU. The log output:

(horovod) xiguiwang@a4bf01945e87:~$ horovodrun    -np 1 -H machine_A:1 python tensorflow2_keras_mnist.py
Filtering local host names.
Remote host found: --
Checking ssh on all remote hosts.
SSH was successful into all the remote hosts.
2024-03-26 00:50:30.641544: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-03-26 00:50:30.643392: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-26 00:50:30.672002: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-03-26 00:50:30.672040: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-03-26 00:50:30.672065: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-03-26 00:50:30.678819: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-03-26 00:50:30.678996: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI AVX512_BF16 AVX_VNNI AMX_TILE AMX_INT8 AMX_BF16 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-03-26 00:50:31.239791: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-03-26 00:50:31.475542: W itex/core/wrapper/itex_gpu_wrapper.cc:32] Could not load dynamic library: libze_loader.so.1: cannot open shared object file: No such file or directory
2024-03-26 00:50:31.560327: I itex/core/wrapper/itex_cpu_wrapper.cc:60] Intel Extension for Tensorflow* AVX512 CPU backend is loaded.
2024-03-26 00:50:31.603209: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow* GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
2024-03-26 00:50:31.603439: E itex/core/wrapper/itex_gpu_wrapper.cc:49] Could not load Intel Extension for Tensorflow* GPU backend, GPU will not be used.
If you need help, create an issue at https://github.com/intel/intel-extension-for-tensorflow/issues
mpirun -l -np 1 -ppn 1 -hosts 10.239.44.84     -genv NCCL_SOCKET_IFNAME=eth0    python tensorflow2_keras_mnist.py
[0] 2024-03-26 15:44:40.049681: I tensorflow/core/util/port.cc:111] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
[0] 2024-03-26 15:44:40.050865: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[0] 2024-03-26 15:44:40.067902: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
[0] 2024-03-26 15:44:40.067918: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
[0] 2024-03-26 15:44:40.067936: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
[0] 2024-03-26 15:44:40.071714: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
[0] 2024-03-26 15:44:40.071825: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
[0] To enable the following instructions: AVX2 AVX_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
[0] 2024-03-26 15:44:40.437098: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
[0] 2024-03-26 15:44:40.853276: I itex/core/wrapper/itex_gpu_wrapper.cc:35] Intel Extension for Tensorflow* GPU backend is loaded.
[0] 2024-03-26 15:44:40.878750: I itex/core/wrapper/itex_cpu_wrapper.cc:70] Intel Extension for Tensorflow* AVX2 CPU backend is loaded.
[0] 2024-03-26 15:44:40.930885: I itex/core/devices/gpu/itex_gpu_runtime.cc:129] Selected platform: Intel(R) Level-Zero
[0] 2024-03-26 15:44:40.931072: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] 2024-03-26 15:44:40.931075: I itex/core/devices/gpu/itex_gpu_runtime.cc:154] number of sub-devices is zero, expose root device.
[0] /home/xiguiwang/anaconda3/envs/horovod/lib/python3.9/site-packages/horovod/common/util.py:258: UserWarning: Framework tensorflow installed with version 2.14.0 but found version 2.14.1.
[0]              This can result in unexpected behavior including runtime errors.
[0]              Reinstall Horovod using `pip install --no-cache-dir` to build with the new version.
[0]   warnings.warn(get_version_mismatch_message(name, version, installed_version))
[0] 2024-03-26 15:44:42.216786: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:306] Could not identify NUMA node of platform XPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
[0] 2024-03-26 15:44:42.216822: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:272] Created TensorFlow device (/job:localhost/replica:0/task:0/device:XPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: XPU, pci bus id: <undefined>)
[0] Horovod size 1
[0] XPU count is 2
[0] XPU: PhysicalDevice(name='/physical_device:XPU:0', device_type='XPU')
[0] XPU: PhysicalDevice(name='/physical_device:XPU:1', device_type='XPU')
[0] Epoch 1/24
[0] 2024-03-26 15:44:43.251961: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type CPU is enabled.
  1/500 [..............................] - ETA: 11:51 - loss: 2.2850 - accuracy: 0.1484
[0] WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time
492/500 [============================>.] - ETA: 0s - loss: 0.2389 - accuracy: 0.9273
[0] /home/xiguiwang/anaconda3/envs/horovod/lib/python3.9/site-packages/keras/src/engine/training.py:3079: UserWarning: You are saving your model as an HDF5 file via `model.save()`. This file format is considered legacy. We recommend using instead the native Keras format, e.g. `model.save('my_model.keras')`.
500/500 [==============================] - 4s 6ms/step - loss: 0.2367 - accuracy: 0.9280 - lr: 0.0010
[0] Epoch 2/24
500/500 [==============================] - 3s 6ms/step - loss: 0.0833 - accuracy: 0.9756 - lr: 0.0010
...
tkatila commented 7 months ago

Thanks for the validation!

I think I figured out the issue. I was using the xpu variant of the container for both the Launcher and the Workers. I temporarily uninstalled the itex pip package and installed the cpu variant of the same package, and now the training works on the Workers while the Launcher has no GPU.

In summary, horovodrun does not want to run when the installed itex package is the xpu variant and there are no GPUs on the host.
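
For anyone hitting the same thing, a quick way to confirm which backend an image actually exposes (an illustrative check, not from the ITEX docs) is to list TensorFlow's pluggable devices inside the container:

    # With the cpu variant (or the xpu variant plus a visible GPU) this returns normally;
    # with the xpu variant and no GPU device it may abort the way the Launcher log above shows.
    import tensorflow as tf
    print(tf.config.list_physical_devices())  # XPU entries appear only when the XPU backend initialized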

xiguiw commented 7 months ago

@tkatila It's great that you have your problem resolved.

Good to know this. Thank you for sharing this knowledge! Yes, the ITEX CPU/XPU package should match your platform.