securefederatedai / openfl

An Open Framework for Federated Learning.
https://openfl.readthedocs.io/en/latest/index.html
Apache License 2.0

Incompatibility between OpenFL and Nvidia Edge devices with L4T image #1160

Closed HubgitCCL closed 49 minutes ago

HubgitCCL commented 2 days ago

Issue Summary:

I am currently running federated learning tasks with OpenFL on NVIDIA Jetson devices. However, model training is not utilizing the GPU, despite using a compatible version of TensorFlow. Specifically, after downgrading TensorFlow to a version that is compatible with the OpenFL framework, training still does not use the GPU.

The root cause seems to be related to a mismatch between the versions of TensorFlow and CUDA. TensorFlow relies on specific versions of CUDA and cuDNN to enable GPU acceleration during training. When these versions are mismatched or incompatible, TensorFlow will not be able to use the GPU, even if a compatible version of TensorFlow is installed.
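To make the mismatch concrete, GPU support generally requires the installed CUDA toolkit to match the major.minor version the TensorFlow wheel was built against. A minimal sketch of that check (the version strings below are illustrative assumptions; on a real device they would come from `tf.sysconfig.get_build_info()` and `nvcc --version`):

```python
# Sketch: TensorFlow only enables GPU acceleration when the system CUDA
# toolkit matches the major.minor version the wheel was built against.
# The version strings here are illustrative, not taken from the device.

def cuda_major_minor(version: str) -> tuple:
    """Reduce a CUDA version string such as '12.2.140' to (12, 2)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)

def cuda_matches(build_cuda: str, system_cuda: str) -> bool:
    """True when the wheel's CUDA major.minor equals the system toolkit's."""
    return cuda_major_minor(build_cuda) == cuda_major_minor(system_cuda)

# Stock TF 2.13 wheels target CUDA 11.x, while the L4T container ships 12.2:
print(cuda_matches("11.8", "12.2.140"))  # prints: False
```

This is why downgrading TensorFlow alone does not help: the wheel's build-time CUDA version moves away from the container's fixed toolkit.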

The challenge I am facing is that the Jetson devices are running an L4T-based Docker image, which comes with a pre-installed version of CUDA. This version is tightly integrated with the operating system and the NVIDIA hardware. Downgrading or changing the CUDA version is not a viable solution, as it could break compatibility with the existing system and cause instability. The pre-configured L4T container and its version of CUDA cannot be modified easily, making it difficult to align the required versions of TensorFlow and CUDA.

I am seeking advice or potential solutions to resolve this issue without the need to downgrade or modify the CUDA version on the Jetson devices.

The CUDA version in the container:

root@ubuntu:/app# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:08:11_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0

The TensorFlow version pre-installed in the L4T container:

tensorflow 2.14.0+nv23.11

With "tensorflow 2.14.0+nv23.11", running OpenFL fails with the following error:

Segmentation fault (core dumped)

So I downgraded to TensorFlow 2.13.0, and training then started successfully; however, the training could not be performed on the GPU, possibly due to an incompatibility between TensorFlow and CUDA.

teoparvanov commented 1 day ago

Hi @HubgitCCL, in an attempt to isolate the issue, have you tried running your training script with tensorflow 2.14.0+nv23.11 independently of OpenFL (on a single machine, using a single dataset shard)?

HubgitCCL commented 21 hours ago

Hi @teoparvanov, thank you for your advice. I ran my training code with tensorflow 2.14 independently of OpenFL, and it succeeded.

then I tried to run the following command:

python3 -c "import tensorflow as tf; print('GPU Available: ', tf.config.list_physical_devices('GPU'))"

in the container on the device, and got the following result:

double free or corruption (out)

which indicates that the incompatibility has nothing to do with OpenFL.

Thanks for your help.

HubgitCCL commented 17 hours ago

Hi @teoparvanov, I figured out that some of OpenFL's dependencies were causing the incompatibility issues. To resolve this, I ran the following commands:

pip install --no-deps openfl
pip install --no-deps click
pip install --no-deps rich
pip install --no-deps dynaconf
pip install --no-deps tqdm
pip install --no-deps tensorboardx

After doing this, I was able to train using the GPU. Thanks so much for your kind assistance, @teoparvanov!
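A side note on the `--no-deps` workaround: it bypasses pip's dependency resolver entirely, so nothing verifies that the skipped dependencies are actually satisfied. One way to see what each package declares, without letting pip install anything, is the standard-library `importlib.metadata` (a sketch; the package names are taken from the commands above):

```python
# Sketch: list the dependencies a package declares, without installing them.
# Helpful after `pip install --no-deps` to review what was deliberately skipped.
from importlib import metadata

def declared_requirements(package: str) -> list:
    """Dependency specifiers declared by an installed package ([] if absent)."""
    try:
        return metadata.requires(package) or []
    except metadata.PackageNotFoundError:
        return []

for pkg in ("openfl", "click", "rich", "dynaconf", "tqdm", "tensorboardX"):
    reqs = declared_requirements(pkg)
    print(f"{pkg}: {len(reqs)} declared dependencies")
```

Reviewing that list makes it easier to spot which skipped pin (for example, one on TensorFlow itself) would otherwise have replaced the NVIDIA-built wheel.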

teoparvanov commented 5 hours ago

Thanks for the update, @HubgitCCL, I'm glad that you were able to run your experiment with GPU acceleration. On our side, we have started a comprehensive effort to update the TensorFlow-based task runner dependencies. This will take some time, but I expect tangible progress over the upcoming 1.7 and 1.8 releases of OpenFL.

CC: @tanwarsh @kta-intel