ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0
32.93k stars 5.57k forks

[Core | Jobs] Ray Cluster Affected by TensorFlow GPU Detection Bug #46632

Open sercanCyberVision opened 1 month ago

sercanCyberVision commented 1 month ago

What happened + What you expected to happen

This issue originates from TensorFlow, but Ray uses the affected version. Is there any action that can be taken on the Ray side to remedy this issue?

Please see the open TF issue: https://github.com/tensorflow/tensorflow/issues/70960.

When the TensorFlow version is 2.15.1, TensorFlow cannot find the CUDA driver and defaults to using the CPU for the job: [screenshot: 2.15.1]

When the TF version is 2.13.0, the job runs on the GPU as expected: [screenshot: 2.13.0]

Versions / Dependencies

Ray images: 2.24.0
KubeRay Operator: 1.1.1
TF: 2.15.1

Reproduction script

We submit the following job:

import tensorflow as tf
import numpy as np

def main():
    print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))
    if tf.config.list_physical_devices('GPU'):
        print("TensorFlow will run on GPU.")
    else:
        print("TensorFlow will run on CPU. Make sure your setup allows GPU usage.")

    tf.debugging.set_log_device_placement(True)

    # Creating a larger array of numbers
    data = np.arange(1, 100000)
    tensor = tf.constant(data, dtype=tf.float32)

    with tf.device('/GPU:0'):  
        squared_tensor = tf.square(tensor)

    print("Original data sample:", data[:10])
    print("Squared data sample:", squared_tensor.numpy()[:10])

if __name__ == "__main__":
    main()

We submit it with the job submission client as follows:

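from ray.job_submission import JobSubmissionClient

# The address is a placeholder (assumption); point it at your cluster's dashboard.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    entrypoint="python ray-gpu-example.py",
    runtime_env={
        "working_dir": "./"
    },
    entrypoint_num_gpus=1,
    entrypoint_num_cpus=1,
)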
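To see which device the submitted script reported, the job's logs can be fetched once it finishes. The following is a minimal sketch, not part of the original report: `wait_for_job` is a hypothetical helper, the timeout and poll interval are arbitrary, and it assumes a client object exposing `get_job_status` and `get_job_logs`, as Ray's `JobSubmissionClient` does.

```python
import time

# Terminal job states, matched by name so plain strings and enum values both work.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(client, job_id, timeout_s=300.0, poll_s=2.0):
    """Poll until the job reaches a terminal state, then return (status, logs).

    `wait_for_job` is a hypothetical helper written for illustration.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = client.get_job_status(job_id)
        # Normalize enum members to their string value; leave strings untouched.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name, client.get_job_logs(job_id)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {name} after {timeout_s}s")
        time.sleep(poll_s)
```

Calling `wait_for_job(client, job_id)` after `submit_job` returns the final status and the captured output, which includes the "Num GPUs Available" line printed by the script.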

Issue Severity

High: It blocks me from completing my task.

rynewang commented 1 month ago

@can-anyscale can you check out this known bad tensorflow version?

jjyao commented 1 month ago

Feel free to install a different version of TF that doesn't have this bug. We will upgrade tensorflow after they fix the issue.
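One way to try a different TF version without rebuilding the image is to pin it through the job's `runtime_env`. This is a sketch under assumptions: `build_runtime_env` and `submit_with_pinned_tf` are hypothetical helpers, the dashboard address is supplied by the caller, and whether the pinned package cleanly overrides the TF preinstalled in the Ray image has not been verified here.

```python
def build_runtime_env(tf_version="2.13.0", working_dir="./"):
    """Return a runtime_env that asks Ray to pip-install a pinned TensorFlow."""
    return {
        "working_dir": working_dir,
        # Ray's runtime_env accepts a list of pip requirement strings.
        "pip": [f"tensorflow=={tf_version}"],
    }


def submit_with_pinned_tf(address, tf_version="2.13.0"):
    """Submit the reproduction script with a pinned TF (hypothetical helper)."""
    # Imported lazily so build_runtime_env can be used without Ray installed.
    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient(address)
    return client.submit_job(
        entrypoint="python ray-gpu-example.py",
        runtime_env=build_runtime_env(tf_version),
        entrypoint_num_gpus=1,
        entrypoint_num_cpus=1,
    )
```

The version 2.13.0 is used as the default only because it is the one reported as working in this issue.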

can-anyscale commented 1 month ago

This might also be related: each TensorFlow version requires a minimum CUDA driver version, so make sure you have the correct one installed: https://www.tensorflow.org/install/source#gpu

sercanCyberVision commented 1 month ago

> This might also be related: each TensorFlow version requires a minimum CUDA driver version, so make sure you have the correct one installed: https://www.tensorflow.org/install/source#gpu

Thank you @can-anyscale, it is a good point.

I have checked the driver, and it looks like we are good, since TF 2.15.x requires CUDA 12.2 at minimum:

[root@nvidia-driver-daemonset-xmjd8 drivers]# nvidia-smi
Tue Jul 23 13:56:46 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+

I have retested just to make sure: TF version 2.15.1 fails to find the driver, but 2.13.0 works well.
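The driver check described above can be sketched in a few lines. This is only an illustration: `parse_cuda_version` and `driver_supports` are hypothetical helpers, and the (12, 2) minimum is taken from the comment above rather than verified independently.

```python
import re


def parse_cuda_version(nvidia_smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def driver_supports(nvidia_smi_output, required=(12, 2)):
    """Return True if the reported CUDA version is at least `required`."""
    found = parse_cuda_version(nvidia_smi_output)
    # Tuples compare element by element, so (12, 4) >= (12, 2) holds.
    return found is not None and found >= required


# Header line copied from the nvidia-smi output quoted above.
header = (
    "| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15"
    "      CUDA Version: 12.4     |"
)
print(driver_supports(header))  # True
```

On the TensorFlow side, `tf.sysconfig.get_build_info()` reports the CUDA version the installed wheel was built against, which may help compare the two versions from inside the job.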