tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

On threading library using the GPU implementing multithreaded RL algorithm #58715

Closed alebldn closed 1 year ago

alebldn commented 1 year ago
### Issue Type
Bug

### Source
binary

### Tensorflow Version
v2.11.0-rc2-17-gd5b57ca93e5 (2.11.0); I also tried several other 2.11 builds

### Custom Code
Yes

### OS Platform and Distribution
Ubuntu 22.04.1 LTS

### Mobile device
Ubuntu 22.04.1 LTS

### Python version
3.10 and 3.8

### Bazel version
_No response_

### GCC/Compiler version
_No response_

### CUDA/cuDNN version
CUDA 11.0

### GPU model and memory
_No response_

### Current Behaviour?

This is a bug in the interaction between the threading library and the GPU, tested in a Docker environment using the latest tensorflow-gpu build. I even tried updating Python to the latest version. The code is an implementation of Asynchronous Advantage Actor-Critic (A3C) playing Pong. Even if there might be errors in the code, it works flawlessly on CPU; switching to GPU raises errors. There are two possible outcomes, depending on whether the `@tf.function` decorator at line 224 of class `Worker` is commented out:

- If `@tf.function` is not commented out, the code shows a warning: `WARNING:tensorflow:5 out of the last 5 calls to <function ...> triggered tf.function retracing`.
- If `@tf.function` is commented out, many different errors show up, ranging from a `TensorArray` not being used to reshape errors during gradient computation. These errors completely block execution.

Again, none of these problems shows up on CPU. I don't know the TF source code, but this makes me suspect naming conflicts for variables stored on the GPU.

### Standalone code to reproduce the issue

Here's a repo containing a comparison between the errors shown on GPU and CPU with `@tf.function` commented out: https://github.com/occhialidaleso/A3C-TF-BUG

### Relevant log output
_No response_
SuryanarayanaY commented 1 year ago

Hi @occhialidaleso, you can safely ignore the warning:

`WARNING:tensorflow:5 out of the last 5 calls to <function ...> triggered tf.function retracing.`

The reasons for this are mentioned below:

Tracing is expensive, and an excessive number of tracings could be due to:
1. creating `@tf.function` repeatedly in a loop,
2. passing tensors with different shapes,
3. passing Python objects instead of tensors.

For (1), please define your `@tf.function` outside of the loop.
For (2), `@tf.function` has a `reduce_retracing=True` option that can avoid unnecessary retracing.
For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing
and https://www.tensorflow.org/api_docs/python/tf/function for more details.
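
A minimal sketch of cases (2)/(3) above (hypothetical, not taken from the reporter's code): passing Python values retraces on every call, while passing tensors, optionally with `reduce_retracing=True`, keeps a single trace.

```python
import tensorflow as tf

# Hypothetical sketch: every distinct Python int is baked into a new graph,
# so this loop traces five times and triggers the retracing warning.
@tf.function
def square(x):
    return x * x

for i in range(5):
    square(i)  # Python ints: one trace per value

# Passing tensors instead keeps one trace, since all int32 scalars share
# a single input signature; reduce_retracing=True further relaxes specs.
@tf.function(reduce_retracing=True)
def square_rt(x):
    return x * x

for i in range(5):
    square_rt(tf.constant(i))  # no warning: single trace
```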

Please refer to the resources attached here: 1 & 2

alebldn commented 1 year ago

Kind sir, thanks for your reply. When using `@tf.function` on the GPU I already ignore the warnings, and it works fine; it trains decently. Unfortunately, when `@tf.function` is commented out, the errors that show up prevent the code from executing any further (I posted two examples in the repo).

SuryanarayanaY commented 1 year ago

Hi @occhialidaleso, could you please confirm whether you have followed all the steps mentioned here for enabling GPU support?

Thank you!

alebldn commented 1 year ago

Sorry, I thought I had mentioned it but I hadn't: I used the other method available on the page and pulled the Docker image tensorflow/tensorflow:latest-gpu-jupyter from Docker Hub, as shown here.

SuryanarayanaY commented 1 year ago

Hi @occhialidaleso ,

Thanks for the confirmation. Could you please share the output of the command below, just to verify whether the GPU installation is successful?

```shell
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
   python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```

Also, the code you attached is very lengthy; debugging it might be difficult and could take a long time. If possible, could you please submit a minimal code snippet that replicates your problem quickly, so we can address the issue precisely?
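
For reference, a minimal snippet in that spirit might look like the following. This is a hypothetical sketch of the pattern the reporter describes (a `tf.TensorArray` accumulated inside a `tf.GradientTape`, as A3C-style rollout loops typically do), not the actual code from the repo; it runs eagerly on CPU, and running it with a GPU visible would exercise the reported code path.

```python
import tensorflow as tf

def rollout_loss(w, steps=5):
    # Accumulate per-step values in a TensorArray, as an A3C rollout would.
    ta = tf.TensorArray(tf.float32, size=steps)
    x = tf.constant(1.0)
    for t in tf.range(steps):
        x = x * w            # toy "policy" update at each step
        ta = ta.write(t, x)  # store the step value
    return tf.reduce_sum(ta.stack())

w = tf.Variable(0.5)
with tf.GradientTape() as tape:
    loss = rollout_loss(w)          # sum of w^t for t = 1..5
grad = tape.gradient(loss, w)       # gradient flows through the TensorArray
```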

Thankyou!

alebldn commented 1 year ago

The output of the provided command is:

```shell
2022-12-05 11:52:16.442813: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-05 11:52:18.095820: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.096095: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.098533: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.098818: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.099018: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.099184: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
```

The output of nvidia-smi during training is:

```shell
| 0  NVIDIA T600  Off | 00000000:01:00.0 Off | N/A |
| 61%  75C  P0  N/A / 41W | 3324MiB / 4096MiB | 45%  Default |
```

So we know TensorFlow does see both GPUs I have on my server, and they are indeed used for training. Unfortunately, I am not able to provide a smaller snippet that reproduces the problem right now.

SuryanarayanaY commented 1 year ago

Thanks @occhialidaleso, I can see the GPU is enabled and working fine. So we have to track down the source of the error.

The errors might also be due to a few ops not working on GPU because they are not yet implemented there. Could you look for any such ops in the code so that we can test them separately? I will do the same in parallel.
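
One general way to check for such ops is to enable device placement logging: with a GPU visible, any op that lacks a GPU kernel will be logged as placed on CPU. This is a standard TensorFlow diagnostic, not something specific to the reporter's code.

```python
import tensorflow as tf

# Log the device each op is placed on. With a GPU available, ops without a
# GPU kernel show up as placed on /device:CPU:0 instead of /device:GPU:0.
tf.debugging.set_log_device_placement(True)

a = tf.random.uniform((2, 2))
b = tf.linalg.matmul(a, a)  # MatMul has a GPU kernel, so it logs GPU placement when one is available
```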

Thank you!

google-ml-butler[bot] commented 1 year ago

This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.

google-ml-butler[bot] commented 1 year ago

Closing as stale. Please reopen if you'd like to work on this further.

google-ml-butler[bot] commented 1 year ago

Are you satisfied with the resolution of your issue?