Closed alebldn closed 1 year ago
Hi @occhialidaleso, you can safely ignore the warning. The reasons for it are given below:
WARNING:tensorflow:5 out of the last 5 calls to triggered tf.function retracing.
Tracing is expensive and the excessive number of tracings could be due to
(1) creating @tf.function repeatedly in a loop,
(2) passing tensors with different shapes,
(3) passing Python objects instead of tensors.
For (1), please define your @tf.function outside of the loop.
For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing.
For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing
and https://www.tensorflow.org/api_docs/python/tf/function for more details.
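As a rough illustration of points (1) and (2) in the warning text, the sketch below defines a `tf.function` once, outside the loop, and sets `reduce_retracing=True` so that inputs of different lengths can share a trace. The names `square_sum` and `traced_shapes` are made up for this example and are not taken from the code in this issue.

```python
import tensorflow as tf

# The Python body of a tf.function only runs while it is being traced,
# so appending here records one entry per trace.
traced_shapes = []

@tf.function(reduce_retracing=True)  # defined once, outside the loop
def square_sum(x):
    traced_shapes.append(x.shape)
    return tf.reduce_sum(x * x)

for n in range(1, 6):
    # A different shape on every call; tensors are passed, not Python numbers.
    square_sum(tf.ones([n], dtype=tf.float32))

print("number of traces:", len(traced_shapes))
print("shapes seen while tracing:", traced_shapes)
```

If the decorator were applied inside the loop instead, every iteration would create a new function object and trace it from scratch.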
Kind sir, thanks for your reply. When using @tf.function on the GPU I already ignore the warnings and it works fine; it trains decently. Unfortunately, when I comment out @tf.function, the errors that show up prevent the code from running any further (I posted two examples in the repo).
Hi @occhialidaleso, could you please confirm whether you have followed all the steps mentioned here for enabling GPU support?
Thank you!
Sorry, I thought I mentioned it but I didn't: I just used the other method available in the page and downloaded the docker image tensorflow/tensorflow:latest-gpu-jupyter from dockerhub as shown here.
Hi @occhialidaleso,
Thanks for the confirmation. Could you please share the output of the command below, just to check whether the GPU installation was successful?
docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
Also, the code you attached is very lengthy, so debugging it may be difficult and take a long time. If possible, could you please submit a minimal code snippet that reproduces the problem quickly, so that we can address the issue precisely?
Thank you!
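For reference, a minimal reproduction for a threading-related report could look roughly like the sketch below: a few Python threads calling the same `tf.function` training step on one shared variable. Everything in it (the variable `w`, `train_step`, `worker`, the toy regression data) is a hypothetical stand-in, not the reporter's A3C code, and it is not known to trigger the reported error.

```python
import threading
import tensorflow as tf

# Hypothetical stand-in for a model shared by all worker threads.
w = tf.Variable(tf.zeros([4, 1]))

@tf.function  # comment this out to compare eager and graph execution
def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    grad = tape.gradient(loss, w)
    w.assign_sub(0.1 * grad)  # plain SGD update on the shared variable
    return loss

def worker(idx, steps, losses):
    loss = None
    for _ in range(steps):
        x = tf.random.normal([8, 4])
        y = tf.reduce_sum(x, axis=1, keepdims=True)  # toy regression target
        loss = train_step(x, y)
    losses[idx] = float(loss)  # last loss seen by this worker

losses = [None, None]
threads = [threading.Thread(target=worker, args=(i, 20, losses)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(losses)
```

Starting from something this small and adding pieces of the real code back one at a time is one way to find the part that fails on GPU.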
The output of the provided command is:
########################################################################################
2022-12-05 11:52:16.442813: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-12-05 11:52:18.095820: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.096095: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.098533: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.098818: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.099018: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-12-05 11:52:18.099184: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
########################################################################################
The output of nvidia-smi during training is:
########################################################################################
| 0 NVIDIA T600 Off | 00000000:01:00.0 Off | N/A |
| 61% 75C P0 N/A / 41W | 3324MiB / 4096MiB | 45% Default |
########################################################################################
So we know it does see both the GPUs I have on my server and they are indeed used to train the AI. Unfortunately, I am not able to provide a smaller snippet to reproduce the problem right now.
Thanks @occhialidaleso, I can see that the GPU is enabled and working fine, so we have to look for the source of the error elsewhere.
The errors might also occur because a few ops do not work on GPU, as they are not yet implemented there. Could you check whether the code uses any such ops, so that we can test them separately? I will look into it in parallel as well.
Thank you!
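One possible way to look for such ops, sketched under assumptions: turn on device placement logging and switch off soft placement, so that an op pinned to a GPU without a GPU kernel should raise an error instead of silently falling back to the CPU. The matmul is only a placeholder for the ops in the real code, and the sketch uses the CPU when no GPU is visible.

```python
import tensorflow as tf

# Log the device every op runs on; set this before any op executes.
tf.debugging.set_log_device_placement(True)
# With soft placement off, TensorFlow should not move an op to another
# device when the requested one has no kernel for it.
tf.config.set_soft_device_placement(False)

gpus = tf.config.list_logical_devices("GPU")
device = gpus[0].name if gpus else "/CPU:0"  # fall back when no GPU is visible

with tf.device(device):
    a = tf.random.uniform([2, 3])
    b = tf.matmul(a, a, transpose_b=True)  # placeholder for the op under test

print("result placed on:", b.device)
```

Replacing the placeholder with individual ops from the training loop, one at a time, would show which of them fail on GPU.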
This issue has been automatically marked as stale because it has no recent activity. It will be closed if no further activity occurs. Thank you.
Closing as stale. Please reopen if you'd like to work on this further.
### Issue Type

Bug

### Source

binary

### Tensorflow Version

(v2.11.0-rc2-17-gd5b57ca93e5 2.11.0) But tried many different 2.11 versions

### Custom Code

Yes

### OS Platform and Distribution

Ubuntu 22.04.1 LTS

### Mobile device

Ubuntu 22.04.1 LTS

### Python version

3.10 and 3.8

### Bazel version

_No response_

### GCC/Compiler version

_No response_

### CUDA/cuDNN version

cuda 11.0

### GPU model and memory

_No response_

### Current Behaviour?

```shell
This is a bug when interacting with the threading library while using GPUs tested in docker environment using the latest tensorflow-gpu build. I even tried to update python to the latest version.

The code is an implementation of Asynchronous Advantage Actor-Critic playing with the Pong library. Even if there might be errors in the code, it works flawlessly when using CPUs. Switching to GPU brings up errors.

Two possible outcomes, depending on commenting the @tf.function code at line 224 of class Worker:
- If @tf.function is not commented, the code shows a warning: "WARNING:tensorflow:5 out of the last 5 calls to