goncinious opened 2 years ago
@goncinious, can you try using the tensorflow/tensorflow:latest-gpu Docker image, which uses the stable version instead of nightly, and let us know if this is still an issue? Thanks!
@sanatmpa1, maybe my description wasn't fully clear - all tests were performed using the tensorflow/tensorflow:latest-gpu image (see a10g.log and t4_tesla.log). I only mentioned tensorflow/tensorflow:nightly-gpu because it gave an additional RESOURCE_EXHAUSTED message (see a10g_tf_nightly.log), which might be useful to pinpoint the issue.
@sachinprasadhs, please let me know if you need any more information about the issue. If not, could you please change the stat:awaiting response label on the issue, so it gets the right visibility?
@reedwm could you take a look?
Can you try again with the latest nightly-gpu build? There's a decent chance that https://github.com/tensorflow/tensorflow/pull/52337, which was merged eight days ago, fixed this.
Thanks @reedwm. I confirmed that the error still occurs on the latest nightly build (sha256:0a8364f4082cc51a1c2de05eb97e92e1d5d1ab8759387f0066d8ea700dec1b94). Full log attached: 20211203_a10g_tf_nightly.log
With a Titan RTX, I can reproduce with CUDA 11.2 and cudnn 8.1.1. However, I cannot reproduce with CUDA 11.3 and cudnn 8.2.4. So presumably this will be fixed when TensorFlow upgrades CUDA and/or cudnn. Even if I limit the memory usage to 10 GB instead of 24 GB, it still runs with CUDA 11.3/cudnn 8.2.4, so there is clearly a memory issue with the earlier version of CUDA/cudnn.
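For context, a sketch of one way such a memory cap can be set in TensorFlow (an assumption, not necessarily how it was done here):

```python
import tensorflow as tf

# Cap the first visible GPU at ~10 GB so TF cannot grab the card's full memory.
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=10 * 1024)])  # in MB
```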
@sanjoy, do you know when we plan on upgrading CUDA and cudnn? Is this planned for TF 2.8?
@awpr, any ideas what could be causing this, and do you know of any way to fix it for CUDA 11.2? It seems strange that an algorithm requires 4.7 GB of memory.
@goncinious, as a temporary workaround, you can manually upgrade CUDA and cudnn if you know how. Understandably, however, this is difficult.
Thanks @reedwm. I've tested with the official NVIDIA TensorRT Docker images, which have newer versions of CUDA/cuDNN, and I still hit the same error (see logs attached).
I've tested two image versions that match your CUDA/cuDNN versions - 21.04 (CUDA 11.3 / cuDNN 8.2.0) and 21.09 (CUDA 11.4 / cuDNN 8.2.4).
docker pull nvcr.io/nvidia/tensorrt:21.04-py3
docker run -it --gpus all --rm -v /home/radyc/test.py:/srv/test.py nvcr.io/nvidia/tensorrt:21.04-py3 /bin/bash
pip install tensorflow
python /srv/test.py
Unfortunately, I still cannot reproduce with the newer CUDA/cudnn versions. I tried running the docker commands in your previous post on an A100 but could not reproduce, even when limiting the memory to 10GiB to try to reproduce the RESOURCE_EXHAUSTED error. I'm guessing this only happens on GPUs with compute capability 8.6, but I don't have access to such GPUs.
@nluehr do you have access to GPUs with compute capability 8.6 that you can try to reproduce this issue on?
Reproduced on an RTX 3090 (compute capability 8.6, 24GB of memory). With cuDNN 8.2.0 (as provided in the referenced TRT container) I see the OOM for mixed_float16, while float32 runs without issue. Updating to cuDNN 8.3.1, I can run both mixed_float16 and float32 without issue.
Ok, so this will be fixed for all tested GPUs in cudnn 8.3.1. @sanjoy, do we plan to update cudnn to at least 8.3.1 anytime soon?
Thank you both for looking into it. I can confirm that updating cuDNN manually to 8.3.1 in the TensorRT 21.04 container, or using TensorRT 21.11 (which uses CUDA 11.5 / cuDNN 8.3.0), fixes the issue.
I have two questions for @nluehr for further clarification:
1. Regarding "Updating to CUDNN 8.3.1, I can run mixed_float16 and float32 without issue." - did you also update CUDA here?
2. Nevertheless, I think this is still a workaround, as we need to manually update the cuDNN version, and this CUDA/cuDNN combination isn't part of TensorFlow's tested build configurations (https://www.tensorflow.org/install/source#gpu). Therefore, knowing whether a CUDA/cuDNN update in TensorFlow will happen (and having an ETA) would be very useful.
You can upgrade cudnn to newer minor versions (e.g., 8.3.0 over 8.2.x) without rebuilding TensorFlow.
I did not update CUDA in my tests. It is generally safe to use a cuDNN built against a later CUDA release of the same major version (e.g., you can use a cuDNN built against CUDA 11.5 with CUDA 11.3). If you also update the CUDA toolkit, I believe you would need to rebuild TensorFlow.
As you point out, "generally works" and officially tested and supported are different things. If you are looking for TensorFlow containers built and tested with the latest cuDNN/CUDA combinations, you might check out the NGC TensorFlow releases.
@nluehr - Thanks for your reply.
After further investigation, I found that while upgrading cuDNN fixed the OOM issue observed with mixed precision on A10G GPUs (CC=8.6), the model output became non-deterministic when running on multiple GPUs with the mirrored strategy (i.e. it gives slightly different outputs every time I run it on the same input volume).
Crucially, I found that the output was deterministic when using only 1 GPU or when I switched to full precision, suggesting that something is broken with mixed precision when used on multiple GPUs with the latest compute capability.
Note that the behaviour was always deterministic when using mixed precision on GPUs with older compute capability (i.e. Tesla T4, which has CC=7.5).
My tests were performed on the latest official TensorFlow GPU Docker image (v2.7.0), using the cuDNN upgrade solution as suggested. See the steps below to reproduce the results obtained:
1. Pull the latest TensorFlow GPU image: docker pull tensorflow/tensorflow:latest-gpu
2. Download the cuDNN 8.3.1 local repo package for Ubuntu 20.04 x86_64 (Deb).
3. Create the test_identical.py script by copying code from Colab.
4. Start an interactive container, mounting the cuDNN package and the script:
$ docker run -it --gpus all --rm -v /home/radyc/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb:/srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb -v /home/radyc/test_identical.py:/srv/test_identical.py tensorflow/tensorflow:latest-gpu /bin/bash
5. Install cuDNN 8.3.1 inside the container:
dpkg -i /srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb
apt-key add /var/cudnn-local-repo-ubuntu2004-8.3.1.22/7fa2af80.pub
apt update
apt install libcudnn8=8.3.1.22-1+cuda11.5 -y
6. Run test_identical.py (first time): python /srv/test_identical.py
7. Run test_identical.py again.
8. An AssertionError is raised, as the outputs are different (see full log a10g_cudnn_updated_test_identical_4_gpu.log attached).

Follow the steps below to check that it works on 1 GPU:
1. rm /srv/output.npy /srv/model.h5
2. Run test_identical.py forcing it to use 1 GPU (first time): CUDA_VISIBLE_DEVICES=0 python /srv/test_identical.py
3. Run it again; no AssertionError is obtained (see full log a10g_cudnn_updated_test_identical_1_gpu.log attached).

Attachments: a10g_cudnn_updated_test_identical_4_gpu.log, a10g_cudnn_updated_test_identical_1_gpu.log
Determinism is not guaranteed by default, and as you observed, nondeterminism might only occur in specific cases. Running tf.config.experimental.enable_op_determinism() should fix it, but note that this is only available in the nightly builds and will also likely reduce performance.
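A minimal sketch of that workaround (assuming a tf-nightly build where this experimental API is available):

```python
import tensorflow as tf

# Available only in tf-nightly at the time of writing; forces deterministic
# op implementations (and disables autotuning), at some performance cost.
tf.config.experimental.enable_op_determinism()
```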
Thank you for your reply, @reedwm.
I do understand why determinism is difficult to guarantee in training (e.g. data sampling randomisation), but at inference time it is harder to understand, as the model and inputs are fixed.
Do you mind expanding a bit on the sources of this non-determinism at inference time? My guess is that the split of both data and model operations across GPUs might cause discrepancies, but more detail would be very helpful.
Do you mind expanding a bit on the sources for these non-determinism at inference time?
The split of data across GPUs can cause discrepancies, but these discrepancies can typically be removed by calling tf.keras.utils.set_random_seed. A major source of nondeterminism comes from the fact that floating-point math is nonassociative, which means the order in which numbers are added can slightly affect the final result (unlike with real numbers). GPU ops often use many threads to add numbers together, so the order in which they are added is often nondeterministic.
Another source of nondeterminism is a process in TensorFlow called "autotuning". For many ops, such as convolutions, there are multiple different algorithms that can be used to compute the op. For example, convolutions can be computed as FFTs, or using matrix multiplications, or with various other algorithms. With autotuning, TensorFlow tries each algorithm the first time the op is run, then uses the fastest algorithm for subsequent runs of the op. However, if multiple algorithms take approximately the same amount of time to run, it is nondeterministic which algorithm will be fastest, so the algorithm TensorFlow selects is nondeterministic. Different algorithms may have slightly different results on the same inputs, so autotuning can cause nondeterminism. Autotuning is disabled by tf.config.experimental.enable_op_determinism().
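As a small illustrative check of this (hypothetical model file and input shape, mirroring what test_identical.py does):

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(42)                # seeds the Python, NumPy and TF RNGs
tf.config.experimental.enable_op_determinism()    # tf-nightly only at this point

model = tf.keras.models.load_model("model.h5")          # hypothetical saved model
x = np.ones((1, 320, 320, 320, 1), dtype=np.float32)    # hypothetical input block

out1 = model.predict(x, batch_size=1)
out2 = model.predict(x, batch_size=1)
np.testing.assert_allclose(out1, out2)  # should pass once determinism is enabled
```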
Thanks @reedwm for your insights on GPU determinism - that was very useful.
Previously, I found that the cuDNN upgrade fixed the OOM issue observed with mixed precision on A10G GPUs (CC=8.6) (see https://github.com/keras-team/tf-keras/issues/125). However, I then found that it breaks on a GPU with older compute capability (NVIDIA Tesla T4, CC=7.5) with a RESOURCE_EXHAUSTED error (see below).
In summary:
2022-03-07 10:24:02.868192: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_ops_3d.cc:1514 : NOT_FOUND: No algorithm worked! Error messages:
Profiling failure on CUDNN engine 1#TC: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
Profiling failure on CUDNN engine 1: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
Traceback (most recent call last):
File "test.py", line 95, in <module>
test_model()
File "test.py", line 21, in test_model
model.predict(test_input, verbose=1, batch_size=1)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:
Detected at node 'model/conv3d_transpose_3/conv3d_transpose' defined at (most recent call last):
File "test.py", line 95, in <module>
test_model()
File "test.py", line 21, in test_model
model.predict(test_input, verbose=1, batch_size=1)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1982, in predict
tmp_batch_outputs = self.predict_function(iterator)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1801, in predict_function
return step_function(self, iterator)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1783, in run_step
outputs = model.predict_step(data)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1751, in predict_step
return self(x, training=False)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 1096, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 92, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/functional.py", line 451, in call
return self._run_internal_graph(
File "/usr/local/lib/python3.8/dist-packages/keras/engine/functional.py", line 589, in _run_internal_graph
outputs = node.layer(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 1096, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 92, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/layers/convolutional.py", line 1648, in call
outputs = tf.nn.conv3d_transpose(
Node: 'model/conv3d_transpose_3/conv3d_transpose'
No algorithm worked! Error messages:
Profiling failure on CUDNN engine 1#TC: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
Profiling failure on CUDNN engine 1: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
[[{{node model/conv3d_transpose_3/conv3d_transpose}}]] [Op:__inference_predict_function_1150]
My tests were performed on the latest official TensorFlow GPU Docker image (v2.8.0), using the cuDNN upgrade solution as suggested. See the steps below to reproduce the results obtained:
1. Pull the latest TensorFlow GPU image: docker pull tensorflow/tensorflow:latest-gpu
2. Download the cuDNN 8.3.1 local repo package for Ubuntu 20.04 x86_64 (Deb).
3. Create the test.py script by copying code from Colab.
4. Start an interactive container, mounting the cuDNN package and the script:
$ docker run -it --gpus all --rm -v /home/radyc/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb:/srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb -v /home/radyc/test.py:/srv/test.py tensorflow/tensorflow:latest-gpu /bin/bash
5. Install cuDNN 8.3.1 inside the container:
dpkg -i /srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb
apt-key add /var/cudnn-local-repo-ubuntu2004-8.3.1.22/7fa2af80.pub
apt update
apt install libcudnn8=8.3.1.22-1+cuda11.5 -y
6. Run test.py: python /srv/test.py
The error is because you are running out of memory. It's possible future cuDNN versions use more memory, although the overall memory usage of the model should not significantly increase.
When you got the RESOURCE_EXHAUSTED error, did you have determinism enabled? Unfortunately, determinism can cause a lot more memory to be used, and the extra memory used can vary greatly depending on the version of cuDNN.
Thanks - no, determinism isn't enabled in the script I'm using, so cuDNN >= 8.3 seems to be using more memory than before.
Given that I'm restricted to this model size and input shape on a Tesla T4 GPU (16GB) - are there any options I could try to reduce the memory usage?
@nluehr, @awpr, any ideas why cuDNN 8.3 is using more memory than 8.1 on certain GPUs (despite using less on others)? I think the frontend API is not being used in either case, since TF is still compiled with cudnn 8.1, so it's not due to the frontend API. It's suspicious that 4718624784 bytes (4.7 GB) is being allocated.
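As a side note, an assumed way to check which CUDA/cuDNN versions the installed TF wheel was built against (as opposed to the runtime version reported in the "Loaded cuDNN version ..." log line):

```python
import tensorflow as tf

# Keys/values vary by build, but typically include cuda_version and cudnn_version.
print(tf.sysconfig.get_build_info())
```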
@goncinious the only advice I have is to try reducing the batch size. If training, tf.recompute_grad can help memory usage, although I haven't personally used it.
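For what it's worth, a rough sketch of how tf.recompute_grad can be applied (illustrative layer sizes, not the model from this thread; relevant to training only, not the inference case discussed here):

```python
import tensorflow as tf

# Illustrative 3D conv block (not the U-Net from this issue).
block = tf.keras.Sequential([
    tf.keras.layers.Conv3D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv3D(32, 3, padding="same", activation="relu"),
])
block.build((None, 64, 64, 64, 1))  # create the variables up front

# Activations inside `block` are recomputed during the backward pass
# instead of being kept in memory, trading compute for memory.
checkpointed_block = tf.recompute_grad(block)

x = tf.random.normal([1, 64, 64, 64, 1])
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(checkpointed_block(x))
grads = tape.gradient(loss, block.trainable_variables)
```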
I don't know why cuDNN 8.3 would use more memory, but it might be informative to check what algorithm and how much scratch space 8.1 was using for the same op, to see whether the newer version is using more scratch space for the same algorithms or is no longer able to use a different less memory-hungry algorithm -- that would help narrow down whether the issue is with over-allocating memory or with breaking/removing an algorithm we were previously relying on.
I can repro the errors found in https://github.com/keras-team/tf-keras/issues/125.
I think this is an issue caused by the heuristics of cuDNN, which keep changing from version to version and also return different results across platforms. So I would suggest updating to the latest cuDNN, as you have already done.
Then, you can try either of these two ways to work around the issue:
1. Use the cuDNN frontend API: TF_CUDNN_USE_FRONTEND=1 python test.py
2. Increase the workspace limit: TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000 python test.py

The weird thing about this issue is that the algorithm requiring 4718624784 bytes of workspace should be skipped in the first place, before even attempting the allocation, since it exceeds the default max limit of 4GB. I am still investigating the root cause, but I think the above should be sufficient to help in this case. Please let me know if that works for you. @goncinious
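Equivalently, an assumed sketch of setting the same environment variables from inside the script (they need to take effect before TensorFlow runs any convolution, so setting them before the import is the safe option):

```python
import os

# Either variable alone was reported to work around the issue; shown together here.
os.environ["TF_CUDNN_USE_FRONTEND"] = "1"
os.environ["TF_CUDNN_WORKSPACE_LIMIT_IN_MB"] = "5000"

import tensorflow as tf
# ... build/load the model and call model.predict() as in test.py
```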
I think the root cause is that the workspace used by CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 varies as below:

| cuDNN | T4 | 3090 |
| --- | --- | --- |
| 8.1.0.77 | 131MB | 4.7GB |
| 8.3.2.44 | 4.7GB | 0 |
Since this algo is the only one that works for this conv case and we have a 4GB max limit for the allocator, the two "4.7GB" cases will simply fail, which matches your observation in https://github.com/keras-team/tf-keras/issues/125.
I will file a bug with our cuDNN team. On your side, please try the above WARs (workarounds) for now. Thanks.
@kaixih, thank you very much for investigating and finding what looks like the root cause.
I can confirm that both solutions (TF_CUDNN_USE_FRONTEND=1 and TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000) fix the issue reported.
Note that I noticed a significant difference in initialisation time between the two methods (it hangs for a bit after "tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8301"), where TF_CUDNN_USE_FRONTEND is slower than TF_CUDNN_WORKSPACE_LIMIT_IN_MB. Running times of python test.py:
TF_CUDNN_USE_FRONTEND
5/5 [==============================] - 35s 539ms/step
real 0m42.214s
user 0m39.690s
sys 0m4.990s
TF_CUDNN_WORKSPACE_LIMIT_IN_MB
5/5 [==============================] - 9s 558ms/step
real 0m16.804s
user 0m14.353s
sys 0m4.901s
I guess the new API takes longer to load than the older one, but is that expected? Do you know in which version of cuDNN the new frontend API will become the default?
I will file a bug to our cudnn team. And on your side, please try the above WARs for now. Thanks.
Thank you. Will the ticket be available somewhere I can access? This would be useful, so I can track progress on it as well.
Yes, with the frontend API the warmup usually takes longer, since more engines are exposed than with the previous algorithm-based API and sweeping over them takes more time. But after the autotuning phase, the frontend API should be faster than or at least equal to the old APIs. If not, there is a bug.
Do you know in which version of cuDNN the new frontend API will become "default"?
I believe the frontend API will become the default when TF is built against cuDNN 8.2 or later. We (NVIDIA) recommend using the frontend API and updating cuDNN to the latest version.
Will the ticket be available somewhere I can access? This would be useful, so I can track progress on it as well.
I have already created the bug ticket, but it is internal to NVIDIA - sorry about that. I can update this thread when I get some feedback. By the way, can you share what the use case is at a high level? Is it a real model or just benchmarking? Thanks.
Thanks a lot for the very quick turnaround and for the informative answers!
I found a use case where the two proposed environment-variable solutions differ with respect to the number of blocks fed to model.predict() on a Tesla T4 GPU: TF_CUDNN_USE_FRONTEND works while TF_CUDNN_WORKSPACE_LIMIT_IN_MB fails.
The latter case, with TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000, fails with an OOM error (see full log attached):
Node: 'model/conv3d_transpose_3/conv3d_transpose'
No algorithm worked! Error messages:
Profiling failure on CUDNN engine 1#TC: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4735402000 bytes.
Profiling failure on CUDNN engine 1: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4735402000 bytes.
[[{{node model/conv3d_transpose_3/conv3d_transpose}}]] [Op:__inference_predict_function_1150]
It can be reproduced by following the same steps as in https://github.com/keras-team/tf-keras/issues/125 and replacing the number of blocks 5->12 (line 18) in the Colab.
Not sure if this adds more information to the issue found, but I wanted to ask what you think: why does it fail with TF_CUDNN_WORKSPACE_LIMIT_IN_MB and not with TF_CUDNN_USE_FRONTEND? More generally, since each block should be processed sequentially by the GPU (with batch size=1), do we expect the memory usage to increase with the number of blocks?
I have already created the bug ticket, but, it is internal to NVIDIA. Sorry for that. I think I can update this thread when I get some feedback.
Thank you - updating it here should be fine.
Btw, can you please share what is the use case in high level? Is that a real model or just some benchmarking? Thanks.
Sure, it's a real clinical use case: a 3D U-Net for segmentation of a large organ from a CT scan given as input. The CT is too large to fit in GPU memory, so the input is first split into large blocks (each of shape 320^3 voxels), which are then fed to the model. Having a large block is important here, as we want to capture as much context as possible.
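For illustration only, an assumed sketch of that block-wise inference (non-overlapping blocks; a real pipeline would typically pad the volume and may use overlapping blocks):

```python
import numpy as np

BLOCK = 320  # block edge length in voxels, as described above

def split_into_blocks(ct_volume):
    """Split a (D, H, W) CT volume into non-overlapping 320^3 blocks."""
    d, h, w = ct_volume.shape
    blocks = []
    for z in range(0, d - BLOCK + 1, BLOCK):
        for y in range(0, h - BLOCK + 1, BLOCK):
            for x in range(0, w - BLOCK + 1, BLOCK):
                block = ct_volume[z:z + BLOCK, y:y + BLOCK, x:x + BLOCK]
                blocks.append(block[..., np.newaxis])  # add channel dimension
    return np.stack(blocks)  # shape: (n_blocks, 320, 320, 320, 1)

# blocks = split_into_blocks(ct)                      # ct: (D, H, W) float32 array
# predictions = model.predict(blocks, batch_size=1)   # one block at a time on the GPU
```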
Thanks for sharing the info.
More generally, since each block should be processed sequentially by the GPU (with batch size=1), do we expect the memory usage to increase with the #blocks?
Yes, more blocks mean you use more layers, and more weights will stay in GPU memory. As mentioned in https://github.com/keras-team/tf-keras/issues/125, we generally recommend that users switch to the frontend API.
Sorry if I didn't explain that well - by blocks I meant the number of inputs passed to model.predict() while keeping batch size = 1 (e.g., if I'm feeding a 10x320^3x1 input, I expect the model to process each 1x320^3x1 block 10 times, one at a time). This issue has also been observed in https://github.com/tensorflow/tensorflow/issues/40547.
Ah, I see. I only noticed the depth = 5 and thought you were talking about the network's building blocks.
Anyway, in this case, I think the size of the model's weights should be constant. I actually tried your Colab code, modifying it to:
test_input = np.ones(shape=(15, *input_shape), dtype=np.float32)
and running with CUDA_VISIBLE_DEVICES=4 TF_CUDNN_USE_FRONTEND=0 TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000 on a T4 16GB GPU. Despite one warning, the execution works fine:
1/15 [=>............................] - ETA: 1:392022-03-15 01:02:39.071383: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.47GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
15/15 [==============================] - 19s 822ms/step
I believe you didn't upgrade cuDNN to >= 8.3 before running the script. However, we do need to upgrade it so that inference works without OOM errors on both Tesla T4 (compute capability 7.5) and A10G (compute capability 8.6) GPUs. To upgrade cuDNN, I'm using the cuDNN download and install steps in https://github.com/keras-team/tf-keras/issues/125. If you first upgrade cuDNN and then run the script with #inputs=15 (as you did), you should be able to reproduce my findings.
A large 3D U-Net model configured with mixed precision fails with "No algorithm worked!" (see full a10g.log attached) when running inference on an NVIDIA A10G 20GB GPU (compute capability 8.6). Using the tensorflow/tensorflow:nightly-gpu Docker image, the error points to an out-of-memory issue (see full log a10g_tf_nightly.log attached).
I'm able to overcome the issue by using full precision instead, i.e. by setting mixed_precision.set_global_policy("float32").
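A minimal sketch of that switch (assuming the tf.keras mixed precision API):

```python
from tensorflow.keras import mixed_precision

# Workaround described above: run the model in full float32 instead of
# mixed_float16, avoiding the OOM at the cost of mixed precision's speed benefits.
mixed_precision.set_global_policy("float32")
```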
The same model configured with mixed precision works fine on the previous-generation Tesla T4 GPU (compute capability 7.5), which has even less GPU memory - 16GB (see full t4_tesla.log attached).

System information
- tensorflow:latest-gpu Docker image (sha256@fc5eb0604722c7bef7b499bb007b3050c4beec5859c2e0d4409d2cca5c14d442)
- nvidia-smi outputs for both GPU types provided in attachments.

Describe the expected behavior
Mixed precision mode should not exhaust all GPU memory on the newest generation of NVIDIA A10G.
Standalone code to reproduce the issue
Steps to reproduce:
1. Start instance with A10G GPU
2. Start interactive Docker container and pass test.py (copy from Colab)
3. Run script
4. Repeat steps using Tesla T4 (no error obtained)
Other info / logs
a10g.log, a10g_tf_nightly.log, t4_tesla.log, a10g_nvidia_smi.log, t4_tesla_nvidia_smi.log