goncinious opened 2 years ago
@goncinious, can you try using the tensorflow/tensorflow:latest-gpu Docker image, which uses the stable version instead of nightly, and let us know if this is still an issue? Thanks!
@sanatmpa1, maybe my description wasn't fully clear - all tests were performed using the tensorflow/tensorflow:latest-gpu image (see a10g.log and t4_tesla.log). I only mentioned tensorflow/tensorflow:nightly-gpu because it gave an additional RESOURCE_EXHAUSTED message (see a10g_tf_nightly.log), which might be useful to pinpoint the issue.
@sachinprasadhs, please let me know if you need any more information about the issue. If not, could you please change the stat:awaiting response label on the issue, so it gets the right visibility?
@reedwm could you take a look?
Can you try again with the latest nightly-gpu build? There's a decent chance that https://github.com/tensorflow/tensorflow/pull/52337, which was merged eight days ago, fixed this.
Thanks @reedwm. I confirmed that the error still occurs on the latest nightly build (sha256:0a8364f4082cc51a1c2de05eb97e92e1d5d1ab8759387f0066d8ea700dec1b94). Full log attached: 20211203_a10g_tf_nightly.log
With a Titan RTX, I can reproduce with CUDA 11.2 and cudnn 8.1.1. However, I cannot reproduce with CUDA 11.3 and cudnn 8.2.4. So presumably this will be fixed when TensorFlow upgrades CUDA and/or cudnn. Even if I limit the memory usage to 10 GB instead of 24 GB, it still runs with CUDA 11.3/cudnn 8.2.4, so there is clearly a memory issue with the earlier version of CUDA/cudnn.
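For context, a sketch of one way such a memory cap can be set in TensorFlow (an assumption, not necessarily how it was done here):

```python
import tensorflow as tf

# Cap the first visible GPU at ~10 GB so TF cannot grab the card's full memory.
gpus = tf.config.list_physical_devices("GPU")
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=10 * 1024)])  # in MB
```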
@sanjoy, do you know when we plan on upgrading CUDA and cudnn? Is this planned for TF 2.8?
@awpr, any ideas what could be causing this, and do you know of any way to fix it for CUDA 11.2? It seems strange that an algorithm requires 4.7 GB of memory.
@goncinious, as a temporary workaround, you can manually upgrade CUDA and cudnn if you know how. Understandably, however, this is difficult.
Thanks @reedwm. I've tested with the official NVIDIA TensorRT Docker images, which have newer versions of CUDA/cuDNN, and I still hit the same error (see logs attached).
I've tested two image versions that match your CUDA/cuDNN versions - 21.04 (CUDA 11.3 / cuDNN 8.2.0) and 21.09 (CUDA 11.4 / cuDNN 8.2.4).
docker pull nvcr.io/nvidia/tensorrt:21.04-py3
docker run -it --gpus all --rm -v /home/radyc/test.py:/srv/test.py nvcr.io/nvidia/tensorrt:21.04-py3 /bin/bash
pip install tensorflow
python /srv/test.py
Unfortunately, I still cannot reproduce with the newer CUDA/cudnn versions. I tried running the docker commands in your previous post on an A100 but could not reproduce, even when limiting the memory to 10GiB to try to reproduce the RESOURCE_EXHAUSTED error. I'm guessing this only happens on GPUs with compute capability 8.6, but I don't have access to such GPUs.
@nluehr do you have access to GPUs with compute capability 8.6 that you can try to reproduce this issue on?
Reproduced on an RTX 3090 (compute capability 8.6, 24GB of memory). With cuDNN 8.2.0 (as provided in the referenced TRT container) I see the OOM for mixed_float16, while float32 runs without issue. Updating to cuDNN 8.3.1, I can run both mixed_float16 and float32 without issue.
Ok, so this will be fixed for all tested GPUs in cudnn 8.3.1. @sanjoy, do we plan to update cudnn to at least 8.3.1 anytime soon?
Thank you both for looking into it. I can confirm that updating cuDNN manually to 8.3.1 in the TensorRT 21.04 container, or using TensorRT 21.11 (which uses CUDA 11.5 / cuDNN 8.3.0), fixes the issue.
I have two questions for @nluehr for further clarification:
1. Regarding "Updating to CUDNN 8.3.1, I can run mixed_float16 and float32 without issue." - did you also update CUDA here?
2. Nevertheless, I think this is still a workaround, as we need to manually update the cuDNN version, and this CUDA/cuDNN combination isn't part of TensorFlow's tested build configurations (https://www.tensorflow.org/install/source#gpu). Therefore, knowing whether a CUDA/cuDNN update in TensorFlow will happen (and having an ETA) would be very useful.
You can upgrade cudnn to newer minor versions (e.g., 8.3.0 over 8.2.x) without rebuilding TensorFlow.
I did not update CUDA in my tests. It is generally safe to use a cuDNN built against a later CUDA release of the same major version (e.g., you can use a cuDNN built against CUDA 11.5 with CUDA 11.3). If you also update the CUDA toolkit, I believe you would need to rebuild TensorFlow.
As you point out, "generally works" and officially tested and supported are different things. If you are looking for TensorFlow containers built and tested with the latest cuDNN/CUDA combinations, you might check out the NGC TensorFlow releases.
@nluehr - Thanks for your reply.
After further investigation, I found that while upgrading cuDNN fixed the OOM issue observed with mixed precision on A10G GPUs (CC=8.6), the model output became non-deterministic when running on multiple GPUs with the mirrored strategy (i.e. it gives slightly different outputs every time I run it on the same input volume).
Crucially, I found that the output was deterministic when using only 1 GPU or when I switched to full precision, suggesting that something is broken with mixed precision when used on multiple GPUs with the latest compute capability.
Note that the behaviour was always deterministic when using mixed precision on GPUs with older compute capability (i.e. Tesla T4, which has CC=7.5).
My tests were performed on the latest official TensorFlow GPU Docker image (v2.7.0), using the cuDNN upgrade solution as suggested. See the steps below to reproduce the results obtained:
1. Pull the latest TensorFlow GPU image: docker pull tensorflow/tensorflow:latest-gpu
2. Download the cuDNN 8.3.1 local repo package for Ubuntu 20.04 x86_64 (Deb).
3. Create the test_identical.py script by copying code from Colab.
4. Start an interactive container, mounting the cuDNN package and the script:
$ docker run -it --gpus all --rm -v /home/radyc/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb:/srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb -v /home/radyc/test_identical.py:/srv/test_identical.py tensorflow/tensorflow:latest-gpu /bin/bash
5. Install cuDNN 8.3.1 inside the container:
dpkg -i /srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb
apt-key add /var/cudnn-local-repo-ubuntu2004-8.3.1.22/7fa2af80.pub
apt update
apt install libcudnn8=8.3.1.22-1+cuda11.5 -y
6. Run test_identical.py (first time): python /srv/test_identical.py
7. Run test_identical.py again.
8. An AssertionError is raised, as the outputs are different (see full log a10g_cudnn_updated_test_identical_4_gpu.log attached).

Follow the steps below to check that it works on 1 GPU:
1. rm /srv/output.npy /srv/model.h5
2. Run test_identical.py forcing it to use 1 GPU (first time): CUDA_VISIBLE_DEVICES=0 python /srv/test_identical.py
3. Run it again; no AssertionError is obtained (see full log a10g_cudnn_updated_test_identical_1_gpu.log attached).

Attachments: a10g_cudnn_updated_test_identical_4_gpu.log, a10g_cudnn_updated_test_identical_1_gpu.log
Determinism is not guaranteed by default, and as you observed, nondeterminism might only occur in specific cases. Running tf.config.experimental.enable_op_determinism() should fix it, but note that this is only available in the nightly builds and will also likely reduce performance.
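A minimal sketch of that workaround (assuming a tf-nightly build where this experimental API is available):

```python
import tensorflow as tf

# Available only in tf-nightly at the time of writing; forces deterministic
# op implementations (and disables autotuning), at some performance cost.
tf.config.experimental.enable_op_determinism()
```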
Thank you for your reply, @reedwm.
I do understand why determinism is difficult to guarantee in training (e.g. data sampling randomisation), but at inference time it is harder to understand, as the model and inputs are fixed.
Do you mind expanding a bit on the sources of this non-determinism at inference time? My guess is that the split of both data and model operations across GPUs might cause discrepancies, but more detail would be very helpful.
Do you mind expanding a bit on the sources for these non-determinism at inference time?
The split of data across GPUs can cause discrepancies, but these discrepancies can typically be removed by calling tf.keras.utils.set_random_seed. A major source of nondeterminism comes from the fact that floating-point math is nonassociative, which means the order in which numbers are added can slightly affect the final result (unlike with real numbers). GPU ops often use many threads to add numbers together, so the order in which they are added is often nondeterministic.
Another source of nondeterminism is a process in TensorFlow called "autotuning". For many ops, such as convolutions, there are multiple different algorithms that can be used to compute the op. For example, convolutions can be computed as FFTs, or using matrix multiplications, or with various other algorithms. With autotuning, TensorFlow tries each algorithm the first time the op is run, then uses the fastest algorithm for subsequent runs of the op. However, if multiple algorithms take approximately the same amount of time to run, it is nondeterministic which algorithm will be fastest, so the algorithm TensorFlow selects is nondeterministic. Different algorithms may have slightly different results on the same inputs, so autotuning can cause nondeterminism. Autotuning is disabled by tf.config.experimental.enable_op_determinism().
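As a small illustrative check of this (hypothetical model file and input shape, mirroring what test_identical.py does):

```python
import numpy as np
import tensorflow as tf

tf.keras.utils.set_random_seed(42)                # seeds the Python, NumPy and TF RNGs
tf.config.experimental.enable_op_determinism()    # tf-nightly only at this point

model = tf.keras.models.load_model("model.h5")          # hypothetical saved model
x = np.ones((1, 320, 320, 320, 1), dtype=np.float32)    # hypothetical input block

out1 = model.predict(x, batch_size=1)
out2 = model.predict(x, batch_size=1)
np.testing.assert_allclose(out1, out2)  # should pass once determinism is enabled
```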
Thanks @reedwm for your insights on GPU determinism - that was very useful.
Previously, I found that the cuDNN upgrade fixed the OOM issue observed with mixed precision on A10G GPUs (CC=8.6) (see https://github.com/keras-team/tf-keras/issues/125). However, I then found that it breaks on a GPU with older compute capability (NVIDIA Tesla T4, CC=7.5) with a RESOURCE_EXHAUSTED error (see below).
In summary:
2022-03-07 10:24:02.868192: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at conv_grad_ops_3d.cc:1514 : NOT_FOUND: No algorithm worked! Error messages:
Profiling failure on CUDNN engine 1#TC: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
Profiling failure on CUDNN engine 1: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
Traceback (most recent call last):
File "test.py", line 95, in <module>
test_model()
File "test.py", line 21, in test_model
model.predict(test_input, verbose=1, batch_size=1)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/eager/execute.py", line 54, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.NotFoundError: Graph execution error:
Detected at node 'model/conv3d_transpose_3/conv3d_transpose' defined at (most recent call last):
File "test.py", line 95, in <module>
test_model()
File "test.py", line 21, in test_model
model.predict(test_input, verbose=1, batch_size=1)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1982, in predict
tmp_batch_outputs = self.predict_function(iterator)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1801, in predict_function
return step_function(self, iterator)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1790, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1783, in run_step
outputs = model.predict_step(data)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/training.py", line 1751, in predict_step
return self(x, training=False)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 1096, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 92, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/functional.py", line 451, in call
return self._run_internal_graph(
File "/usr/local/lib/python3.8/dist-packages/keras/engine/functional.py", line 589, in _run_internal_graph
outputs = node.layer(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/engine/base_layer.py", line 1096, in __call__
outputs = call_fn(inputs, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/utils/traceback_utils.py", line 92, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/keras/layers/convolutional.py", line 1648, in call
outputs = tf.nn.conv3d_transpose(
Node: 'model/conv3d_transpose_3/conv3d_transpose'
No algorithm worked! Error messages:
Profiling failure on CUDNN engine 1#TC: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
Profiling failure on CUDNN engine 1: RESOURCE_EXHAUSTED: Allocating 4718624784 bytes exceeds the memory limit of 4294967296 bytes.
[[{{node model/conv3d_transpose_3/conv3d_transpose}}]] [Op:__inference_predict_function_1150]
My tests were performed on the latest official TensorFlow GPU Docker image (v2.8.0), using the cuDNN upgrade solution as suggested. See the steps below to reproduce the results obtained:
1. Pull the latest TensorFlow GPU image: docker pull tensorflow/tensorflow:latest-gpu
2. Download the cuDNN 8.3.1 local repo package for Ubuntu 20.04 x86_64 (Deb).
3. Create the test.py script by copying code from Colab.
4. Start an interactive container, mounting the cuDNN package and the script:
$ docker run -it --gpus all --rm -v /home/radyc/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb:/srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb -v /home/radyc/test.py:/srv/test.py tensorflow/tensorflow:latest-gpu /bin/bash
5. Install cuDNN 8.3.1 inside the container:
dpkg -i /srv/cudnn-local-repo-ubuntu2004-8.3.1.22_1.0-1_amd64.deb
apt-key add /var/cudnn-local-repo-ubuntu2004-8.3.1.22/7fa2af80.pub
apt update
apt install libcudnn8=8.3.1.22-1+cuda11.5 -y
6. Run test.py: python /srv/test.py
The error is because you are running out of memory. It's possible future cuDNN versions use more memory, although the overall memory usage of the model should not significantly increase.
When you got the RESOURCE_EXHAUSTED error, did you have determinism enabled? Unfortunately, determinism can cause a lot more memory to be used, and the extra memory used can vary greatly depending on the version of cuDNN.
Thanks - no, determinism isn't enabled in the script I'm using, so cuDNN >= 8.3 seems to be using more memory than before.
Given that I'm restricted to this model size and input shape on a Tesla T4 GPU (16GB) - are there any options I could try to reduce the memory usage?
@nluehr, @awpr, any ideas why cuDNN 8.3 is using more memory than 8.1 on certain GPUs (despite using less on others)? I think the frontend API is not being used in either case, since TF is still compiled with cudnn 8.1, so it's not due to the frontend API. It's suspicious that 4718624784 bytes (4.7 GB) is being allocated.
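As a side note, an assumed way to check which CUDA/cuDNN versions the installed TF wheel was built against (as opposed to the runtime version reported in the "Loaded cuDNN version ..." log line):

```python
import tensorflow as tf

# Keys/values vary by build, but typically include cuda_version and cudnn_version.
print(tf.sysconfig.get_build_info())
```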
@goncinious the only advice I have is to try reducing the batch size. If training, tf.recompute_grad can help memory usage, although I haven't personally used it.
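For what it's worth, a rough sketch of how tf.recompute_grad can be applied (illustrative layer sizes, not the model from this thread; relevant to training only, not the inference case discussed here):

```python
import tensorflow as tf

# Illustrative 3D conv block (not the U-Net from this issue).
block = tf.keras.Sequential([
    tf.keras.layers.Conv3D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv3D(32, 3, padding="same", activation="relu"),
])
block.build((None, 64, 64, 64, 1))  # create the variables up front

# Activations inside `block` are recomputed during the backward pass
# instead of being kept in memory, trading compute for memory.
checkpointed_block = tf.recompute_grad(block)

x = tf.random.normal([1, 64, 64, 64, 1])
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(checkpointed_block(x))
grads = tape.gradient(loss, block.trainable_variables)
```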
I don't know why cuDNN 8.3 would use more memory, but it might be informative to check what algorithm and how much scratch space 8.1 was using for the same op, to see whether the newer version is using more scratch space for the same algorithms or is no longer able to use a different less memory-hungry algorithm -- that would help narrow down whether the issue is with over-allocating memory or with breaking/removing an algorithm we were previously relying on.
I can repro the errors found in https://github.com/keras-team/tf-keras/issues/125.
I think this is an issue caused by the heuristics of cuDNN, which keep changing from version to version and also return different results across platforms. So I would suggest updating to the latest cuDNN, as you have already done.
Then, you can try either of these two ways to work around the issue:
1. Use the cuDNN frontend API: TF_CUDNN_USE_FRONTEND=1 python test.py
2. Increase the workspace limit: TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000 python test.py

The weird thing about this issue is that the algorithm requiring 4718624784 bytes of workspace should be skipped in the first place, before even attempting the allocation, since it exceeds the default max limit of 4GB. I am still investigating the root cause, but I think the above should be sufficient to help in this case. Please let me know if that works for you. @goncinious
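Equivalently, an assumed sketch of setting the same environment variables from inside the script (they need to take effect before TensorFlow runs any convolution, so setting them before the import is the safe option):

```python
import os

# Either variable alone was reported to work around the issue; shown together here.
os.environ["TF_CUDNN_USE_FRONTEND"] = "1"
os.environ["TF_CUDNN_WORKSPACE_LIMIT_IN_MB"] = "5000"

import tensorflow as tf
# ... build/load the model and call model.predict() as in test.py
```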
I think the root cause is that the workspace used by CUDNN_CONVOLUTION_BWD_DATA_ALGO_1 varies as below:

| cuDNN | T4 | 3090 |
| --- | --- | --- |
| 8.1.0.77 | 131MB | 4.7GB |
| 8.3.2.44 | 4.7GB | 0 |
Since this algo is the only one that works for this conv case and we have a 4GB max limit for the allocator, the two "4.7GB" cases will simply fail, which matches your observation in https://github.com/keras-team/tf-keras/issues/125.
I will file a bug with our cuDNN team. On your side, please try the above WARs (workarounds) for now. Thanks.
@kaixih, thank you very much for investigating and finding what looks like the root cause.
I can confirm that both solutions (TF_CUDNN_USE_FRONTEND=1 and TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000) fix the issue reported.
Note that I noticed a significant difference in initialisation time between the two methods (it hangs for a bit after "tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8301"), where TF_CUDNN_USE_FRONTEND is slower than TF_CUDNN_WORKSPACE_LIMIT_IN_MB. Running times of python test.py:
TF_CUDNN_USE_FRONTEND
5/5 [==============================] - 35s 539ms/step
real 0m42.214s
user 0m39.690s
sys 0m4.990s
TF_CUDNN_WORKSPACE_LIMIT_IN_MB
5/5 [==============================] - 9s 558ms/step
real 0m16.804s
user 0m14.353s
sys 0m4.901s
I guess the new API takes longer to load than the older one, but is that expected? Do you know in which version of cuDNN the new frontend API will become the default?
I will file a bug to our cudnn team. And on your side, please try the above WARs for now. Thanks.
Thank you. Will the ticket be available somewhere I can access? This would be useful, so I can track progress on it as well.
Yes, with the frontend API the warmup usually takes longer, since more engines are exposed than with the previous algorithm-based API and sweeping over them takes more time. But after the autotuning phase, the frontend API should be faster than or at least equal to the old APIs. If not, there is a bug.
Do you know in which version of cuDNN the new frontend API will become "default"?
I believe the frontend API will become the default when TF is built against cuDNN 8.2 or later. We (NVIDIA) recommend using the frontend API and updating cuDNN to the latest version.
Will the ticket be available somewhere I can access? This would be useful, so I can track progress on it as well.
I have already created the bug ticket, but it is internal to NVIDIA - sorry about that. I can update this thread when I get some feedback. By the way, can you share what the use case is at a high level? Is it a real model or just benchmarking? Thanks.
Thanks a lot for the very quick turnaround and for the informative answers!
I found a use case where the two proposed environment-variable solutions differ with respect to the number of blocks fed to model.predict() on a Tesla T4 GPU: TF_CUDNN_USE_FRONTEND works while TF_CUDNN_WORKSPACE_LIMIT_IN_MB fails.
The latter case, with TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000, fails with an OOM error (see full log attached):
Node: 'model/conv3d_transpose_3/conv3d_transpose'
No algorithm worked! Error messages:
Profiling failure on CUDNN engine 1#TC: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4735402000 bytes.
Profiling failure on CUDNN engine 1: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 4735402000 bytes.
[[{{node model/conv3d_transpose_3/conv3d_transpose}}]] [Op:__inference_predict_function_1150]
It can be reproduced by following the same steps as in https://github.com/keras-team/tf-keras/issues/125 and replacing the number of blocks 5->12 (line 18) in the Colab.
Not sure if this adds more information to the issue found, but I wanted to ask what you think: why does it fail with TF_CUDNN_WORKSPACE_LIMIT_IN_MB and not with TF_CUDNN_USE_FRONTEND? More generally, since each block should be processed sequentially by the GPU (with batch size=1), do we expect the memory usage to increase with the number of blocks?
I have already created the bug ticket, but, it is internal to NVIDIA. Sorry for that. I think I can update this thread when I get some feedback.
Thank you - updating it here should be fine.
Btw, can you please share what is the use case in high level? Is that a real model or just some benchmarking? Thanks.
Sure, it's a real clinical use case: a 3D U-Net for segmentation of a large organ from a CT scan given as input. The CT is too large to fit in GPU memory, so the input is first split into large blocks (each of shape 320^3 voxels), which are then fed to the model. Having a large block is important here, as we want to capture as much context as possible.
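For illustration only, an assumed sketch of that block-wise inference (non-overlapping blocks; a real pipeline would typically pad the volume and may use overlapping blocks):

```python
import numpy as np

BLOCK = 320  # block edge length in voxels, as described above

def split_into_blocks(ct_volume):
    """Split a (D, H, W) CT volume into non-overlapping 320^3 blocks."""
    d, h, w = ct_volume.shape
    blocks = []
    for z in range(0, d - BLOCK + 1, BLOCK):
        for y in range(0, h - BLOCK + 1, BLOCK):
            for x in range(0, w - BLOCK + 1, BLOCK):
                block = ct_volume[z:z + BLOCK, y:y + BLOCK, x:x + BLOCK]
                blocks.append(block[..., np.newaxis])  # add channel dimension
    return np.stack(blocks)  # shape: (n_blocks, 320, 320, 320, 1)

# blocks = split_into_blocks(ct)                      # ct: (D, H, W) float32 array
# predictions = model.predict(blocks, batch_size=1)   # one block at a time on the GPU
```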
Thanks for sharing the info.
More generally, since each block should be processed sequentially by the GPU (with batch size=1), do we expect the memory usage to increase with the #blocks?
Yes, more blocks mean you use more layers, and more weights will stay in GPU memory. As mentioned in https://github.com/keras-team/tf-keras/issues/125, we generally recommend that users switch to the frontend API.
Sorry if I didn't explain that well - by blocks I meant the number of inputs passed to model.predict() while keeping batch size = 1 (e.g., if I'm feeding a 10x320^3x1 input, I expect the model to process each 1x320^3x1 block 10 times, one at a time). This issue has also been observed in https://github.com/tensorflow/tensorflow/issues/40547.
Ah, I see. I only noticed the depth = 5 and thought you were talking about the network's building blocks.
Anyway, in this case, I think the size of the model's weights should be constant. I actually tried your Colab code, modifying it to:
test_input = np.ones(shape=(15, *input_shape), dtype=np.float32)
and running with CUDA_VISIBLE_DEVICES=4 TF_CUDNN_USE_FRONTEND=0 TF_CUDNN_WORKSPACE_LIMIT_IN_MB=5000 on a T4 16GB GPU. Despite one warning, the execution works fine:
1/15 [=>............................] - ETA: 1:392022-03-15 01:02:39.071383: W tensorflow/core/common_runtime/bfc_allocator.cc:275] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.47GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
15/15 [==============================] - 19s 822ms/step
I believe you didn't upgrade cuDNN to >= 8.3 before running the script. However, we do need to upgrade it so that inference works without OOM errors on both Tesla T4 (compute capability 7.5) and A10G (compute capability 8.6) GPUs. To upgrade cuDNN, I'm using the cuDNN download and install steps in https://github.com/keras-team/tf-keras/issues/125. If you first upgrade cuDNN and then run the script with #inputs=15 (as you did), you should be able to reproduce my findings.
A large 3D U-Net model configured with mixed precision fails with "No algorithm worked!" (see full a10g.log attached) when running inference on an NVIDIA A10G 20GB GPU (compute capability 8.6). Using the tensorflow/tensorflow:nightly-gpu Docker image, the error points to an out-of-memory issue (see full log a10g_tf_nightly.log attached).
I'm able to overcome the issue by using full precision instead, i.e. by setting mixed_precision.set_global_policy("float32").
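A minimal sketch of that switch (assuming the tf.keras mixed precision API):

```python
from tensorflow.keras import mixed_precision

# Workaround described above: run the model in full float32 instead of
# mixed_float16, avoiding the OOM at the cost of mixed precision's speed benefits.
mixed_precision.set_global_policy("float32")
```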
The same model configured with mixed precision works fine on the previous-generation Tesla T4 GPU (compute capability 7.5), which has even less GPU memory - 16GB (see full t4_tesla.log attached).

System information
- tensorflow:latest-gpu Docker image (sha256@fc5eb0604722c7bef7b499bb007b3050c4beec5859c2e0d4409d2cca5c14d442)
- nvidia-smi outputs for both GPU types provided in attachments.

Describe the expected behavior
Mixed precision mode should not exhaust all GPU memory on the newest generation of NVIDIA A10G.
Standalone code to reproduce the issue
Steps to reproduce:
1. Start instance with A10G GPU
2. Start interactive Docker container and pass test.py (copy from Colab)
3. Run script
4. Repeat steps using Tesla T4 (no error obtained)
Other info / logs
a10g.log, a10g_tf_nightly.log, t4_tesla.log, a10g_nvidia_smi.log, t4_tesla_nvidia_smi.log