tensorflow / tensorflow

An Open Source Machine Learning Framework for Everyone
https://tensorflow.org
Apache License 2.0

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. #25160

Closed. Bahramudin closed this issue 5 years ago.

Bahramudin commented 5 years ago

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Describe the current behavior: I installed TF using pip and verified that it detects the GPU, but when training starts it throws the error below:

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node FirstStageFeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d (defined at C:\Users\bahra\Anaconda3\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py:2777) = Conv2D[T=DT_FLOAT, data_format="NCHW", dilations=[1, 1, 1, 1], padding="VALID", strides=[1, 1, 1, 1], use_cudnn_on_gpu=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](FirstStageFeatureExtractor/InceptionV2/InceptionV2/Conv2d_1a_7x7/separable_conv2d/depthwise, FirstStageFeatureExtractor/InceptionV2/Conv2d_1a_7x7/pointwise_weights/read/_165)]] [[{{node BatchMultiClassNonMaxSuppression/map/while/Exit_6/_76}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_1252_BatchMultiClassNonMaxSuppression/map/while/Exit_6", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Note: I have tried TF 1.12, 1.11, and 1.8.0; all have the same problem. Why is it throwing this error, and how can I solve it?

Before this error I was able to train successfully, but when I started training a second time this error happened.

ymodak commented 5 years ago

Duplicate of #24828. Closing this issue so that we can focus on one thread. Thanks!

Bahramudin commented 5 years ago

@ymodak the possible solution I found can be found here.

ymodak commented 5 years ago

I read your solution on the thread. Thanks a lot for sharing it; I will keep a note of it moving forward.

Tyrones1995 commented 5 years ago

Anaconda Python 3.7, CUDA 10.0.130, cuDNN 7.5.1. The same error:

UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node Conv2D_12 (defined at :174) ]]

Tyrones1995 commented 5 years ago

Caused by op 'Conv2D_12', defined at: File "/home/wtl/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main "main", mod_spec) File "/home/wtl/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code exec(code, run_globals) File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py", line 16, in app.launch_new_instance() File "/home/wtl/anaconda3/lib/python3.7/site-packages/traitlets/config/application.py", line 658, in launch_instance app.start() File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel/kernelapp.py", line 505, in start self.io_loop.start() File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/platform/asyncio.py", line 132, in start self.asyncio_loop.run_forever() File "/home/wtl/anaconda3/lib/python3.7/asyncio/base_events.py", line 528, in run_forever self._run_once() File "/home/wtl/anaconda3/lib/python3.7/asyncio/base_events.py", line 1764, in _run_once handle._run() File "/home/wtl/anaconda3/lib/python3.7/asyncio/events.py", line 88, in _run self._context.run(self._callback, self._args) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback ret = callback() File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper return fn(args, kwargs) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1233, in inner self.run() File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run yielded = self.gen.send(value) File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 357, in process_one yield gen.maybe_future(dispatch(args)) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper yielded = next(result) File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 267, in dispatch_shell yield gen.maybe_future(handler(stream, idents, msg)) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper yielded = next(result) File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 534, in execute_request user_expressions, allow_stdin, File "/home/wtl/anaconda3/lib/python3.7/site-packages/tornado/gen.py", line 326, in wrapper yielded = next(result) File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel/ipkernel.py", line 294, in do_execute res = shell.run_cell(code, store_history=store_history, silent=silent) File "/home/wtl/anaconda3/lib/python3.7/site-packages/ipykernel/zmqshell.py", line 536, in run_cell return super(ZMQInteractiveShell, self).run_cell(args, kwargs) File "/home/wtl/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2819, in run_cell raw_cell, store_history, silent, shell_futures) File "/home/wtl/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 2845, in _run_cell return runner(coro) File "/home/wtl/anaconda3/lib/python3.7/site-packages/IPython/core/async_helpers.py", line 67, in _pseudo_sync_runner coro.send(None) File "/home/wtl/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3020, in run_cell_async interactivity=interactivity, compiler=compiler, result=result) File "/home/wtl/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3185, in run_ast_nodes if (yield from self.run_code(code, result)): File "/home/wtl/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3267, in run_code 
exec(code_obj, self.user_global_ns, self.user_ns) File "", line 318, in tf.app.run(main=main, argv=[sys.argv[0]] + unparsed) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run _sys.exit(main(argv)) File "", line 208, in main logits = model(train_data_node, True) File "", line 174, in model padding='SAME') File "/home/wtl/anaconda3/lib/python3.7/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 1026, in conv2d data_format=data_format, dilations=dilations, name=name) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper op_def=op_def) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func return func(*args, **kwargs) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op op_def=op_def) File "/home/wtl/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in init self._traceback = tf_stack.extract_stack()

UnknownError (see above for traceback): Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above. [[node Conv2D_12 (defined at :174) ]]

leimao commented 5 years ago

I am using the TF container and this also happened. Before this error I was able to train successfully, but when I started training a second time this error happened. What is the problem here?

Fotomaterjal commented 4 years ago

Using cuDNN 7.4.2 and CUDA 10.0. Both should be usable by TensorFlow 2.0 (https://www.tensorflow.org/install/source#tested_build_configurations), but I'm running into the same issue as described above.

Fabian-Sc85 commented 4 years ago

I was able to resolve this by installing an updated version of the libcudnn library (7.6.5; I am using CUDA 10.0 on Ubuntu 16.04) from the NVIDIA developer page.

Fotomaterjal commented 4 years ago

I managed to get it running with cuDNN 7.6.3 / CUDA 10.0 on CentOS Linux 7.6.1810.

viditvarshney commented 4 years ago

I got the same error. The reason for this error is a mismatch between the CUDA/cuDNN versions and your TensorFlow version (a quick way to check the versions is sketched after this list). There are two ways to solve this:

  1. Either downgrade your TensorFlow version: pip install --upgrade tensorflow-gpu==1.8.0

  2. Or you can follow the steps at Here.

    tip: Choose your Ubuntu version and follow the steps. :-)
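For anyone unsure which versions they actually have, a minimal check, assuming a recent TF 2.x build where tf.sysconfig.get_build_info is available, looks like this:

import tensorflow as tf

# Show the CUDA/cuDNN versions this TensorFlow build was compiled against,
# plus the GPUs it can currently see.
build = tf.sysconfig.get_build_info()
print("TF version:     ", tf.__version__)
print("Built for CUDA: ", build.get("cuda_version"))
print("Built for cuDNN:", build.get("cudnn_version"))
print("Visible GPUs:   ", tf.config.list_physical_devices("GPU"))

The printed versions can then be compared against the tested build configurations table linked earlier in the thread.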

dsivakumar commented 4 years ago

Workaround: I did a fresh install of TF 2.0 and ran a simple MNIST tutorial; it worked fine. I then opened another notebook, tried to run it, and encountered this issue. I exited all notebooks, restarted Jupyter, opened only one notebook, and ran it successfully. The issue seems to be either memory or running more than one notebook on the GPU.
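One quick way to check whether another process (for example, a second notebook kernel) is still holding GPU memory, assuming nvidia-smi is on the PATH and Python 3.7+, is a small query like this (a sketch, not part of the original workaround):

import subprocess

# Ask nvidia-smi for per-GPU used/total memory; a large "used" value while no
# training job is running usually means another process is holding GPU memory.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv"],
    capture_output=True, text=True)
print(result.stdout)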

Thanks

bryanbocao commented 4 years ago

@roebel https://github.com/tensorflow/tensorflow/issues/24496#issuecomment-630420518 @kabylan https://github.com/tensorflow/tensorflow/issues/24496#issuecomment-641978924

I think it's related to GPU memory as well. I was trying to run https://keras.io/examples/rl/deep_q_network_breakout/ before 2020/06/17 in the following environment:

GPU: GeForce RTX 2060 Super
OS/Driver/Lib Version:
Ubuntu 18.04.4 LTS
GPU Driver 450.36.06
CUDA 11.0
TensorFlow 2.2.0
Keras 2.3.1

Before running the code, the GPU status is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  On   | 00000000:01:00.0 Off |                  N/A |
| 41%   33C    P8     8W / 175W |      1MiB /  7979MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

After running the code, I got the error:

UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
     [[node model_1/conv2d/Conv2D (defined at <ipython-input-4-aa1698769333>:87) ]] [Op:__inference_predict_function_229]

Function call stack:
predict_function

And the GPU memory is almost full:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.36.06    Driver Version: 450.36.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  On   | 00000000:01:00.0 Off |                  N/A |
| 41%   33C    P2    40W / 175W |   7902MiB /  7979MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      6753      C   ...Apps/anaconda3/bin/python     7899MiB |
+-----------------------------------------------------------------------------+

I tried enabling memory growth:

for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

but still had the same issue.


Although the code never raised GPU/CUDA memory errors, I observed that the GPU memory was almost full (7902MiB/7979MiB). One way to solve this issue in my case was to limit the GPU memory usage manually:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to allocate only 2 GB (2 * 1024 MB) of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024 * 2)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Check how to limit GPU memory: https://www.tensorflow.org/guide/gpu

This is probably because cuDNN tried to initialize but there wasn't enough GPU memory left. Setting a hard GPU memory limit explicitly tells TensorFlow the extent to which it can use GPU memory for GPU-resident variables, and the TensorFlow backend then works out how much memory to allocate, and when, for those variables. This argument still needs further verification, though.
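As a quick sanity check (a hypothetical snippet, not from the original report), running one small convolution right after configuring the device forces cuDNN to initialize; if the memory limit above leaves enough free memory, it should complete without the UnknownError:

import numpy as np
import tensorflow as tf

# A single tiny Conv2D call is enough to trigger cuDNN initialization.
x = np.random.rand(1, 32, 32, 3).astype("float32")
layer = tf.keras.layers.Conv2D(filters=4, kernel_size=3, padding="same")
print(layer(x).shape)  # expected: (1, 32, 32, 4)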


Note that the code at https://keras.io/examples/rl/deep_q_network_breakout/ was updated on 2020/06/17 and the memory-filling issue no longer exists.

SandraMnz commented 4 years ago

@BryanBo-Cao I had the same error with:

GPU: GeForce RTX 2060
OS/Driver/Lib Version:
Ubuntu 18.04.4 LTS
GPU Driver 435.21
CUDA 10.1
TensorFlow 2.2.0
Keras 2.3.0

And your solution of limiting the GPU usage is the only one that has worked for me.

swordspoet commented 3 years ago

@BryanBo-Cao I had the same error with:

GPU: GeForce RTX 2060

OS/Driver/Lib Version: Ubuntu 18.04.4 LTS, GPU Driver 435.21, CUDA 10.1, TensorFlow 2.2.0, Keras 2.3.0. And your solution of limiting the GPU usage is the only one that has worked for me.

If this doesn't fix your problem, then upgrade your cuDNN; that worked for me.

ebrarsahin commented 3 years ago

SOLUTION: Go to train.py and add this code:

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

This way worked for me.

helonayala commented 3 years ago


Thanks mate, the solution above works here too.

HaoranCheng commented 4 months ago

The same problem with tensorflow-gpu 2.0.0, CUDA 10.0 and cuDNN 7.4. Finally solved it by updating to tensorflow-gpu 2.1.0 + CUDA 10.1 + cuDNN 7.6.5, since Python 3.7 is what I need.