tensorflow / models

Models and examples built with TensorFlow

CUDA_ERROR_LAUNCH_FAILED, CUDNN_STATUS_MAPPING_ERROR and CUDNN_STATUS_INTERNAL_ERROR in object detection #4227

Closed: ldalzovo closed this issue 6 years ago

ldalzovo commented 6 years ago

System information

I am running TensorFlow 1.7 using the NGC TensorFlow 18.04 (Python 3) container provided by NVIDIA, with nvidia-docker version 2, on Ubuntu 16.04. Other tests with the official Docker TensorFlow 1.5 and 1.7 GPU images fail too, and installing CUDA, cuDNN and TensorFlow manually without Docker does not solve the issue either. The latest driver version, 390.48, is installed.

I have two PCs with the same identical software configuration. The first, an AMD Ryzen 1800X in an ASUS PRIME-X370-PRO with CORSAIR CMK16GX4M2A2666C16 16GB RAM and a Gigabyte Aorus GTX 1080 Ti, works just fine. The second one, which shows the following issue, is an AMD Threadripper 1900X, GIGABYTE X399 AORUS GAMING 7, CORSAIR DOMINATOR CMD32GX4M4C3000C15 32GB RAM, 3 x ASUS Turbo GeForce GTX 1080 Ti, SSD SAMSUNG MZ-V6E1T0BW 960 EVO 1TB, PSU EVGA 1600W G2. All hardware is updated to the latest BIOS/drivers/recommended configuration.

When training with TensorFlow using the Object Detection API, with either a default or a customized config (batch_size of one or more, image sizes from 300x300 up to 900x900, and other settings), everything works properly on the first PC, but on the second I receive these errors quite randomly after a small number of steps, even when launching the training on just one GPU with the same exact command:

When I run the training using all three GPUs (adding the following parameters --num_clones=3 --ps_tasks=1), I receive these other errors:

All the tests were made after restarting the system, with no overclocking of the CPU or GPUs; I even tried downclocking the GPUs, and other uses of the GPUs seem to work without problems. Some other people seem to have similar problems here, but none of the suggested workarounds (rm -rf ~/.nv/ to clear the cache, and setting config.gpu_options.allow_growth = True so that only the needed memory is allocated) solved this problem.
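For reference, the allow_growth option goes into the session configuration; the following is a minimal TF 1.x sketch assuming a standalone script where the session is created directly (the Object Detection API's own train.py builds its session internally, so the exact place to set it there may differ):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving
# the whole card up front (hypothetical standalone training script).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # build and run the training graph here
    pass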

What could possibly be the root cause of this issue? The output is not helping me identify it...

robieta commented 6 years ago

@derekjchow Could you take a look?

robieta commented 6 years ago

Can you try changing the CUDA and cuDNN versions and see if that has any effect?

robieta commented 6 years ago

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

aartighatkesar commented 6 years ago

INFO:tensorflow:global step 10594: loss = 0.4910 (0.668 sec/step)
INFO:tensorflow:global step 10595: loss = 0.6275 (0.688 sec/step)
INFO:tensorflow:global step 10595: loss = 0.6275 (0.688 sec/step)
INFO:tensorflow:global step 10596: loss = 0.4151 (0.678 sec/step)
INFO:tensorflow:global step 10596: loss = 0.4151 (0.678 sec/step)
INFO:tensorflow:global step 10597: loss = 0.3782 (0.675 sec/step)
INFO:tensorflow:global step 10597: loss = 0.3782 (0.675 sec/step)
2018-08-21 21:12:27.568972: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2018-08-21 21:12:27.569040: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:206] Unexpected Event status: 1
2018-08-21 21:12:27.569043: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2496] failed to enqueue convolution on stream: CUDNN_STATUS_MAPPING_ERROR
2018-08-21 21:12:27.568972: E tensorflow/stream_executor/cuda/cuda_driver.cc:1097] could not wait stream on event: CUDA_ERROR_LAUNCH_FAILED

@robieta

I've encountered this issue when I try to train faster_rcnn_resnet101_coco on my dataset. I have checked the versions of CUDA and cuDNN, and they are compatible with the requirements specified in the Object Detection repo. Can you please let me know how to go about fixing this error, and what versions of CUDA/cuDNN you recommend?
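In case it helps, a quick way to sanity-check what the installed TensorFlow build reports, from Python (plain TF 1.x calls, not specific to the Object Detection API):

import tensorflow as tf

# Print the TensorFlow version, whether this build was compiled with CUDA,
# and the first GPU device TensorFlow can actually see (empty string if none).
print(tf.__version__)
print(tf.test.is_built_with_cuda())
print(tf.test.gpu_device_name())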

amin07 commented 5 years ago

I am also having the same problem with CUDA 9.0, cuDNN 7.1 and TensorFlow 1.10. Please help...

INFO:tensorflow:global step 4197: loss = 1.5479 (0.700 sec/step)
INFO:tensorflow:global step 4197: loss = 1.5479 (0.700 sec/step)
INFO:tensorflow:global step 4198: loss = 1.3282 (0.695 sec/step)
INFO:tensorflow:global step 4198: loss = 1.3282 (0.695 sec/step)
INFO:tensorflow:global step 4199: loss = 1.1196 (0.690 sec/step)
INFO:tensorflow:global step 4199: loss = 1.1196 (0.690 sec/step)
INFO:tensorflow:global step 4200: loss = 1.6031 (0.701 sec/step)
INFO:tensorflow:global step 4200: loss = 1.6031 (0.701 sec/step)
INFO:tensorflow:global step 4201: loss = 0.8924 (0.697 sec/step)
INFO:tensorflow:global step 4201: loss = 0.8924 (0.697 sec/step)
INFO:tensorflow:global step 4202: loss = 0.8030 (0.688 sec/step)
INFO:tensorflow:global step 4202: loss = 0.8030 (0.688 sec/step)
INFO:tensorflow:global step 4203: loss = 0.9386 (0.697 sec/step)
INFO:tensorflow:global step 4203: loss = 0.9386 (0.697 sec/step)
INFO:tensorflow:global step 4204: loss = 2.0923 (0.692 sec/step)
INFO:tensorflow:global step 4204: loss = 2.0923 (0.692 sec/step)
INFO:tensorflow:global step 4205: loss = 0.9616 (0.678 sec/step)
INFO:tensorflow:global step 4205: loss = 0.9616 (0.678 sec/step)
INFO:tensorflow:global step 4206: loss = 0.9810 (0.684 sec/step)
INFO:tensorflow:global step 4206: loss = 0.9810 (0.684 sec/step)
INFO:tensorflow:global step 4207: loss = 0.9181 (0.687 sec/step)
INFO:tensorflow:global step 4207: loss = 0.9181 (0.687 sec/step)
INFO:tensorflow:global step 4208: loss = 1.2980 (0.703 sec/step)
INFO:tensorflow:global step 4208: loss = 1.2980 (0.703 sec/step)
INFO:tensorflow:global step 4209: loss = 1.1381 (0.682 sec/step)
INFO:tensorflow:global step 4209: loss = 1.1381 (0.682 sec/step)
INFO:tensorflow:global step 4210: loss = 0.9842 (0.682 sec/step)
INFO:tensorflow:global step 4210: loss = 0.9842 (0.682 sec/step)
INFO:tensorflow:global step 4211: loss = 0.9015 (0.671 sec/step)
INFO:tensorflow:global step 4211: loss = 0.9015 (0.671 sec/step)
INFO:tensorflow:global step 4212: loss = 0.8821 (0.693 sec/step)
INFO:tensorflow:global step 4212: loss = 0.8821 (0.693 sec/step)
INFO:tensorflow:global step 4213: loss = 0.6155 (0.690 sec/step)
INFO:tensorflow:global step 4213: loss = 0.6155 (0.690 sec/step)
INFO:tensorflow:global step 4214: loss = 1.0448 (0.691 sec/step)
INFO:tensorflow:global step 4214: loss = 1.0448 (0.691 sec/step)
2018-09-25 20:21:18.428631: E tensorflow/stream_executor/cuda/cuda_driver.cc:1227] failed to enqueue async memcpy from host to device: CUDA_ERROR_LAUNCH_FAILED; GPU dst: 0x120c16ab00; host src: 0x7f31ac400000; size: 4194304=0x400000
2018-09-25 20:21:18.428699: F tensorflow/stream_executor/cuda/cuda_dnn.cc:211] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
2018-09-25 20:21:18.428750: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
Aborted (core dumped)