tensorflow / models

Models and examples built with TensorFlow

CUDA_ERROR_LAUNCH_FAILED, CUDNN_STATUS_MAPPING_ERROR and CUDNN_STATUS_INTERNAL_ERROR in object detection #4227

Closed: ldalzovo closed this issue 6 years ago

ldalzovo commented 6 years ago

System information

I am running TensorFlow 1.7 using the NGC TensorFlow 18.04 (Python 3) container provided by NVIDIA, with nvidia-docker version 2, on Ubuntu 16.04. Other tests with the official Docker TensorFlow 1.5 and 1.7 GPU images fail too, and installing CUDA, cuDNN and TensorFlow manually without Docker does not solve the issue either. The latest driver version, 390.48, is installed.

I have two PCs with the same identical software configuration. The first, an AMD Ryzen 1800X in an ASUS PRIME-X370-PRO with CORSAIR CMK16GX4M2A2666C16 16GB RAM and a Gigabyte Aorus GTX 1080 Ti, works just fine. The second one, which shows the following issue, is an AMD Threadripper 1900X, GIGABYTE X399 AORUS GAMING 7, CORSAIR DOMINATOR CMD32GX4M4C3000C15 32GB RAM, 3 x ASUS Turbo GeForce GTX 1080 Ti, SSD SAMSUNG MZ-V6E1T0BW 960 EVO 1TB, PSU EVGA 1600W G2. All hardware is updated to the latest BIOS/drivers/recommended configuration.

When training with TensorFlow using the Object Detection API, with either a default or a customized config (batch_size of one or more, image sizes from 300x300 up to 900x900, and other settings), everything works properly on the first PC, but on the second I receive these errors quite randomly after a small number of steps, even when launching the training on just one GPU with the same exact command:

When I run the training using all three GPUs (adding the following parameters --num_clones=3 --ps_tasks=1), I receive these other errors:

All the tests were made after restarting the system, with no overclocking of the CPU or GPUs; I even tried downclocking the GPUs, and other uses of the GPUs seem to work without problems. Some other people seem to have similar problems here, but none of the suggested workarounds (rm -rf ~/.nv/ to clear the cache, and setting config.gpu_options.allow_growth = True so that only the needed memory is allocated) solved this problem.
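For reference, the allow_growth option goes into the session configuration; the following is a minimal TF 1.x sketch assuming a standalone script where the session is created directly (the Object Detection API's own train.py builds its session internally, so the exact place to set it there may differ):

import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving
# the whole card up front (hypothetical standalone training script).
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    # build and run the training graph here
    pass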

What could possibly be the root cause of this issue? The output is not helping me identify it...

robieta commented 6 years ago

@derekjchow Could you take a look?

robieta commented 6 years ago

Can you try changing the CUDA and cuDNN versions and see if that has any effect?

robieta commented 6 years ago

Automatically closing due to lack of recent activity. Please update the issue when new information becomes available, and we will reopen the issue. Thanks!

aartighatkesar commented 6 years ago

INFO:tensorflow:global step 10594: loss = 0.4910 (0.668 sec/step)
INFO:tensorflow:global step 10595: loss = 0.6275 (0.688 sec/step)
INFO:tensorflow:global step 10595: loss = 0.6275 (0.688 sec/step)
INFO:tensorflow:global step 10596: loss = 0.4151 (0.678 sec/step)
INFO:tensorflow:global step 10596: loss = 0.4151 (0.678 sec/step)
INFO:tensorflow:global step 10597: loss = 0.3782 (0.675 sec/step)
INFO:tensorflow:global step 10597: loss = 0.3782 (0.675 sec/step)
2018-08-21 21:12:27.568972: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
2018-08-21 21:12:27.569040: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:206] Unexpected Event status: 1
2018-08-21 21:12:27.569043: E tensorflow/stream_executor/cuda/cuda_dnn.cc:2496] failed to enqueue convolution on stream: CUDNN_STATUS_MAPPING_ERROR
2018-08-21 21:12:27.568972: E tensorflow/stream_executor/cuda/cuda_driver.cc:1097] could not wait stream on event: CUDA_ERROR_LAUNCH_FAILED

@robieta

I've encountered this issue when I try to train faster_rcnn_resnet101_coco on my dataset. I have checked the versions of CUDA and cuDNN, and they are compatible with the requirements specified in the Object Detection repo. Can you please let me know how to go about fixing this error, and what versions of CUDA/cuDNN you recommend?
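In case it helps, a quick way to sanity-check what the installed TensorFlow build reports, from Python (plain TF 1.x calls, not specific to the Object Detection API):

import tensorflow as tf

# Print the TensorFlow version, whether this build was compiled with CUDA,
# and the first GPU device TensorFlow can actually see (empty string if none).
print(tf.__version__)
print(tf.test.is_built_with_cuda())
print(tf.test.gpu_device_name())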

amin07 commented 5 years ago

I am also having the same problem with CUDA 9.0, cuDNN 7.1 and TensorFlow 1.10. Please help...

INFO:tensorflow:global step 4197: loss = 1.5479 (0.700 sec/step)
INFO:tensorflow:global step 4197: loss = 1.5479 (0.700 sec/step)
INFO:tensorflow:global step 4198: loss = 1.3282 (0.695 sec/step)
INFO:tensorflow:global step 4198: loss = 1.3282 (0.695 sec/step)
INFO:tensorflow:global step 4199: loss = 1.1196 (0.690 sec/step)
INFO:tensorflow:global step 4199: loss = 1.1196 (0.690 sec/step)
INFO:tensorflow:global step 4200: loss = 1.6031 (0.701 sec/step)
INFO:tensorflow:global step 4200: loss = 1.6031 (0.701 sec/step)
INFO:tensorflow:global step 4201: loss = 0.8924 (0.697 sec/step)
INFO:tensorflow:global step 4201: loss = 0.8924 (0.697 sec/step)
INFO:tensorflow:global step 4202: loss = 0.8030 (0.688 sec/step)
INFO:tensorflow:global step 4202: loss = 0.8030 (0.688 sec/step)
INFO:tensorflow:global step 4203: loss = 0.9386 (0.697 sec/step)
INFO:tensorflow:global step 4203: loss = 0.9386 (0.697 sec/step)
INFO:tensorflow:global step 4204: loss = 2.0923 (0.692 sec/step)
INFO:tensorflow:global step 4204: loss = 2.0923 (0.692 sec/step)
INFO:tensorflow:global step 4205: loss = 0.9616 (0.678 sec/step)
INFO:tensorflow:global step 4205: loss = 0.9616 (0.678 sec/step)
INFO:tensorflow:global step 4206: loss = 0.9810 (0.684 sec/step)
INFO:tensorflow:global step 4206: loss = 0.9810 (0.684 sec/step)
INFO:tensorflow:global step 4207: loss = 0.9181 (0.687 sec/step)
INFO:tensorflow:global step 4207: loss = 0.9181 (0.687 sec/step)
INFO:tensorflow:global step 4208: loss = 1.2980 (0.703 sec/step)
INFO:tensorflow:global step 4208: loss = 1.2980 (0.703 sec/step)
INFO:tensorflow:global step 4209: loss = 1.1381 (0.682 sec/step)
INFO:tensorflow:global step 4209: loss = 1.1381 (0.682 sec/step)
INFO:tensorflow:global step 4210: loss = 0.9842 (0.682 sec/step)
INFO:tensorflow:global step 4210: loss = 0.9842 (0.682 sec/step)
INFO:tensorflow:global step 4211: loss = 0.9015 (0.671 sec/step)
INFO:tensorflow:global step 4211: loss = 0.9015 (0.671 sec/step)
INFO:tensorflow:global step 4212: loss = 0.8821 (0.693 sec/step)
INFO:tensorflow:global step 4212: loss = 0.8821 (0.693 sec/step)
INFO:tensorflow:global step 4213: loss = 0.6155 (0.690 sec/step)
INFO:tensorflow:global step 4213: loss = 0.6155 (0.690 sec/step)
INFO:tensorflow:global step 4214: loss = 1.0448 (0.691 sec/step)
INFO:tensorflow:global step 4214: loss = 1.0448 (0.691 sec/step)
2018-09-25 20:21:18.428631: E tensorflow/stream_executor/cuda/cuda_driver.cc:1227] failed to enqueue async memcpy from host to device: CUDA_ERROR_LAUNCH_FAILED; GPU dst: 0x120c16ab00; host src: 0x7f31ac400000; size: 4194304=0x400000
2018-09-25 20:21:18.428699: F tensorflow/stream_executor/cuda/cuda_dnn.cc:211] Check failed: status == CUDNN_STATUS_SUCCESS (7 vs. 0)Failed to set cuDNN stream.
2018-09-25 20:21:18.428750: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_LAUNCH_FAILED
Aborted (core dumped)