Great work @lissyx! Will see if I can find some time this weekend to see if I can get that stuff to work.
Was 7.4.2.24 the last of the cudnn 7.4 versions? That would suggest it was introduced somewhere in the 7.5 series.
It could still be in a lot of places, who knows exactly. But even if it's inconvenient, it might at least help unblock things.
@lissyx
My results are different; logs are attached:
So maybe it was just luck, or we lack another parameter. Please note we don't have the same GPUs, and so not the same amount of memory. Maybe that is why.
What we haven't tested is TF 1.14 builds with those cudnn versions.
I can reproduce the issue here on 7.4 as well by limiting visible GPU to only one (number 0 or 1). When I expose both, it works.
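For reference, a minimal sketch of how the visible GPUs can be restricted for such a run (CUDA_VISIBLE_DEVICES is the standard CUDA mechanism; the rest is illustrative, not my exact command):

# Minimal sketch: restrict which GPUs TensorFlow can see, before it is imported.
# CUDA_VISIBLE_DEVICES is the standard CUDA mechanism; everything else here is
# illustrative and not the exact repro command.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # expose only GPU 0 ("0,1" exposes both)

import tensorflow as tf                    # must be imported after setting the variable
print(tf.test.gpu_device_name())           # should report a single /device:GPU:0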
@applied-machinelearning I found that hack to help locally, after getting more repro:
tf-docker ~/ds > git diff
diff --git a/training/deepspeech_training/util/feeding.py b/training/deepspeech_training/util/feeding.py
index 4c9b681d..4cddca22 100644
--- a/training/deepspeech_training/util/feeding.py
+++ b/training/deepspeech_training/util/feeding.py
@@ -48,7 +48,7 @@ def audio_to_features(audio, sample_rate, transcript=None, clock=0.0, train_phas
if train_phase and augmentations is not None:
features = apply_graph_augmentations('features', features, augmentations, transcript=transcript, clock=clock)
- return features, tf.shape(input=features)[0]
+ return features, tf.shape(input=features)[0] - 1
def audiofile_to_features(wav_filename, clock=0.0, train_phase=False, augmentations=None):
I can't explain why yet, and I'd like your feedback on whether it helps in all of your repro cases or not.
I can reproduce the issue here on 7.4 as well by limiting visible GPU to only one (number 0 or 1). When I expose both, it works.
Don't know the inner workings of multi-GPU training, but if it is interleaving the batches, then with the very small test set your second GPU could get batch B as that GPU's first step, so the special case could apply there. I wonder if it still works with multi-GPU if you repeat some of the other batches before batch B, so it is never the first step of a GPU.
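Roughly the scenario I have in mind, as a toy sketch (just my mental model of the interleaving, not DeepSpeech's actual feeding code):

# Toy model of batches being dealt out round-robin across GPUs, just to show
# why batch B can end up being a GPU's *first* step when two GPUs are exposed.
batches = ["A", "B", "C", "D"]
num_gpus = 2

for step, batch in enumerate(batches):
    gpu = step % num_gpus
    first = " (first step for this GPU)" if step < num_gpus else ""
    print(f"GPU {gpu} gets batch {batch}{first}")
# With 2 GPUs, batch B lands on GPU 1 as its first step; with 1 GPU it is that GPU's second step.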
BTW, these last few days I have trained all my datasets on the image based on tensorflow/tensorflow:1.14.0-gpu-py3 and I haven't had a problem. The only issue is not being able to get "convert_graphdef_memmapped_format" via taskcluster, since that file is gone from the mozilla infrastructure for the 1.14 branch.
@applied-machinelearning I found that hack to help locally, after getting more repro: ... I can't explain why yet, and I'd like your feedback on whether it helps in all of your repro cases or not.
I will give that a shot this evening. :)
hacking the stride value also seems to do something (obviously, I have no idea why):
diff --git a/training/deepspeech_training/util/feeding.py b/training/deepspeech_training/util/feeding.py
index 4c9b681d..ae50e4f9 100644
--- a/training/deepspeech_training/util/feeding.py
+++ b/training/deepspeech_training/util/feeding.py
@@ -33,7 +33,7 @@ def audio_to_features(audio, sample_rate, transcript=None, clock=0.0, train_phas
spectrogram = contrib_audio.audio_spectrogram(audio,
window_size=Config.audio_window_samples,
- stride=Config.audio_step_samples,
+ stride=Config.audio_step_samples+1,
magnitude_squared=True)
if train_phase and augmentations is not None:
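If it helps to reason about why +1 on the stride knocks one frame off: assuming the usual spectrogram framing formula (frames = 1 + floor((samples - window) / stride), which is my assumption about contrib_audio.audio_spectrogram, not something I checked in its source), a toy back-of-the-envelope check looks like this (the sample count is made up, only the formula matters):

# Rough sketch of how the number of spectrogram frames (i.e. features_len /
# max_sequence_length) falls out of window size and stride.
def n_frames(n_samples, window_size, stride):
    if n_samples < window_size:
        return 0
    return 1 + (n_samples - window_size) // stride

sample_rate = 8000
window = int(0.032 * sample_rate)   # 32 ms window -> 256 samples (illustrative)
stride = int(0.020 * sample_rate)   # 20 ms step   -> 160 samples (illustrative)

n_samples = 12096                   # made-up length that yields 75 frames
print(n_frames(n_samples, window, stride))      # 75
print(n_frames(n_samples, window, stride + 1))  # 74 with the stride+1 hack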
The only issue is not being able to get "convert_graphdef_memmapped_format" via taskcluster, since that file is gone from the mozilla infrastructure for the 1.14 branch.
You can just rebuild it; it's a bit time-consuming but not complicated.
Don't know the inner workings of multi-GPU training, but if it is interleaving the batches, then with the very small test set your second GPU could get batch B as that GPU's first step, so the special case could apply there. I wonder if it still works with multi-GPU if you repeat some of the other batches before batch B, so it is never the first step of a GPU.
Yeah; but we still don't know what that special case is here.
FTR the offending call is at https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1785-L1798
And that's directly within libcudnn7 :/
Here also, a report that downgrading the driver to v431.36 fixes a very similar error: https://stackoverflow.com/questions/62612226/tensorflow-check-failed-status-cudnn-status-success-7-vs-0failed-to-set-c
@applied-machinelearning I found that hack to help locally, after getting more repro: ... I can't explain why yet, and I'd like your feedback on whether it helps in all of your repro cases or not.
So this one works for me as well.
I also printed the original shape; for batch B it is 75, which seems to match the max_sequence_length of 75 in the exception from cudnn when things do crash.
FTR the offending call is at https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1785-L1798
And that's directly within libcudnn7 :/
Yeah, that was likely, and unfortunately there is no such thing as a neat error message.
Here also, a report that downgrading the driver to v431.36 fixes a very similar error: https://stackoverflow.com/questions/62612226/tensorflow-check-failed-status-cudnn-status-success-7-vs-0failed-to-set-c
Hmmm, dusted off my Google-fu, but still could not find a Linux download of v431.36. However, the release date (for Windows at least) of v431.36 seems to be 07-09-2019. What I tested was 430.64, which is lower in version number but later in release date: November 5, 2019.
So tomorrow I will see if I can test with 430.40, which has release date July 29, 2019, so both metrics are lower.
I also printed the original shape; for batch B it is 75, which seems to match the max_sequence_length of 75 in the exception from cudnn when things do crash.
I thought the same, but when hacking and forcing +1 on features_len, the crash would happen on value 76, and previous values would become 75 without problem, it seems (also the error changed).
OK, so I have tried extra drivers released before the infamous "v431.36": 410.93, 418.74, 418.88, 430.34. None of them works for me.
I'm trying, but so far failing, to build a tf 1.15 pip package with some debug enabled, outside of the docker setup they have, so I can at least get more insight into the offending call.
Hmm, I finally figured out the probable cudnn version of the tensorflow/tensorflow:1.14.0-gpu-py3 image. According to https://hub.docker.com/layers/tensorflow/tensorflow/1.14.0-gpu-py3/images/sha256-e72e66b3dcb9c9e8f4e5703965ae1466b23fe8cad59e1c92c6e9fa58f8d81dc8?context=explore it should be CUDA 10.0.130-1 with CUDNN 7.4.1.5-1. The lowest cudnn we checked with the images you built was issue3088:7.4.2.24; could it be worthwhile to also check a build with 7.4.1? I don't see anything very obviously related in the cudnn release notes on https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_742 though.
The lowest cudnn we checked with the images you built was issue3088:7.4.2.24; could it be worthwhile to also check a build with 7.4.1?
Pretty sure I don't even need to rebuild, I'll check that later.
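In the meantime, a quick way to double-check which libcudnn actually gets loaded at runtime, without rebuilding anything (a sketch; cudnnGetVersion() is part of the public cuDNN API and encodes the version as major*1000 + minor*100 + patchlevel, so 7401 is 7.4.1):

# Query the cuDNN runtime version actually picked up, without a rebuild.
import ctypes

libcudnn = ctypes.CDLL("libcudnn.so.7")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
version = libcudnn.cudnnGetVersion()
print(version)                                        # e.g. 7401 for 7.4.1, 7605 for 7.6.5
print(version // 1000, (version % 1000) // 100, version % 100)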
hacking the stride value also seems to do something (obviously, I have no idea why): ...
I tried this patch now, and that works for the small sets. The max_sequence_length for batch B has turned from 75 into 74 now.
But if I run the larger test set (train_differ_para_sorted_wav_filesize.log), it still blows up, now on files that end up having a max_sequence_length of 75 ...
train_debug_As_Bs_Cs.log train_debug_mini_As_Bs_Cs.log train_differ_para_sorted_wav_filesize.log
The lowest cudnn we checked with the images you built was issue3088:7.4.2.24; could it be worthwhile to also check a build with 7.4.1?
Downgraded to 7.4.1.5:
tf-docker ~ > apt-cache policy libcudnn7
libcudnn7:
Installed: 7.4.1.5-1+cuda10.0
Candidate: 7.6.5.32-1+cuda10.2
Version table:
7.6.5.32-1+cuda10.2 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.5.32-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.5.32-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.4.38-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.4.38-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.3.30-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.3.30-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.2.24-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.2.24-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.1.34-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.1.34-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.0.64-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.6.0.64-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.5.1.10-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.5.1.10-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.5.0.56-1+cuda10.1 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.5.0.56-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.4.2.24-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
*** 7.4.1.5-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
100 /var/lib/dpkg/status
7.3.1.20-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
7.3.0.29-1+cuda10.0 500
500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64 Packages
Still blows up.
After a lot of hacking, I've been able to rebuild locally outside of their docker (easier for playing with gdb), building and running against a pyenv-built python, and that build reproduces the issue, so I'm preparing a debug build.
Debug build with CUDA is ... challenging. Trying this as suggested: https://github.com/tensorflow/tensorflow/issues/28091#issuecomment-488327539
The road to a debug build is ... complicated.
[12,032 / 15,573] 305 actions, 128 running
Compiling tensorflow/core/kernels/cwise_op_gpu_bitwise_and.cu.cc [for host]; 106s local
Compiling tensorflow/core/kernels/cwise_op_gpu_bitwise_or.cu.cc [for host]; 106s local
Compiling tensorflow/core/kernels/cwise_op_gpu_bitwise_xor.cu.cc [for host]; 106s local
Compiling tensorflow/core/kernels/cwise_op_gpu_add.cu.cc [for host]; 106s local
Compiling tensorflow/core/kernels/cwise_op_gpu_div.cu.cc [for host]; 104s local
Compiling tensorflow/core/kernels/cwise_op_gpu_equal_to.cu.cc [for host]; 102s local
Compiling tensorflow/core/kernels/cwise_op_gpu_left_shift.cu.cc [for host]; 101s local
Compiling tensorflow/core/kernels/cwise_op_gpu_floor_div.cu.cc [for host]; 99s local ...
Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out')
alexandre@serveur:~/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx$ aaa^C
alexandre@serveur:~/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx$ ll /home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out
-rw-r--r-- 1 alexandre alexandre 822 17 juil. 18:41 /home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out
alexandre@serveur:~/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx$ cat /home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGBUS (0x7) at pc=0x00007fcea090109e, pid=1171461, tid=1171475
#
# JRE version: OpenJDK Runtime Environment (Zulu11.29+3-CA) (11.0.2+7) (build 11.0.2+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (11.0.2+7-LTS, mixed mode, tiered, compressed oops, parallel gc, linux-amd64)
# Problematic frame:
# V [libjvm.so+0xc5309e] PerfLongVariant::sample()+0x1e
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx/hs_err_pid1171461.log
#
# If you would like to submit a bug report, please visit:
# http://www.azulsystems.com/support/
#
Ugh the nightmare of a build-system called "Bazel".
I guess in this case it's just that I was running out of space on / because of docker not properly pruning some resources.
Side effect: I have to rebuild all my docker images / containers ...
Ah yes, you have to be careful with pruning, since every change from a buildfile is its own image layer due to the caching stuff. Works nicely for saving space, but if you want to delete old stuff it can be a nightmare. I try to get accustomed to dumping the images that I care about as a tar-file with everything included first, so I can restore that stuff if need be.
Indeed, I have been doing my house-keeping, but it seems to not have completely cleaned up some things :/. Anyway, I now have something that should have more debug info:
alexandre@serveur:~/tmp/issue3088$ ll wheel_dst/tensorflow_gpu_local-1.15.0-cp37-cp37m-linux_x86_64.whl
-rw-r--r-- 1 alexandre alexandre 1,9G 20 juil. 14:26 wheel_dst/tensorflow_gpu_local-1.15.0-cp37-cp37m-linux_x86_64.whl
And at least I repro with this build as well.
Nothing obvious pops:
[Switching to Thread 0x7ff74ffff700 (LWP 209659)]
Thread 526 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff71a00a5f0, rnnDesc=0x7ff748025900, xDesc=0x7ff748021870, x=0x7ff48da4dd00, hxDesc=0x7ff748017210, hx=0x7ff48b4d4300, cxDesc=0x7ff748006990, cx=0x7ff48b4d4300, wDesc=0x7ff748023f50, w=0x7ff492002900, yDesc=0x7ff74801f540, y=0x7ff48dce7d00, hyDesc=0x7ff748017210, hy=0x7ff48de0fd00, cyDesc=0x7ff748006990, cy=0x7ff48de13d00,
kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff4aa032900, workSpaceSizeInBytes=136609792, reserveSpace=0x7ff48de17d00, reserveSpaceSizeInBytes=6062080) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307
2307 ./tensorflow/stream_executor/cuda/cudnn_7_6.inc: No such file or directory.
(gdb) cont
Continuing.
[Thread 0x7ffeb4874700 (LWP 209805) exited]
Thread 526 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7
(gdb) cont
Continuing.
[Detaching after fork from child process 209995]
[Switching to Thread 0x7ff82d7fa700 (LWP 209615)]
Thread 482 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff8100081e0, rnnDesc=0x7ff1233fab70, xDesc=0x7ff1230fc010, x=0x7ff1a9a68800, hxDesc=0x7ff1233faa70, hx=0x7ff1a9d0b800, cxDesc=0x7ff123344d30, cx=0x7ff1a9d0b800, wDesc=0x7ff1230fbfd0, w=0x7ff1a150d800, yDesc=0x7ff1230fc050, y=0x7ff1a9d0f800, hyDesc=0x7ff1233faa70, hy=0x7ff1a9e3b800, cyDesc=0x7ff123344d30, cy=0x7ff1a9e3f800,
kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff1a9e43800, workSpaceSizeInBytes=139886624, reserveSpace=0x7ff1b23ab900, reserveSpaceSizeInBytes=6144000) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307
2307 in ./tensorflow/stream_executor/cuda/cudnn_7_6.inc
(gdb)
Continuing.
Thread 482 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7
(gdb)
Continuing.
Epoch 0 | Training | Elapsed Time: 0:01:06 | Steps: 1 | Loss: 190.842316
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
[...]
[Switching to Thread 0x7ff74effd700 (LWP 209661)]
Thread 528 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff71a00a5f0, rnnDesc=0x7ff748025900, xDesc=0x7ff74402f660, x=0x7ff48da5fc00, hxDesc=0x7ff7440293a0, hx=0x7ff48dd02c00, cxDesc=0x7ff744029310, cx=0x7ff48dd02c00, wDesc=0x7ff748023f50, w=0x7ff492002900, yDesc=0x7ff74402bfc0, y=0x7ff48dd06c00, hyDesc=0x7ff7440293a0, hy=0x7ff48de32c00, cyDesc=0x7ff744029310, cy=0x7ff48de36c00,
kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff4aa032900, workSpaceSizeInBytes=136609792, reserveSpace=0x7ff48de3ac00, reserveSpaceSizeInBytes=6144000) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307
2307 in ./tensorflow/stream_executor/cuda/cudnn_7_6.inc
(gdb)
Continuing.
[Switching to Thread 0x7ff72effd700 (LWP 209668)]
Thread 535 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff8100081e0, rnnDesc=0x7ff1233fab70, xDesc=0x7ff4290342e0, x=0x7ff1a9a56900, hxDesc=0x7ff168009dc0, hx=0x7ff1a9cf0900, cxDesc=0x7ff429001c60, cx=0x7ff1a9cf0900, wDesc=0x7ff1230fbfd0, w=0x7ff1a150d500, yDesc=0x7ff429006c40, y=0x7ff1a9cf4900, hyDesc=0x7ff168009dc0, hy=0x7ff1a9e1c900, cyDesc=0x7ff429001c60, cy=0x7ff1a9e20900,
kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff1a9e24900, workSpaceSizeInBytes=139821088, reserveSpace=0x7ff1b237ca00, reserveSpaceSizeInBytes=6062080) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307
2307 in ./tensorflow/stream_executor/cuda/cudnn_7_6.inc
(gdb)
Continuing.
[Switching to Thread 0x7ff74effd700 (LWP 209661)]
Thread 528 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7
(gdb)
Continuing.
[Switching to Thread 0x7ff72effd700 (LWP 209668)]
Thread 535 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7
(gdb)
Continuing.
2020-07-20 14:43:04.648417: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-20 14:43:04.648554: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
Bummer. If I read https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_713, it seems there have been LSTM-related issues before, hanging on specific sizes, in this case of the hidden state. But that was already fixed in all the cudnn versions we tested. I still can't wrap my head around why the TF 1.14 image seems to behave differently; you kind of ruled out the cudnn version. There have also been some changes to TF contrib/cudnn_rnn between v1.14 and v1.15, but my limited insight couldn't spot anything very amiss: https://github.com/tensorflow/tensorflow/commits/r1.15/tensorflow/contrib/cudnn_rnn
There have also been some changes to TF contrib/cudnn_rnn between v1.14 and v1.15, but my limited insight couldn't spot anything very amiss:
I can always try and git bisect that ...
First would be to check whether a custom-built TF 1.14 doesn't have the problem (with the 7.4.1.5 cudnn and/or the newest). If so, it would point to a change in TF; if not ... nah, don't think about that yet ..
yeah that's what I'm doing ...
Ok, passes with 1.14.1 + CUDNN 7.6 built locally. But a few patches are required; this is going to make git bisect slower than I would have loved.
3c6e3868ac14fdbcaa24ddfb05624a0b55f60263 is the first bad commit
commit 3c6e3868ac14fdbcaa24ddfb05624a0b55f60263
Author: Ayush Dubey <ayushd@google.com>
Date: Wed Aug 14 13:19:26 2019 -0700
Ensure that an error is returned if a collective op runs with int32 on GPU.
This change fixes a bug that would overwrite the error status with an OK status
and cause a hang downstream. It also adds a test that covers this scenario.
PiperOrigin-RevId: 263414497
.../common_runtime/base_collective_executor.cc | 15 +++++++-------
tensorflow/python/ops/collective_ops_gpu_test.py | 23 ++++++++++++++++++++++
2 files changed, 30 insertions(+), 8 deletions(-)
That seems like a weird bad commit, I'll verify that tomorrow ...
And yet, with r1.15 and this commit reverted, no more issue. So, is this commit buggy, or is it exposing a long-standing issue? On our side, in tensorflow, or in CUDNN?
Hmm, weird, and a small sigh; I hoped it would have delivered a clearer and more pinpointed problem ... Any idea where this op would be used in the context of DeepSpeech and the max_sequence_length array and/or the hidden state? Perhaps it would be wise to try to get some help from TF people / Nvidia based on this? We do have a commit and some docker test cases with data that triggers the issue.
Hmm, weird, and a small sigh; I hoped it would have delivered a clearer and more pinpointed problem ...
I would have hoped as well
Any idea where this op would be used in the context of DeepSpeech and the max_sequence_length array and/or the hidden state?
Absolutely none. But it's interesting, because in the past we had to hack a thing: https://github.com/tensorflow/tensorflow/issues/20369. It might be a long shot, but DT_INT32 + GPU also appears here.
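To be clear, that's just a hunch. A generic TF1-style experiment for the DT_INT32 + GPU angle would be to pin the int32 length tensor explicitly on the host (a toy sketch, not the actual DeepSpeech code path and not a known fix):

# Toy TF1 sketch: keep an int32 sequence-length tensor explicitly on the CPU.
# The constant below just stands in for whatever feeds CudnnRNNV3's sequence_lengths.
import tensorflow as tf   # assuming a TF 1.x graph-mode environment, as in this thread

lengths = tf.constant([75, 63], dtype=tf.int32)   # made-up per-example lengths
with tf.device('/cpu:0'):
    lengths_on_host = tf.identity(lengths)        # force host placement of the int32 tensor

with tf.Session() as sess:
    print(sess.run(lengths_on_host))              # [75 63]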
Perhaps it would be wise to try to get some help from TF people / Nvidia based on this?
That's the next step, yeah. I'd like to limit the repro steps as much as possible and sum them up. I still have not been able to get a clear understanding of the triggering condition, though, because with previous hacking changing the feature length value from the offending 75 to another value, I could get valid passes with 75. So it's not really crystal clear to me that the issue is this specific value, and I need to better qualify what is happening here.
And yet, with r1.15 and this commit reverted, no more issue. So, is this commit buggy, or is it exposing a long-standing issue? On our side, in tensorflow, or in CUDNN?
Bad news: it seems the issue is somehow intermittent, and after a few retries with this reverted, it's back and still here ...
I will restart the bisection then, and run it multiple times before calling good / bad ...
That's the next step, yeah. I'd like to limit the repro steps as much as possible and sum them up. I still have not been able to get a clear understanding of the triggering condition, though, because with previous hacking changing the feature length value from the offending 75 to another value, I could get valid passes with 75. So it's not really crystal clear to me that the issue is this specific value, and I need to better qualify what is happening here.
I agree, because batch C also gives 75 and that also passes.
What I am also wondering about is how it could work by artificially limiting the max_sequence_length, since the data that we feed isn't itself changed. (So I would have expected it to blow up, because the sequences now seem longer than the max_sequence_length. Or does it just not process the last bit of (padded or non-padded) data, in which the culprit lies?)
Found some discussions around this whole padding topic with @Reuben posting there: https://github.com/tensorflow/tensorflow/issues/23269 https://github.com/mozilla/DeepSpeech/issues/885
A commit in TF 1.15-rc0 also seemed more interesting than what the bisection came up with: https://github.com/tensorflow/tensorflow/commit/9380a41290e8fb8b9ea85f614472deab56dbc481#diff-8e54a26c3d435aad346bfa12f4c6ec79
Another interesting DS change mingling with the batches could be: https://github.com/mozilla/DeepSpeech/commit/6b1d6773de25aaf1c1c157f8c11ecdd727f00c6d
Especially these lines: I can't see any changes or explanation in the usage of returned values from create_dataset(), so why are the output_types changed? https://github.com/mozilla/DeepSpeech/commit/6b1d6773de25aaf1c1c157f8c11ecdd727f00c6d#diff-2f5b069cc3a96ce123ef7356642acb29R143-R145 But I'm not that familiar with the code, so likely I'm missing something. EDIT: hmm, it seems I managed to miss the map() a few lines below and the changes to entry_to_features().
What I am also wondering about is how it could work by artificially limiting the max_sequence_length, since the data that we feed isn't itself changed. (So I would have expected it to blow up, because the sequences now seem longer than the max_sequence_length. Or does it just not process the last bit of (padded or non-padded) data, in which the culprit lies?)
There should be tensorflow code that already takes care of that. Now, maybe, for some reason, it's not working as expected in this case? Anyway, in the current state, we don't yet have any criterion to do so.
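My rough mental model of that, as a toy sketch with made-up shapes (not the actual tf.data / CudnnRNN pipeline): the batch stays padded to the longest example, and a separate length vector tells the RNN how many frames of each row are real, so reporting a shorter length only changes how much of each row gets read, not the stored data.

# Toy illustration of "padded batch + per-example lengths" with made-up shapes.
import numpy as np

feat_dim = 3
seqs = [np.ones((75, feat_dim)), np.ones((63, feat_dim))]   # two examples, 75 and 63 frames
lengths = np.array([len(s) for s in seqs], dtype=np.int32)  # [75, 63]

max_len = lengths.max()
batch = np.zeros((len(seqs), max_len, feat_dim), dtype=np.float32)
for i, s in enumerate(seqs):
    batch[i, :len(s)] = s                                   # rows are zero-padded to 75 frames

print(batch.shape, lengths)   # (2, 75, 3) [75 63]
# Reporting lengths - 1 (like the tf.shape(...)[0] - 1 hack) leaves `batch`
# untouched; the RNN is simply told to stop one frame earlier on each row.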
Interesting. I'm re-doing the bisect, with more runs on each test to ensure I avoid any intermittent behavior. Maybe with some luck, this will pop up. (Unfortunately your direct link just gives me the PR, not the direct diff you expected, so I'm not sure what part of the PR you mean.)
Another interesting DS change mingling with the batches could be: 6b1d677
Have you experimented before / after this commit ?
Re-doing bisect yields:
24297a4cb9120351643f7ac3916e7398236ccc0d is the first bad commit
commit 24297a4cb9120351643f7ac3916e7398236ccc0d
Author: Kaixi Hou <kaixih@nvidia.com>
Date: Fri Jul 19 13:41:25 2019 -0700
use padded IO for cudnn rnn only when necessary
tensorflow/core/kernels/cudnn_rnn_ops.cc | 42 +++++++++++++++++-----
tensorflow/stream_executor/cuda/cuda_dnn.cc | 13 ++++---
tensorflow/stream_executor/cuda/cuda_dnn.h | 3 +-
tensorflow/stream_executor/dnn.h | 4 ++-
.../stream_executor/stream_executor_pimpl.cc | 5 +--
tensorflow/stream_executor/stream_executor_pimpl.h | 3 +-
6 files changed, 52 insertions(+), 18 deletions(-)
https://github.com/tensorflow/tensorflow/commit/24297a4cb9120351643f7ac3916e7398236ccc0d https://github.com/tensorflow/tensorflow/pull/30889
I'll see how much that holds.
5 runs of an r1.15 build without this patch work like a charm on the repro case. I'm running 20 more, but if that holds, it means we actually have something much more actionable now.
Ahh this one does sound related :+1:
set -xe
apt-get install -y python3-venv libopus0
python3 -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install -U setuptools wheel pip
pip install .
pip uninstall -y tensorflow
pip install tensorflow-gpu==1.14
mkdir -p ../keep/summaries
data="${SHARED_DIR}/data" fis="${data}/LDC/fisher" swb="${data}/LDC/LDC97S62/swb" lbs="${data}/OpenSLR/LibriSpeech/librivox" cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips" npr="${data}/NPR/WAMU/sets/v0.3"
python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr \
  --use_allow_growth \
  --log_level 0
andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts
[TensorFlow deprecation warnings (Iterator.output_types/output_shapes/output_classes, the tf.contrib removal notice, init_ops dtype arguments, tf.where, Variable.initialized_value) and "Successfully opened dynamic library" lines (libcudart.so.10.0, libcublas.so.10.0, libcufft.so.10.0, libcurand.so.10.0, libcusolver.so.10.0, libcusparse.so.10.0, libcudnn.so.7) elided.]
2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:05:00.0
2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:06:00.0
2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:09:00.0
2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0
2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3
2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y
2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y
2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y
2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N
2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
D Session opened.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303
2020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
Traceback (most recent call last):
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
[[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
train()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set
feed_dict=feed_dict)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.
Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1':
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
train()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_andloss logits, = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn sequence_lengths=seq_length) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in call outputs = super(Layer, self).call(inputs, *args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in call outputs = call_fn(cast_inputs, *args, *kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper return converted_call(f, options, args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call return _call_unconverted(f, args, kwargs, options) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted return f(args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call training) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward seed=self._seed) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn outputs, output_h, outputc, , _ = gen_cudnn_rnn_ops.cudnn_rnnv3(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3 time_major=time_major, name=name) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = 
tf_stack.extract_stack()