I also got similar errors lately. In my case it often occurs at the end of an epoch. Training works normally for a few epochs before I get the error. Mine has some different numbers than yours:
Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 1101, 30, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
Reducing the batch size helped me get this error later in the training; this may be a workaround you can try.
I have tried reducing the batch size, but to no avail.
@andrenatal What version of CuDNN are you using? Currently TensorFlow 1.15 depends on CUDA 10.0 and CuDNN v7.6.
I tried all versions that @reuben suggested, including CuDNN 7.6
@andrenatal I know you already tested a lot of things, but this forum entry is interesting: https://forums.developer.nvidia.com/t/gpu-crashes-when-running-machine-learning-models/108252
Can you give it a spin with Python 3.7 ?
We tried running it with Python 3.7 but we faced the same error.
We tried running it with Python 3.7 but we faced the same error.
Then I'm sorry but the only way to get something actionable is bisecting on the dataset to identify the offending files and debug from there.
@lissyx As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; nothing had any effect.
But because, when it fails, it always consistently fails on the same step and thus the same batch, I tried to isolate things. I now have a small subset of my large dataset that always fails on epoch 27 with batch size 32, so it's under 1500 samples and thus manageable in size.
I made some discoveries though:
* So I tried with the sorting from the sample loading replaced with a random.shuffle(), and training with CUDNN now doesn't blow up. Even with the whole dataset (about 280000 samples).
So it seems that the combination (and probably order) of certain samples in a batch blows up with CUDNN consistently (while in any other combination or order, they don't).
I think the dataset subset is small enough to provide to you (around 20 MB of samples), if that could help you determine why it actually blows up. I can also provide the Docker build script, run script, logging, and the patches I applied to the v0.7.4 tree (only the printing of the files in each batch and replacing the sort with the shuffle).
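For reference, a minimal sketch of what that sort-to-shuffle patch amounts to. This is an illustration only, not the actual DeepSpeech loader code; it just uses the CSV columns already mentioned in this thread:

```python
import csv
import random

def load_samples(csv_path, shuffle=True):
    """Load DeepSpeech-style training samples from a CSV.

    Illustration only: the real loader lives in the training package.
    This just contrasts the default ordering by wav_filesize (the one
    that hits the failing batch) with the random.shuffle() workaround.
    """
    with open(csv_path, newline="") as f:
        samples = list(csv.DictReader(f))
    if shuffle:
        random.shuffle(samples)  # workaround: random sample order
    else:
        # original behaviour: shortest files first
        samples.sort(key=lambda s: int(s["wav_filesize"]))
    return samples
```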
I think the dataset subset is small enough to provide to you (around 20 MB of samples), if that could help you determine why it actually blows up.
If it's a bug in TensorFlow / CUDNN, it's hardly something we can help with. I'm already lacking time for a lot of other urgent matters, and it seems you have more background and knowledge on the issue than I do ...
* So I tried with the sorting from the sample loading replaced with a random.shuffle(), and training with CUDNN now doesn't blow up. Even with the whole dataset (about 280000 samples).
It would still be interesting if you could share the order when it works, when it fails, and where it fails.
I think the dataset subset is small enough to provide to you (around 20 MB of samples), if that could help you determine why it actually blows up.
If it's a bug in TensorFlow / CUDNN, it's hardly something we can help with. I'm already lacking time for a lot of other urgent matters, and it seems you have more background and knowledge on the issue than I do ...
I have merely reduced the problem space; I don't have knowledge of the TensorFlow / DeepSpeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).
Another question: I saw the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be? (I ask since the whole chain of CUDA 10, TensorFlow 1.15, etc. is probably unsupported by NVIDIA by now, so we probably won't get any support from that side either. And several people are now reporting issues with training on current DeepSpeech in this thread ...)
* So I tried with the sorting from the sample loading replaced with a random.shuffle(), and training with CUDNN now doesn't blow up. Even with the whole dataset (about 280000 samples).
It would still be interesting if you could share the order when it works, when it fails, and where it fails.
What would you like to have shared, only the CSV or also the samples? (I think the cause is somewhere in the samples and not the transcripts, but of course I could be wrong.)
Another question: I saw the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?
@reuben had a look at that; he knows better.
What would you like to have shared, only the CSV or also the samples? (I think the cause is somewhere in the samples and not the transcripts, but of course I could be wrong.)
I think you would need to share audio + CSV.
I have merely reduced the problem space; I don't have knowledge of the TensorFlow / DeepSpeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).
Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you covid-19).
Another question: I saw the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?
Lots
Another question: I saw the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?
Lots
That is unfortunate.
Another question: I saw the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?
@reuben had a look at that; he knows better.
What would you like to have shared, only the CSV or also the samples? (I think the cause is somewhere in the samples and not the transcripts, but of course I could be wrong.)
I think you would need to share audio + CSV.
OK, will do.
I have merely reduced the problem space; I don't have knowledge of the TensorFlow / DeepSpeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).
Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you covid-19).
OK, I will do some more experiments then and try to pinpoint it some more: find out whether only the batch content matters, or also the state the graph / weights are in from the previous steps. If only the batch content matters, I will test what happens if you only shuffle that.
@lissyx @reuben
Got the results of my extended testing, based on a minimal dataset of 3x 32 samples; as I use a batch size of 32, that is 3 steps. I named the batches A, B and C, and as a whole they are ordered by wav_filesize.
I have done runs with all sorts of combinations of these batches (concatenated in the order given in the name of the CSV file); if a batch name is appended with an "s", that batch is itself still ordered by wav_filesize, and if appended with an "r", that batch is randomly shuffled. The runs do 3 epochs.
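For clarity, a rough sketch of how such concatenated variants can be assembled (hypothetical helper; the batch CSV names below are placeholders):

```python
import csv
import random

FIELDNAMES = ["wav_filename", "wav_filesize", "transcript"]

def load(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def write(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)

def variant(batch, mode):
    """'s' keeps the batch sorted by wav_filesize, 'r' shuffles it."""
    rows = sorted(batch, key=lambda r: int(r["wav_filesize"]))
    if mode == "r":
        random.shuffle(rows)
    return rows

# e.g. build train_debug_As_Br_Cs.csv from three 32-sample batch CSVs
A, B, C = load("batch_A.csv"), load("batch_B.csv"), load("batch_C.csv")
write("train_debug_As_Br_Cs.csv",
      variant(A, "s") + variant(B, "r") + variant(C, "s"))
```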
In the tar.gz file I included:
As a summary of the results:
* train_debug_Ar_Br_Cr.csv: blows up in step 1, which is batch B
* train_debug_Ar_Br_Cs.csv: blows up in step 1, which is batch B
* train_debug_Ar_Bs_Cs.csv: blows up in step 1, which is batch B
* train_debug_As_Br_Cs.csv: blows up in step 1, which is batch B
* train_debug_As_Bs_Cs.csv: blows up in step 1, which is batch B
* train_debug_As_Cs_Bs.csv: blows up in step 2, which is batch B
* train_debug_As_Cs.csv: OK
* train_debug_Bs_Cs.csv: OK
* train_debug_Cs_As_Bs.csv: blows up in step 2, which is batch B
* train_debug_Cs_As.csv: OK
* train_debug_Cs_Bs_As.csv: blows up in step 1, which is batch B
* train_debug_Cs_Bs.csv: blows up in step 1, which is batch B
* train_debug_interbatch_random: all variants OK
My interpretation of these results:
But what is so special about the content of batch B that it blows up with CUDNN ...
(Before you ask: it is not only this batch B; there are multiple such batches in my large dataset, this is just one example with the shortest samples.)
deepspeech_v0.7.4_cudnn_debug.tar.gz
Nice @applied-machinelearning. Do you think you could even reduce batch B to a smaller set of files ? Maybe if we can know which file(s) triggers the behavior it might be easier to know about / check ?
Nice @applied-machinelearning. Do you think you could even reduce batch B to a smaller set of files ? Maybe if we can know which file(s) triggers the behavior it might be easier to know about / check ?
I could try reducing the training batch size and see if I can find even smaller batches that fail (from previous tests I think it will end at either 2 or 4, but not 1). Will give it a try tomorrow.
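As a sketch of how that pair hunting could be scripted: a hypothetical driver with placeholder paths, using only flags already shown elsewhere in this thread (other required flags such as --alphabet_config_path, plus per-run checkpoint and feature-cache cleanup, are omitted for brevity):

```python
import csv
import itertools
import subprocess

FIELDNAMES = ["wav_filename", "wav_filesize", "transcript"]

def pair_fails(sample_a, sample_b):
    """Write a two-row CSV and run a single-epoch, batch-size-2 training
    run on it; a non-zero exit code is treated as the CUDNN blow-up."""
    with open("pair.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows([sample_a, sample_b])
    result = subprocess.run(
        ["python", "DeepSpeech.py",
         "--train_files", "pair.csv",
         "--train_batch_size", "2",
         "--epochs", "1",
         "--train_cudnn"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode != 0

def find_failing_pairs(samples):
    # quadratic in the number of candidate samples, so only practical
    # once the suspect set is already small (as in the batches above)
    return [(a, b) for a, b in itertools.combinations(samples, 2)
            if pair_fails(a, b)]
```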
As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; nothing had any effect.
So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)
That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?
@lissyx @reuben
OK I have done some more runs:
I ran train_debug_As_Bs_Cs.csv with batch sizes 1 and 2:
Batch size 1 trains fine.
Batch size 2 blows up on the step with files:
B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav
B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav
So I made some new csv files with:
* batch A: two files from the original batch A
* batch B: two files, B/98_2923 and B/154_4738, from batch B
* batch C: two files from the original batch C
And I made some variants of that:
* train_debug_mini_As_Bs_Cs.csv
* train_debug_mini_Bs_As_Cs.csv
* train_debug_mini_Bs_As_Cs_B_swapped.csv
* train_debug_mini_As_Bs_Cs_B_swapped.csv
* train_debug_mini_As_Bs_Cs_B_mixed_A.csv
* train_debug_mini_As_Bs_Cs_B_mixed_C.csv
* train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
* train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv
The results of that:
With batch size 1, these all work out fine (as expected).
With batch size 2:
* train_debug_mini_As_Bs_Cs.csv blows up in step 1, which is batch B.
* train_debug_mini_As_Bs_Cs_B_swapped.csv blows up in step 1, which is batch B, so swapping the order within B doesn't make a difference.
* train_debug_mini_Bs_As_Cs.csv works fine; B is the first step (step 0), as expected, since the first step seems to be a special case.
* train_debug_mini_Bs_As_Cs_B_swapped.csv works fine; B is the first step (step 0), so swapping the order in B doesn't make a difference, as expected, since the first step seems to be a special case.
* train_debug_mini_As_Bs_Cs_B_mixed_A.csv blows up in step 1, which is: A/155_4757 + B/154_4738.
* train_debug_mini_As_Bs_Cs_B_mixed_C.csv blows up in step 1, which is: B/98_2923 + C/169_5271.
* train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv blows up in step 1, which is: C/169_5271 + B/98_2923.
* train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv blows up in step 2, which is: B/98_2923 + C/169_5271, while it did complete step 1, which is: B/154_4738 + C/175_5429.
My interpretation of this all:
So it is a bit odd; I'm starting to wonder if this is some edge case where we hit some math operation blowing up. But both files from B have slightly different file sizes, and both blow up in combination with other files that have slightly different file sizes (from A and C).
So I'm a bit lost now; you have more insight into how things get processed, so hopefully you have some more ideas based on that.
CSVs and logs are attached (the sample files from the previous post can be used):
train_debug_mini.tar.gz
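In case it helps, a quick way to compare the raw lengths of the suspect samples (stdlib only; the paths below are placeholders for the actual file names):

```python
import wave

# Placeholder paths: compare the raw length of the two suspect samples,
# since the failing combinations seem tied to the relative sample lengths
# within a batch rather than to the transcripts.
for path in ["B/98_2923.wav", "B/154_4738.wav"]:
    with wave.open(path, "rb") as w:
        frames, rate = w.getnframes(), w.getframerate()
        print(f"{path}: {frames} frames, {frames / rate:.3f} s at {rate} Hz")
```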
As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; nothing had any effect.
So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)
I think it was an Italian DS/CV repo I drew inspiration from, but they probably took it from the French one ;). Previously I also tried a Docker build with an ubuntu18.04-cuda10 image as a base, with tensorflow-gpu 1.15.3.
That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?
Host is an AMD Ryzen system with 32 GB of RAM and a GTX 1070 with 8 GB of memory, running Debian. The host NVIDIA driver is now 440.100 (but I have tried several others, still the same problems). If you need more specifics, please tell me what you need.
Thanks for looking into it !
Thanks, running Sid as well here, so I'm on similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.
Thanks, running Sid as well here, so I'm on similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.
I'm running Buster on that machine. When I woke up this morning, it dawned on me that I forgot to post the hyperparameter stuff, so attached is the script I used in the Docker container to run the tests. Feature cache, checkpoint dir, etc. all get cleaned up before the run.
run_deepspeech_var_batchsize.sh.tar.gz
I hope you can reproduce and spot something !
Looks like clean.sh is missing, as well as: FATAL Flags parsing error: flag --alphabet_config_path=./data/lm/plaintext_alpha.txt: The file pointed to by --alphabet_config_path must exist and be readable.
I don't want to sound rude, but could you assemble a fool-proof Docker image or script to minimally repro the issue? There is already enough complexity and there are enough variables interacting; I really need to be 1000% sure I am reproducing your exact steps to assert whether I can reproduce the issue :/
I'm not even able to get CUDA working so far in the dockerfile :/
I'm not even able to get CUDA working so far in the dockerfile :/
Seems to be the same old weird nvidia/cuda/docker bug; after ldconfig it works:
tf-docker ~ > sudo ldconfig
tf-docker ~ > nvidia-smi
Thu Jul 9 10:16:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100 Driver Version: 440.100 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:21:00.0 Off | N/A |
| 0% 34C P8 1W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:4B:00.0 Off | N/A |
| 0% 35C P8 20W / 250W | 0MiB / 11019MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
tf-docker ~ > python -c "import tensorflow as tf; tf.test.is_gpu_available()"
2020-07-09 10:16:48.233166: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 10:16:48.264242: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900325000 Hz
2020-07-09 10:16:48.271101: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d55f00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-09 10:16:48.271144: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2020-07-09 10:16:48.272884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-09 10:16:54.029647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.046529: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.047194: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d58840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-09 10:16:54.047218: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-09 10:16:54.047253: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-09 10:16:54.047656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.048468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:21:00.0
2020-07-09 10:16:54.048551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.049324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:4b:00.0
2020-07-09 10:16:54.049585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-09 10:16:54.057643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-09 10:16:54.061562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-09 10:16:54.066658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-09 10:16:54.077684: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-09 10:16:54.081287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-09 10:16:54.107985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-09 10:16:54.108254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.109206: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.110043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.110885: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.111644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1
2020-07-09 10:16:54.111707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-09 10:16:54.113783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-09 10:16:54.113802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1
2020-07-09 10:16:54.113811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N N
2020-07-09 10:16:54.113821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: N N
2020-07-09 10:16:54.113979: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.114808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.115627: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.116444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:21:00.0, compute capability: 7.5)
2020-07-09 10:16:54.117023: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.117508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:1 with 10311 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5)
@applied-machinelearning Good news, I repro your issue.
@applied-machinelearning Not only do I repro, but apt update && apt upgrade changes the issue: first it was exploding at epoch 1, now at epoch 2.
Several people report similar issue with NVIDIA drivers above a certain version: https://github.com/tensorflow/tensorflow/issues/35950#issuecomment-577427083, and 431.36 would be a working one.
Fun: gpu_options=tfv1.GPUOptions(per_process_gpu_memory_fraction=0.05) triggers the issue at the very beginning.
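For context, this is the kind of session configuration meant above (a sketch only, not the exact place in the training code where it was applied):

```python
import tensorflow.compat.v1 as tfv1

# Sketch: shrink the per-process GPU memory pool as described above;
# with only 5% of the GPU memory available the failure shows up right away.
gpu_options = tfv1.GPUOptions(per_process_gpu_memory_fraction=0.05)
config = tfv1.ConfigProto(gpu_options=gpu_options)

with tfv1.Session(config=config) as session:
    pass  # run the training graph under the constrained memory pool
```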
Looks like clean.sh is missing, as well as: FATAL Flags parsing error: flag --alphabet_config_path=./data/lm/plaintext_alpha.txt: The file pointed to by --alphabet_config_path must exist and be readable. I don't want to sound rude, but could you assemble a fool-proof Docker image or script to minimally repro the issue? There is already enough complexity and there are enough variables interacting; I really need to be 1000% sure I am reproducing your exact steps to assert whether I can reproduce the issue :/
Sorry for that, didn't expect you to run it literally.
Several people report similar issue with NVIDIA drivers above a certain version: tensorflow/tensorflow#35950 (comment), and 431.36 would be a working one.
Thanks for figuring this out; my google-fu didn't turn it up.
Hmm, I will see if I can give that driver a shot this evening, although I can't find 431.36 in the download archive at https://www.nvidia.com/en-us/drivers/unix/linux-amd64-display-archive/
This one seems to be the closest:
Version: 430.64 Operating System: Linux 64-bit Release Date: November 5, 2019
And it probably means downgrading the kernel as well to something semi-ancient :( (edit: hmm, from the description it should compile with kernel 5.4, so not too ancient)
Version: 430.64 Operating System: Linux 64-bit Release Date: November 5, 2019
And it probably means downgrading the kernel as well to something semi-ancient :(
On Buster you might have more chances to succeed compared to me on Sid.
On Buster you might have more chances to succeed compared to me on Sid.
Kernels should be fairly independent of the rest of the system.
Are you going to address this with NVIDIA, or do you know the best way to do so? (It seems the problem itself has been noted for quite some time without a fix appearing in newer drivers.)
I have no idea ?
There are some hints in some of the reports that it might be related to the ordering of sequence_length; I'd like to get a better grasp of that and confirm it, so maybe we could at least have some tooling / workaround to help with that.
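A tiny, hypothetical diagnostic along those lines, just logging each batch's sequence_length vector and whether it is monotonically ordered, could look like this:

```python
# Hypothetical diagnostic: log each batch's sequence_length vector and
# whether it is monotonically ordered, so failing batches can be compared
# against batches that train fine.
def log_batch_seq_lengths(step, seq_lengths):
    seq_lengths = list(seq_lengths)
    non_decreasing = all(a <= b for a, b in zip(seq_lengths, seq_lengths[1:]))
    print(f"step {step}: seq_lengths={seq_lengths} non_decreasing={non_decreasing}")
```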
@applied-machinelearning For fun: at some point, some combination of dataset, driver and TensorFlow version on our codebase would trigger a power surge on my hardware at home, and it was too much for my PSU, which kept shutting down :/
@applied-machinelearning While not a workaround I like, it seems to help moving forward: changing to TensorFlow 1.14 gets me through the small example. Is it something you could test on the full / repro dataset on your side?
FROM tensorflow/tensorflow:1.14.0-gpu-py3
Sure will test that before changing the driver.
Sure will test that before changing the driver.
Like, I'm not sure whether it's just a side effect of a different TensorFlow version scheduling things differently (as you said, ordering is a point that matters), or whether it's because it depends on cuDNN 7.4 instead of 7.6, which might behave differently on that point.
Hmm, a bit busy and tired this evening, so I will postpone most testing till tomorrow, but I have done some tests with tensorflow/tensorflow:1.14.0-gpu-py3 and the 440.100 driver (the one I used with the failing TF 1.15 image tests as well).
I have done all tests except the full-dataset one (so the 1500 samples, the 3x 32-sample batches and the 3x 2-sample batches) and all succeed with the TF 1.14 image, so I think you are correct. It's still debatable whether it's TF or cuDNN, but if I had to bet, I would bet on the different cuDNN version.
Will test the driver downgrade tomorrow and after that a run on the full dataset.
OK, good to know we're making progress. I'm trying to check how sequence_length variations are related.
I ran the test with different drivers, preliminary results (will do a long test after this):
| NVIDIA host driver | Docker base image | Short tests | Long test |
|---|---|---|---|
| 440.100 | tensorflow/tensorflow:1.14.0-gpu-py3 | worked | |
| 440.100 | tensorflow/tensorflow:1.15.2-gpu-py3 | failed | |
| 430.64 | tensorflow/tensorflow:1.14.0-gpu-py3 | worked | |
| 430.64 | tensorflow/tensorflow:1.15.2-gpu-py3 | failed | |
| 450.57 | tensorflow/tensorflow:1.14.0-gpu-py3 | worked | worked |
| 450.57 | tensorflow/tensorflow:1.15.2-gpu-py3 | failed | failed |
440.100 was the driver I was using originally. 430.64 is the driver downloadable just below the 431.36 that was reported as working on the TF forum (NVIDIA's versioning scheme could differ, so it may actually not be below 431.36, but it was my best guess). 450.57 is the latest stable driver, released yesterday.
So from this I would take it that the host driver version doesn't matter. And I haven't been able to prove that the TF 1.14 image doesn't work :)
Will start a long test now with the TF14 image.
I just verified, and I repro with cuDNN v7.6.1 as well. I think I should try to rebuild the TF 1.15.2 Docker image with cuDNN 7.6, 7.5 and 7.4 to assert here.
Updated the table above; I think I'm convinced enough to say that the TF 1.14 image doesn't have the problem. Hope you succeed in pinning it to a particular cuDNN version.
Ok, it required a bit of hacking, but I leveraged TensorFlow's CI build scripts to produce some 1.15.2 CUDA-enabled Python 3.6 wheels with different cudnn7 linkage; currently I have 7.4 done, with 7.5 in progress and soon finished. Next steps are:
So, TensorFlow r1.15.2 CUDA 10.0, with host driver 440.100:
* libcudnn 7.6.5.32: fail
* libcudnn 7.5.1.10: fail
* libcudnn 7.4.2.24: success
To build the TensorFlow wheels:
git clone https://github.com/tensorflow/tensorflow --branch r1.15
In tensorflow/tools/ci_build/, run:
CI_DOCKER_BUILD_EXTRA_PARAMS="--build-arg cudnn=7.4.2.24" CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3.6" BUILD_TAG=tf-py3-cudnn-7.4.2.24 ./ci_build.sh gpu tensorflow/tools/ci_build/builds/pip.sh gpu -c opt --config=cuda
CI_DOCKER_BUILD_EXTRA_PARAMS="--build-arg cudnn=7.5.1.10" CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3.6" BUILD_TAG=tf-py3-cudnn-7.5.1.10 ./ci_build.sh gpu tensorflow/tools/ci_build/builds/pip.sh gpu -c opt --config=cuda
CI_DOCKER_BUILD_EXTRA_PARAMS="--build-arg cudnn=7.6.5.32" CI_DOCKER_EXTRA_PARAMS="-e CI_BUILD_PYTHON=python3.6" BUILD_TAG=tf-py3-cudnn-7.6.5.32 ./ci_build.sh gpu tensorflow/tools/ci_build/builds/pip.sh gpu -c opt --config=cuda
Each build produces a wheel at pip_test/whl/tensorflow_gpu-1.15.3-cp36-cp36m-manylinux1_x86_64.whl
To repro the issue:
docker build -f Dockerfile.deepspeech-v0.7.4-reduced --build-arg cudnn=7.4.2.24 . --tag issue3088:7.4.2.24
docker build -f Dockerfile.deepspeech-v0.7.4-reduced --build-arg cudnn=7.5.1.10 . --tag issue3088:7.5.1.10
docker build -f Dockerfile.deepspeech-v0.7.4-reduced --build-arg cudnn=7.6.5.32 . --tag issue3088:7.6.5.32
docker run --runtime=nvidia --rm issue3088:7.6.5.32
docker run --runtime=nvidia --rm issue3088:7.5.1.10
docker run --runtime=nvidia --rm issue3088:7.4.2.24
@applied-machinelearning It would be awesome if you could cross-check on your side; with just the cuDNN version varying, we limit the risk of the issue simply being masked by a different TensorFlow version.
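To double-check which cuDNN build each container actually loads at runtime, a small ctypes sketch can be run inside each image (cudnnGetVersion() returns e.g. 7605 for 7.6.5 and 7402 for 7.4.2):

```python
import ctypes

# Ask the loaded cuDNN library for its version number to confirm which
# build a given container is actually exercising.
libcudnn = ctypes.CDLL("libcudnn.so.7")
libcudnn.cudnnGetVersion.restype = ctypes.c_size_t
print("cuDNN runtime version:", libcudnn.cudnnGetVersion())
```

This only reports the library that the dynamic loader picks up, which is exactly the ambiguity the per-cudnn images are meant to remove.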
For support and discussions, please use our Discourse forums.
If you've found a bug, or have a feature request, then please create an issue with the following information:
set -xe
apt-get install -y python3-venv libopus0
python3 -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install -U setuptools wheel pip
pip install .
pip uninstall -y tensorflow
pip install tensorflow-gpu==1.14
mkdir -p ../keep/summaries
data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"
python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr \
  --use_allow_growth \
  --log_level 0
andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts
tf.compat.v1.data.get_output_types(iterator)
. W0618 12:30:10.218584 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_types(iterator)
. WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_shapes(iterator)
. W0618 12:30:10.218781 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_shapes(iterator)
. WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_classes(iterator)
. W0618 12:30:10.218892 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_classes(iterator)
. WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:W0618 12:30:10.324707 139639980619584 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dt ype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a f uture version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype i s deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. 
Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. 2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:05:00.0 2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:06:00.0 2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:09:00.0 2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0 2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3 2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y 2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y 2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y 2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N 2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created 
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute ca pability: 6.1) D Session opened. I Could not find best validating checkpoint. I Could not find most recent checkpoint. I Initializing all variables. 2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 I STARTING Optimization Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 Epoch 0 | Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303 2 020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.param s_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), w orkspace.size(), reserve_space.opaque(), reserve_space.size())' 2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_uni ts, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] Traceback (most recent call last): File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. 
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]] (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]] [[tower_2/CTCLoss/_147]] 1 successful operations. 2 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last): File "DeepSpeech.py", line 12, in
ds_train.run_script()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
train()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set
feed_dict=feed_dict)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.
Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1': File "DeepSpeech.py", line 12, in
ds_train.run_script()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
train()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_andloss logits, = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn sequence_lengths=seq_length) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in call outputs = super(Layer, self).call(inputs, *args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in call outputs = call_fn(cast_inputs, *args, *kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper return converted_call(f, options, args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call return _call_unconverted(f, args, kwargs, options) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted return f(args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call training) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward seed=self._seed) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn outputs, output_h, outputc, , _ = gen_cudnn_rnn_ops.cudnn_rnnv3(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3 time_major=time_major, name=name) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = 
tf_stack.extract_stack()