mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 31 , 12, 2048] #3088

Closed andrenatal closed 4 years ago

andrenatal commented 4 years ago

For support and discussions, please use our Discourse forums.

If you've found a bug, or have a feature request, then please create an issue with the following information:

set -xe

apt-get install -y python3-venv libopus0

python3 -m venv /tmp/venv

source /tmp/venv/bin/activate

pip install -U setuptools wheel pip

pip install .

pip uninstall -y tensorflow

pip install tensorflow-gpu==1.14

mkdir -p ../keep/summaries

data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"

python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr \
  --use_allow_growth \
  --log_level 0


I'm getting the following error when using my pt-BR 8 kHz dataset to train. I have tried downgrading and upgrading CUDA, cuDNN, the NVIDIA drivers, and Ubuntu (16.04 and 18.04), and the error persists. I have tried datasets with two different characteristics: 6 s and 15 s in length. Both contain audio at 8 kHz.

andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts

W0618 12:30:10.324707 139639980619584 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dt ype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a f uture version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype i s deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. 
Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. 2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:05:00.0 2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:06:00.0 2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:09:00.0 2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0 2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3 2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y 2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y 2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y 2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N 2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created 
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute ca pability: 6.1) D Session opened. I Could not find best validating checkpoint. I Could not find most recent checkpoint. I Initializing all variables. 2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 I STARTING Optimization Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 Epoch 0 | Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303 2 020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.param s_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), w orkspace.size(), reserve_space.opaque(), reserve_space.size())' 2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_uni ts, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] Traceback (most recent call last): File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. 
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]] (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]] [[tower_2/CTCLoss/_147]] 1 successful operations. 2 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "DeepSpeech.py", line 12, in ds_train.run_script() File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script absl.app.run(main) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run _run_main(main, args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main train() File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train trainloss, = run_set('train', epoch, train_init_op) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set feed_dict=feed_dict) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]] (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]] [[tower_2/CTCLoss/_147]] 1 successful operations. 2 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1': File "DeepSpeech.py", line 12, in ds_train.run_script() File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script absl.app.run(main) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run _run_main(main, args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main train()

File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_andloss logits, = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn sequence_lengths=seq_length) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in call outputs = super(Layer, self).call(inputs, *args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in call outputs = call_fn(cast_inputs, *args, *kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper return converted_call(f, options, args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call return _call_unconverted(f, args, kwargs, options) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted return f(args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call training) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward seed=self._seed) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn outputs, output_h, outputc, , _ = gen_cudnn_rnn_ops.cudnn_rnnv3(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3 time_major=time_major, name=name) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = 
tf_stack.extract_stack()

applied-machinelearning commented 4 years ago

Great work @lissyx! I'll see if I can find some time this weekend to get that stuff to work.

Was 7.4.2.24 the last of the cuDNN 7.4 versions? That would suggest the problem was introduced somewhere in the 7.5 series.

lissyx commented 4 years ago

Great work @lissyx! I'll see if I can find some time this weekend to get that stuff to work.

Was 7.4.2.24 the last of the cuDNN 7.4 versions? That would suggest the problem was introduced somewhere in the 7.5 series.

It could still be in a lot of places, who knows exactly. But even if it's inconvenient, it might at least help unblock things.

applied-machinelearning commented 4 years ago

@lissyx

My results are different; logs are attached:

lissyx commented 4 years ago

So maybe it was just luck, or we lack another parameter. Please note we don't have the same GPUs, and so not the same memory. Maybe that is why.

applied-machinelearning commented 4 years ago

On July 13, 2020 6:03:47 PM GMT+02:00, lissyx notifications@github.com wrote:

So maybe it was just luck, or we lack another parameter. Please note we don't have the same GPUs, and so not the same memory. Maybe that is why.

What we haven't tested is TF 1.14 builds with those cuDNN versions.

lissyx commented 4 years ago

On July 13, 2020 6:03:47 PM GMT+02:00, lissyx @.***> wrote: So maybe it was just luck, or we lack another parameter. Please note we don't have the same GPUs, and so not the same memory. Maybe that is why. What we haven't tested is TF 1.14 builds with those cuDNN versions.

I can reproduce the issue here on 7.4 as well by limiting visible GPU to only one (number 0 or 1). When I expose both, it works.
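
(For anyone trying to reproduce that setup: a minimal sketch of restricting TensorFlow to a single GPU, assuming the standard CUDA_VISIBLE_DEVICES mechanism; the exact method used here isn't stated.)

import os

# CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes CUDA.
# "0" exposes only the first GPU, "0,1" the first two, and so on.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import tensorflow as tf

print(tf.test.gpu_device_name())  # expected: /device:GPU:0 (a single device)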

lissyx commented 4 years ago

@applied-machinelearning I found that hack to help locally, after getting more repro:

tf-docker ~/ds > git diff
diff --git a/training/deepspeech_training/util/feeding.py b/training/deepspeech_training/util/feeding.py
index 4c9b681d..4cddca22 100644
--- a/training/deepspeech_training/util/feeding.py
+++ b/training/deepspeech_training/util/feeding.py
@@ -48,7 +48,7 @@ def audio_to_features(audio, sample_rate, transcript=None, clock=0.0, train_phas
     if train_phase and augmentations is not None:
         features = apply_graph_augmentations('features', features, augmentations, transcript=transcript, clock=clock)

-    return features, tf.shape(input=features)[0]
+    return features, tf.shape(input=features)[0] - 1

 def audiofile_to_features(wav_filename, clock=0.0, train_phase=False, augmentations=None):

I can't explain why yet, and I'd like your feedback: can you corroborate whether it helps on all your repro cases or not?

applied-machinelearning commented 4 years ago

On July 13, 2020 6:03:47 PM GMT+02:00, lissyx @.***> wrote: So maybe it was just luck, or we lack another parameter. Please note we don't have the same GPUs, and so not the same memory. Maybe that is why. What we haven't tested is TF 1.14 builds with those cuDNN versions.

I can reproduce the issue here on 7.4 as well by limiting visible GPU to only one (number 0 or 1). When I expose both, it works.

I don't know the inner workings of multi-GPU training, but if it interleaves the batches, then with the very small test set your second GPU could get batch B as that GPU's first step, so the special case could apply there. I wonder whether it still works with multi-GPU if you repeat some of the other batches before batch B, so that it is never the first step of a GPU.

BTW, over the last few days I have trained all my datasets on the image based on tensorflow/tensorflow:1.14.0-gpu-py3 and I haven't had a problem. The only issue is not being able to get "convert_graphdef_memmapped_format" via taskcluster, since that file is gone from the Mozilla infrastructure for the 1.14 branch.

@applied-machinelearning I found that hack to help locally, after getting more repro: ... I can't explain why yet, and I'd like your feedback: can you corroborate whether it helps on all your repro cases or not?

I will give that a shot this evening. :)

lissyx commented 4 years ago

hacking the stride value also seems to do something (obviously, I have no idea why):

diff --git a/training/deepspeech_training/util/feeding.py b/training/deepspeech_training/util/feeding.py
index 4c9b681d..ae50e4f9 100644
--- a/training/deepspeech_training/util/feeding.py
+++ b/training/deepspeech_training/util/feeding.py
@@ -33,7 +33,7 @@ def audio_to_features(audio, sample_rate, transcript=None, clock=0.0, train_phas

     spectrogram = contrib_audio.audio_spectrogram(audio,
                                                   window_size=Config.audio_window_samples,
-                                                  stride=Config.audio_step_samples,
+                                                  stride=Config.audio_step_samples+1,
                                                   magnitude_squared=True)

     if train_phase and augmentations is not None:
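
(What both hacks have in common, arithmetically, is that they shave frames off the feature sequence the RNN is told about. A rough sketch of that arithmetic, assuming the usual spectrogram framing of 1 + (samples - window) // stride and DeepSpeech's default 32 ms window / 20 ms step; it says nothing about why cuDNN then stops crashing.)

def n_frames(n_samples, window, stride):
    # number of full analysis windows that fit in the clip
    return 0 if n_samples < window else 1 + (n_samples - window) // stride

sample_rate = 8000                  # the 8 kHz data from this issue
window = int(0.032 * sample_rate)   # audio_window_samples = 256
stride = int(0.020 * sample_rate)   # audio_step_samples = 160
samples = 12100                     # a ~1.5 s clip, which yields 75 frames here

print(n_frames(samples, window, stride))       # 75: baseline sequence length
print(n_frames(samples, window, stride) - 1)   # 74: the "features_len - 1" hack
print(n_frames(samples, window, stride + 1))   # 74: the "stride + 1" hack
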
lissyx commented 4 years ago

The only issue is not being able to get "convert_graphdef_memmapped_format" via taskcluster, since that file is gone from the Mozilla infrastructure for the 1.14 branch.

You can just rebuild it; it's a bit time-consuming but not complicated.

I don't know the inner workings of multi-GPU training, but if it interleaves the batches, then with the very small test set your second GPU could get batch B as that GPU's first step, so the special case could apply there. I wonder whether it still works with multi-GPU if you repeat some of the other batches before batch B, so that it is never the first step of a GPU.

Yeah, but we still don't know what that special case is here.

lissyx commented 4 years ago

FTR the offending call is at https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1785-L1798

And that's directly within libcudnn7 :/

lissyx commented 4 years ago

Here is also a report that downgrading to driver v431.36 fixes a very similar error: https://stackoverflow.com/questions/62612226/tensorflow-check-failed-status-cudnn-status-success-7-vs-0failed-to-set-c

applied-machinelearning commented 4 years ago

@applied-machinelearning I found that hack to help locally, after getting more repro:

tf-docker ~/ds > git diff
diff --git a/training/deepspeech_training/util/feeding.py b/training/deepspeech_training/util/feeding.py
index 4c9b681d..4cddca22 100644
--- a/training/deepspeech_training/util/feeding.py
+++ b/training/deepspeech_training/util/feeding.py
@@ -48,7 +48,7 @@ def audio_to_features(audio, sample_rate, transcript=None, clock=0.0, train_phas
     if train_phase and augmentations is not None:
         features = apply_graph_augmentations('features', features, augmentations, transcript=transcript, clock=clock)

-    return features, tf.shape(input=features)[0]
+    return features, tf.shape(input=features)[0] - 1

 def audiofile_to_features(wav_filename, clock=0.0, train_phase=False, augmentations=None):

I can't explain why yet, and I'd like your feedback: can you corroborate whether it helps on all your repro cases or not?

So this one works for me as well.

I also printed the original shape; for batch B it is 75, which seems to match the max_sequence_length of 75 in the exception from cuDNN when things do crash.

train_debug_mini_As_Bs_Cs.log

applied-machinelearning commented 4 years ago

FTR the offending call is at https://github.com/tensorflow/tensorflow/blob/r1.15/tensorflow/stream_executor/cuda/cuda_dnn.cc#L1785-L1798

And that's directly within libcudnn7 :/

Yeah, that was likely, and unfortunately there is no such thing as a neat error message.

Here is also a report that downgrading to driver v431.36 fixes a very similar error: https://stackoverflow.com/questions/62612226/tensorflow-check-failed-status-cudnn-status-success-7-vs-0failed-to-set-c

Hmmm, I dusted off my Google-fu, but still could not find a Linux download of v431.36. However, the release date (for Windows at least) for v431.36 seems to be 07-09-2019. What I tested was 430.64, which is lower in version number but later in release date: November 5, 2019.

So tomorrow I will see if I can test with 430.40, which has a release date of July 29, 2019, so both metrics are lower.

lissyx commented 4 years ago

I also printed the original shape; for batch B it is 75, which seems to match the max_sequence_length of 75 in the exception from cuDNN when things do crash.

I thought the same, but when hacking and forcing +1 on features_len, the crash would happen on the value 76, and the previous values would pass as 75 without problems, it seems (the error also changed).

applied-machinelearning commented 4 years ago

OK, so I have tried extra drivers released before the infamous "v431.36": 410.93, 418.74, 418.88, 430.34. None of them works for me.

lissyx commented 4 years ago

I'm trying, but so far failing, to build a TF 1.15 pip package with some debug enabled, outside of the Docker setup they have, so I can at least get more insight into the offending call.

applied-machinelearning commented 4 years ago

Hmm, I finally figured out the probable cuDNN version of the tensorflow/tensorflow:1.14.0-gpu-py3 image. According to https://hub.docker.com/layers/tensorflow/tensorflow/1.14.0-gpu-py3/images/sha256-e72e66b3dcb9c9e8f4e5703965ae1466b23fe8cad59e1c92c6e9fa58f8d81dc8?context=explore it should be CUDA 10.0.130-1 with cuDNN 7.4.1.5-1. The lowest cuDNN we checked with the images you built was issue3088:7.4.2.24; could it be worthwhile to also check a build with 7.4.1? I don't see anything very obviously related in the cuDNN release notes at https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_742 though.
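
(One way to stop guessing from image metadata is to ask the loaded library itself: cudnnGetVersion() is part of the public cuDNN API. A small sketch via ctypes, assuming libcudnn.so.7 is on the loader path inside the container and the usual major*1000 + minor*100 + patch encoding.)

import ctypes

cudnn = ctypes.CDLL("libcudnn.so.7")          # the same library TF logs as opened
cudnn.cudnnGetVersion.restype = ctypes.c_size_t

v = cudnn.cudnnGetVersion()                   # e.g. 7401 for 7.4.1, 7605 for 7.6.5
print("cuDNN %d.%d.%d" % (v // 1000, (v % 1000) // 100, v % 100))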

lissyx commented 4 years ago

The lowest cuDNN we checked with the images you built was issue3088:7.4.2.24; could it be worthwhile to also check a build with 7.4.1?

Pretty sure I don't even need to rebuild, I'll check that later.

applied-machinelearning commented 4 years ago

hacking the stride value also seems to do something (obviously, I have no idea why):

diff --git a/training/deepspeech_training/util/feeding.py b/training/deepspeech_training/util/feeding.py
index 4c9b681d..ae50e4f9 100644
--- a/training/deepspeech_training/util/feeding.py
+++ b/training/deepspeech_training/util/feeding.py
@@ -33,7 +33,7 @@ def audio_to_features(audio, sample_rate, transcript=None, clock=0.0, train_phas

     spectrogram = contrib_audio.audio_spectrogram(audio,
                                                   window_size=Config.audio_window_samples,
-                                                  stride=Config.audio_step_samples,
+                                                  stride=Config.audio_step_samples+1,
                                                   magnitude_squared=True)

     if train_phase and augmentations is not None:

I tried this patch now, and it works for the small sets. The max_sequence_length for batch B has gone from 75 to 74 now.

But if I run the larger test set (train_differ_para_sorted_wav_filesize.log), it still blows up, now on files that end up having a max_sequence_length of 75 ...

train_debug_As_Bs_Cs.log train_debug_mini_As_Bs_Cs.log train_differ_para_sorted_wav_filesize.log

lissyx commented 4 years ago

The lowest cuDNN we checked with the images you built was issue3088:7.4.2.24; could it be worthwhile to also check a build with 7.4.1?

Downgraded to 7.4.1.5:

tf-docker ~ > apt-cache policy libcudnn7
libcudnn7:
  Installed: 7.4.1.5-1+cuda10.0
  Candidate: 7.6.5.32-1+cuda10.2
  Version table:
     7.6.5.32-1+cuda10.2 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.5.32-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.5.32-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.4.38-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.4.38-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.3.30-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.3.30-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.2.24-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.2.24-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.1.34-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.1.34-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.0.64-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.6.0.64-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.5.1.10-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.5.1.10-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.5.0.56-1+cuda10.1 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.5.0.56-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.4.2.24-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
 *** 7.4.1.5-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
        100 /var/lib/dpkg/status
     7.3.1.20-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages
     7.3.0.29-1+cuda10.0 500
        500 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Packages

Still blows up.

lissyx commented 4 years ago

After a lot of hacking, I've been able to rebuild locally outside of their Docker (easier for playing with gdb), building and running against a pyenv-built Python, and that build reproduces the issue, so I'm preparing a debug build.

lissyx commented 4 years ago

Debug build with CUDA is ... challenging. Trying this as suggested: https://github.com/tensorflow/tensorflow/issues/28091#issuecomment-488327539

lissyx commented 4 years ago

The road to a debug build is ... complicated.

[12,032 / 15,573] 305 actions, 128 running
    Compiling tensorflow/core/kernels/cwise_op_gpu_bitwise_and.cu.cc [for host]; 106s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_bitwise_or.cu.cc [for host]; 106s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_bitwise_xor.cu.cc [for host]; 106s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_add.cu.cc [for host]; 106s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_div.cu.cc [for host]; 104s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_equal_to.cu.cc [for host]; 102s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_left_shift.cu.cc [for host]; 101s local
    Compiling tensorflow/core/kernels/cwise_op_gpu_floor_div.cu.cc [for host]; 99s local ...

Server terminated abruptly (error code: 14, error message: 'Socket closed', log file: '/home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out')

alexandre@serveur:~/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx$ aaa^C
alexandre@serveur:~/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx$ ll /home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out
-rw-r--r-- 1 alexandre alexandre 822 17 juil. 18:41 /home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out
alexandre@serveur:~/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx$ cat /home/alexandre/.cache/bazel/_bazel_alexandre/93bedb94245f10d899bd4ce902050079/server/jvm.out
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00007fcea090109e, pid=1171461, tid=1171475
#
# JRE version: OpenJDK Runtime Environment (Zulu11.29+3-CA) (11.0.2+7) (build 11.0.2+7-LTS)
# Java VM: OpenJDK 64-Bit Server VM (11.0.2+7-LTS, mixed mode, tiered, compressed oops, parallel gc, linux-amd64)
# Problematic frame:
# V  [libjvm.so+0xc5309e]  PerfLongVariant::sample()+0x1e
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/tensorflow-lissyx/hs_err_pid1171461.log
#
# If you would like to submit a bug report, please visit:
#   http://www.azulsystems.com/support/
#
applied-machinelearning commented 4 years ago

Ugh the nightmare of a build-system called "Bazel".

lissyx commented 4 years ago

Ugh the nightmare of a build-system called "Bazel".

I guess in this case it's just that I was running out of space on / because of Docker not properly pruning some resources.

lissyx commented 4 years ago

Side effect: I have to rebuild all my Docker images / containers ...

applied-machinelearning commented 4 years ago

Ah yes, you have to be careful with pruning, since every change from a buildfile is its own image layer due to the caching stuff. It works nicely for sparing space, but if you want to delete old stuff it can be a nightmare. I try to get accustomed to dumping the images that I care about as a tar file with everything included first, so I can restore that stuff if need be.

lissyx commented 4 years ago

Ah yes, you have to be careful with pruning, since every change from a buildfile is its own image layer due to the caching stuff. It works nicely for sparing space, but if you want to delete old stuff it can be a nightmare. I try to get accustomed to dumping the images that I care about as a tar file with everything included first, so I can restore that stuff if need be.

Indeed, I have been doing my housekeeping, but it seems it had not completely cleaned up some things :/. Anyway, I now have something that should have more debug info.

lissyx commented 4 years ago

alexandre@serveur:~/tmp/issue3088$ ll wheel_dst/tensorflow_gpu_local-1.15.0-cp37-cp37m-linux_x86_64.whl 
-rw-r--r-- 1 alexandre alexandre 1,9G 20 juil. 14:26 wheel_dst/tensorflow_gpu_local-1.15.0-cp37-cp37m-linux_x86_64.whl
lissyx commented 4 years ago

And at least I repro with this build as well.

lissyx commented 4 years ago

Nothing obvious pops:


[Switching to Thread 0x7ff74ffff700 (LWP 209659)]                                                                                                                                                           

Thread 526 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff71a00a5f0, rnnDesc=0x7ff748025900, xDesc=0x7ff748021870, x=0x7ff48da4dd00, hxDesc=0x7ff748017210, hx=0x7ff48b4d4300, cxDesc=0x7ff748006990, cx=0x7ff48b4d4300, wDesc=0x7ff748023f50, w=0x7ff492002900, yDesc=0x7ff74801f540, y=0x7ff48dce7d00, hyDesc=0x7ff748017210, hy=0x7ff48de0fd00, cyDesc=0x7ff748006990, cy=0x7ff48de13d00,   
    kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff4aa032900, workSpaceSizeInBytes=136609792, reserveSpace=0x7ff48de17d00, reserveSpaceSizeInBytes=6062080) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307                                                                                                                                             
2307    ./tensorflow/stream_executor/cuda/cudnn_7_6.inc: Aucun fichier ou dossier de ce type.                                                                                                               
(gdb) cont                                                                                                                                                                                                  
Continuing.                                                                                                                                                                                                 
[Thread 0x7ffeb4874700 (LWP 209805) exited]                                                                                                                                                                                                                                                                                                                                                                              

Thread 526 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7                                                                                                                                                                                                                                    
(gdb) cont                                                                                                                                                                                                                                                                                                                                                                                                               
Continuing.                                                                                                                                                                                                 
[Detaching after fork from child process 209995]                                                                                                                                                            
[Switching to Thread 0x7ff82d7fa700 (LWP 209615)]                                                                                                                                                           

Thread 482 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff8100081e0, rnnDesc=0x7ff1233fab70, xDesc=0x7ff1230fc010, x=0x7ff1a9a68800, hxDesc=0x7ff1233faa70, hx=0x7ff1a9d0b800, cxDesc=0x7ff123344d30, cx=0x7ff1a9d0b800, wDesc=0x7ff1230fbfd0, w=0x7ff1a150d800, yDesc=0x7ff1230fc050, y=0x7ff1a9d0f800, hyDesc=0x7ff1233faa70, hy=0x7ff1a9e3b800, cyDesc=0x7ff123344d30, cy=0x7ff1a9e3f800,   
    kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff1a9e43800, workSpaceSizeInBytes=139886624, reserveSpace=0x7ff1b23ab900, reserveSpaceSizeInBytes=6144000) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307                                                                                                                                             
2307    in ./tensorflow/stream_executor/cuda/cudnn_7_6.inc                                                                                                                                                                                                                                                                                                                                                               
(gdb)                                                                                                                                                                                                                                                                                                                                                                                                                    
Continuing.                                                                                                                                                                                                                                                                                                                                                                                                              

Thread 482 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7                                                                                                                                                                                                                                    
(gdb)                                                                                                                                                                                                                                                                                                                                                                                                                    
Continuing.                                                                  
Epoch 0 |   Training | Elapsed Time: 0:01:06 | Steps: 1 | Loss: 190.842316                                                                                                                                                                                                                                                                                                                                               
--------------------------------------------------------------------------------                                                                                                                                                                                                                                                                                                                                         
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000                                                                                                                                                                                                                                                                                                                                                

[...]

[Switching to Thread 0x7ff74effd700 (LWP 209661)]

Thread 528 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff71a00a5f0, rnnDesc=0x7ff748025900, xDesc=0x7ff74402f660, x=0x7ff48da5fc00, hxDesc=0x7ff7440293a0, hx=0x7ff48dd02c00, cxDesc=0x7ff744029310, cx=0x7ff48dd02c00, wDesc=0x7ff748023f50, w=0x7ff492002900, yDesc=0x7ff74402bfc0, y=0x7ff48dd06c00, hyDesc=0x7ff7440293a0, hy=0x7ff48de32c00, cyDesc=0x7ff744029310, cy=0x7ff48de36c00, 
    kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff4aa032900, workSpaceSizeInBytes=136609792, reserveSpace=0x7ff48de3ac00, reserveSpaceSizeInBytes=6144000) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307
2307    in ./tensorflow/stream_executor/cuda/cudnn_7_6.inc
(gdb) 
Continuing.
[Switching to Thread 0x7ff72effd700 (LWP 209668)]

Thread 535 "DeepSpeech.py" hit Breakpoint 1, cudnnRNNForwardTrainingEx (handle=0x7ff8100081e0, rnnDesc=0x7ff1233fab70, xDesc=0x7ff4290342e0, x=0x7ff1a9a56900, hxDesc=0x7ff168009dc0, hx=0x7ff1a9cf0900, cxDesc=0x7ff429001c60, cx=0x7ff1a9cf0900, wDesc=0x7ff1230fbfd0, w=0x7ff1a150d500, yDesc=0x7ff429006c40, y=0x7ff1a9cf4900, hyDesc=0x7ff168009dc0, hy=0x7ff1a9e1c900, cyDesc=0x7ff429001c60, cy=0x7ff1a9e20900, 
    kDesc=0x0, keys=0x0, cDesc=0x0, cAttn=0x0, iDesc=0x0, iAttn=0x0, qDesc=0x0, queries=0x0, workSpace=0x7ff1a9e24900, workSpaceSizeInBytes=139821088, reserveSpace=0x7ff1b237ca00, reserveSpaceSizeInBytes=6062080) at ./tensorflow/stream_executor/cuda/cudnn_7_6.inc:2307
2307    in ./tensorflow/stream_executor/cuda/cudnn_7_6.inc
(gdb) 
Continuing.
[Switching to Thread 0x7ff74effd700 (LWP 209661)]

Thread 528 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7
(gdb) 
Continuing.
[Switching to Thread 0x7ff72effd700 (LWP 209668)]

Thread 535 "DeepSpeech.py" hit Breakpoint 1, 0x00007ff981aa0d50 in cudnnRNNForwardTrainingEx () from /home/alexandre/Documents/codaz/Mozilla/DeepSpeech/CUDA-10.0/lib64/libcudnn.so.7
(gdb) 
Continuing.
2020-07-20 14:43:04.648417: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_de
sc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-20 14:43:04.648554: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 
applied-machinelearning commented 4 years ago

Bummer. If I read https://docs.nvidia.com/deeplearning/sdk/cudnn-release-notes/rel_7xx.html#rel_713, it seems there have been LSTM-related issues before that hang on specific sizes, in that case of the hidden state. But that was already fixed in all the cuDNN versions we tested. I still can't wrap my head around why the TF 1.14 image seems to behave differently; you kind of ruled out the cuDNN version. There have also been some changes to TF contrib/cudnn_rnn between v1.14 and v1.15, but my limited insight couldn't spot anything very amiss: https://github.com/tensorflow/tensorflow/commits/r1.15/tensorflow/contrib/cudnn_rnn

lissyx commented 4 years ago

There have also been some changes to TF contrib/cudnn_rnn between v1.14 and v1.15, but my limited insight couldn't spot anything very amiss:

I can always try and git bisect that ...

applied-machinelearning commented 4 years ago

The first step would be to check whether a custom-built TF 1.14 doesn't have the problem (with the 7.4.1.5 cuDNN and/or the newest). If so, it would point to a change in TF; if not ... nah, don't think about that yet ...

lissyx commented 4 years ago

The first step would be to check whether a custom-built TF 1.14 doesn't have the problem (with the 7.4.1.5 cuDNN and/or the newest). If so, it would point to a change in TF; if not ... nah, don't think about that yet ...

yeah that's what I'm doing ...

lissyx commented 4 years ago

OK, it passes with 1.14.1 + cuDNN 7.6 built locally. But a few patches are required; this is going to make git bisect slower than I would have liked.

lissyx commented 4 years ago

3c6e3868ac14fdbcaa24ddfb05624a0b55f60263 is the first bad commit
commit 3c6e3868ac14fdbcaa24ddfb05624a0b55f60263
Author: Ayush Dubey <ayushd@google.com>
Date:   Wed Aug 14 13:19:26 2019 -0700

    Ensure that an error is returned if a collective op runs with int32 on GPU.

    This change fixes a bug that would overwrite the error status with an OK status
    and cause a hang downstream.  It also adds a test that covers this scenario.

    PiperOrigin-RevId: 263414497

 .../common_runtime/base_collective_executor.cc     | 15 +++++++-------
 tensorflow/python/ops/collective_ops_gpu_test.py   | 23 ++++++++++++++++++++++
 2 files changed, 30 insertions(+), 8 deletions(-)
lissyx commented 4 years ago
3c6e3868ac14fdbcaa24ddfb05624a0b55f60263 is the first bad commit
commit 3c6e3868ac14fdbcaa24ddfb05624a0b55f60263
Author: Ayush Dubey <ayushd@google.com>
Date:   Wed Aug 14 13:19:26 2019 -0700

    Ensure that an error is returned if a collective op runs with int32 on GPU.

    This change fixes a bug that would overwrite the error status with an OK status
    and cause a hang downstream.  It also adds a test that covers this scenario.

    PiperOrigin-RevId: 263414497

 .../common_runtime/base_collective_executor.cc     | 15 +++++++-------
 tensorflow/python/ops/collective_ops_gpu_test.py   | 23 ++++++++++++++++++++++
 2 files changed, 30 insertions(+), 8 deletions(-)

That seems like a weird bad commit, I'll verify that tomorrow ...

lissyx commented 4 years ago
3c6e3868ac14fdbcaa24ddfb05624a0b55f60263 is the first bad commit
commit 3c6e3868ac14fdbcaa24ddfb05624a0b55f60263
Author: Ayush Dubey <ayushd@google.com>
Date:   Wed Aug 14 13:19:26 2019 -0700

    Ensure that an error is returned if a collective op runs with int32 on GPU.

    This change fixes a bug that would overwrite the error status with an OK status
    and cause a hang downstream.  It also adds a test that covers this scenario.

    PiperOrigin-RevId: 263414497

 .../common_runtime/base_collective_executor.cc     | 15 +++++++-------
 tensorflow/python/ops/collective_ops_gpu_test.py   | 23 ++++++++++++++++++++++
 2 files changed, 30 insertions(+), 8 deletions(-)

That seems like a weird bad commit, I'll verify that tomorrow ...

And yet, with r1.15 and this commit reverted, there is no more issue. So, is this commit bugged, or is it exposing a long-standing issue? On our side, in TensorFlow, or in cuDNN?

applied-machinelearning commented 4 years ago

Hmm, weird, and a small sigh; I had hoped it would have delivered a clearer and more pinpointed problem ... Any idea where this op would be used in the context of DeepSpeech and the max_sequence_length array and/or the hidden state? Perhaps it would be wise to try to get some help from TF people / NVIDIA based on this? We do have a commit and some Docker test cases with data that triggers the issue.

lissyx commented 4 years ago

Hmm, weird, and a small sigh; I had hoped it would have delivered a clearer and more pinpointed problem ...

I would have hoped as well

Any idea where this op would be used in the context of DeepSpeech and the max_sequence_length array and/or the hidden state?

Absolutely none. But it's interesting, because in the past we had to hack around a thing: https://github.com/tensorflow/tensorflow/issues/20369. It might be a long shot, but DT_INT32 + GPU also appears here.

Perhaps it would be wise to try to get some help from TF people / NVIDIA based on this?

That's the next step, yeah; I'd like to narrow down the repro steps as much as possible and sum them up. I still have not been able to get a clear understanding of the triggering condition, though, because with the previous hacks that changed the feature length from the offending 75 to another value, I could get valid passes with 75. So it's not really crystal clear to me that the issue is this specific value, and I need to better qualify what is happening here.
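
(For that summary, a rough, untested sketch of what a standalone trigger might look like, using the op configuration from the error message; it assumes the TF 1.15 tf.contrib.cudnn_rnn.CudnnLSTM API, a GPU build, and time-major inputs, and the mixed lengths are only a guess at the triggering condition.)

import numpy as np
import tensorflow as tf  # tensorflow-gpu 1.15.x

# Dimensions taken from the failing call:
# [num_layers, input_size, num_units, max_seq_length, batch_size] = [1, 2048, 2048, 75, 12]
max_time, batch_size, input_size, num_units = 75, 12, 2048, 2048

inputs = tf.placeholder(tf.float32, [max_time, batch_size, input_size])
seq_len = tf.placeholder(tf.int32, [batch_size])

lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=1, num_units=num_units)
outputs, _ = lstm(inputs, sequence_lengths=seq_len)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(outputs, feed_dict={
        inputs: np.random.rand(max_time, batch_size, input_size).astype(np.float32),
        # a padded batch with mixed lengths, as in the crashing training batches
        seq_len: np.array([63] * 6 + [75] * 6, dtype=np.int32),
    })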

lissyx commented 4 years ago
3c6e3868ac14fdbcaa24ddfb05624a0b55f60263 is the first bad commit
commit 3c6e3868ac14fdbcaa24ddfb05624a0b55f60263
Author: Ayush Dubey <ayushd@google.com>
Date:   Wed Aug 14 13:19:26 2019 -0700

    Ensure that an error is returned if a collective op runs with int32 on GPU.

    This change fixes a bug that would overwrite the error status with an OK status
    and cause a hang downstream.  It also adds a test that covers this scenario.

    PiperOrigin-RevId: 263414497

 .../common_runtime/base_collective_executor.cc     | 15 +++++++-------
 tensorflow/python/ops/collective_ops_gpu_test.py   | 23 ++++++++++++++++++++++
 2 files changed, 30 insertions(+), 8 deletions(-)

That seems like a weird bad commit, I'll verify that tomorrow ...

And yet, with r1.15 and this commit reverted, there is no more issue. So, is this commit bugged, or is it exposing a long-standing issue? On our side, in TensorFlow, or in cuDNN?

Bad news: it seems the issue is somewhat intermittent, and after a few retries with this reverted, it's back and still here ...

lissyx commented 4 years ago

I will restart bisection then and run it multiple times before calling good / bad ...

applied-machinelearning commented 4 years ago

That's the next step, yeah; I'd like to narrow down the repro steps as much as possible and sum them up. I still have not been able to get a clear understanding of the triggering condition, though, because with the previous hacks that changed the feature length from the offending 75 to another value, I could get valid passes with 75. So it's not really crystal clear to me that the issue is this specific value, and I need to better qualify what is happening here.

I agree, because batch C also gives 75 and that also passes.

What I am also wondering about is how it could work by artificially limiting the max_sequence_length, since the data we feed itself isn't changed. (I would have expected it to blow up, because the sequences now seem longer than the max_sequence_length, or does it just not process the last bit of padded or non-padded data, in which the culprit lies?)

Found some discussions around this whole padding topic with @Reuben posting there: https://github.com/tensorflow/tensorflow/issues/23269 https://github.com/mozilla/DeepSpeech/issues/885

A commit in TF 1.15-rc0 also seemed more interesting than what the bisection came up with: https://github.com/tensorflow/tensorflow/commit/9380a41290e8fb8b9ea85f614472deab56dbc481#diff-8e54a26c3d435aad346bfa12f4c6ec79

Another interesting DS change mingling with the batches could be: https://github.com/mozilla/DeepSpeech/commit/6b1d6773de25aaf1c1c157f8c11ecdd727f00c6d

Especially these lines: I can't see any changes or explanation in the usage of the values returned from create_dataset(), so why are the output_types changed? https://github.com/mozilla/DeepSpeech/commit/6b1d6773de25aaf1c1c157f8c11ecdd727f00c6d#diff-2f5b069cc3a96ce123ef7356642acb29R143-R145 But I'm not that familiar with the code, so I'm likely missing something. EDIT: hmm, it seems I managed to miss the map() a few lines below and the changes to entry_to_features().

lissyx commented 4 years ago

What I am also wondering about is how it could work by artificially limiting the max_sequence_length, since the data we feed itself isn't changed. (I would have expected it to blow up, because the sequences now seem longer than the max_sequence_length, or does it just not process the last bit of padded or non-padded data, in which the culprit lies?)

There should be TensorFlow code that already takes care of that. Now, maybe, for some reason, it's not working as expected in this case? Anyway, in the current state, we don't yet have any criterion to do so.
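
(To make that concrete, a small NumPy illustration of what the feeding code hands the RNN: a batch padded to the longest example plus a per-example length vector. Shrinking the reported lengths leaves the padded tensor untouched and only asks the RNN to ignore the trailing frames. The 26 features per frame are DeepSpeech's MFCC default; the lengths are just the ones discussed above.)

import numpy as np

n_features = 26                        # MFCC features per frame (DeepSpeech default)
lengths = np.array([63, 70, 75])       # frames per example; 75 is the batch max

# Padded batch: shape [batch_size, max_seq_length, n_features]
batch = np.zeros((len(lengths), lengths.max(), n_features), dtype=np.float32)
for i, n in enumerate(lengths):
    batch[i, :n, :] = np.random.rand(n, n_features)  # real frames; the rest stays zero

# The "- 1" hack changes only what is reported here, not the padded data itself.
reported = lengths - 1
print(batch.shape, reported)           # (3, 75, 26) [62 69 74]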

lissyx commented 4 years ago

Interesting. I'm re-doing the bisect, with more runs on each test to ensure I avoid any intermittent behavior. Maybe with some luck this will pop up. (Unfortunately your direct link just gives me the PR, not the direct diff you expected, so I'm not sure which part of the PR you mean.)

Another interesting DS change mingling with the batches could be: 6b1d677

Have you experimented before / after this commit ?

lissyx commented 4 years ago

Re-doing bisect yields:

24297a4cb9120351643f7ac3916e7398236ccc0d is the first bad commit
commit 24297a4cb9120351643f7ac3916e7398236ccc0d
Author: Kaixi Hou <kaixih@nvidia.com>
Date:   Fri Jul 19 13:41:25 2019 -0700

    use padded IO for cudnn rnn only when necessary

 tensorflow/core/kernels/cudnn_rnn_ops.cc           | 42 +++++++++++++++++-----
 tensorflow/stream_executor/cuda/cuda_dnn.cc        | 13 ++++---
 tensorflow/stream_executor/cuda/cuda_dnn.h         |  3 +-
 tensorflow/stream_executor/dnn.h                   |  4 ++-
 .../stream_executor/stream_executor_pimpl.cc       |  5 +--
 tensorflow/stream_executor/stream_executor_pimpl.h |  3 +-
 6 files changed, 52 insertions(+), 18 deletions(-)

https://github.com/tensorflow/tensorflow/commit/24297a4cb9120351643f7ac3916e7398236ccc0d https://github.com/tensorflow/tensorflow/pull/30889

I'll see how much that holds.

lissyx commented 4 years ago

Re-doing bisect yields:

24297a4cb9120351643f7ac3916e7398236ccc0d is the first bad commit
commit 24297a4cb9120351643f7ac3916e7398236ccc0d
Author: Kaixi Hou <kaixih@nvidia.com>
Date:   Fri Jul 19 13:41:25 2019 -0700

    use padded IO for cudnn rnn only when necessary

 tensorflow/core/kernels/cudnn_rnn_ops.cc           | 42 +++++++++++++++++-----
 tensorflow/stream_executor/cuda/cuda_dnn.cc        | 13 ++++---
 tensorflow/stream_executor/cuda/cuda_dnn.h         |  3 +-
 tensorflow/stream_executor/dnn.h                   |  4 ++-
 .../stream_executor/stream_executor_pimpl.cc       |  5 +--
 tensorflow/stream_executor/stream_executor_pimpl.h |  3 +-
 6 files changed, 52 insertions(+), 18 deletions(-)

I'll see how much that holds.

Five runs of an r1.15 build without this patch work like a charm on the repro case. I'm running 20 more, but if that holds, it means we actually have something much more actionable now.
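
(Taking the commit title at face value, "use padded IO for cudnn rnn only when necessary" presumably boils down to a check along these lines; this is a guess from the title only, not from reading the diff, but it would tie the bisect result back to the earlier padding discussion.)

def uses_padded_io(sequence_lengths, max_seq_length):
    # Hypothetical reading of the commit: the padded-IO cuDNN path is taken
    # only when at least one sequence is shorter than the padded length.
    return any(n < max_seq_length for n in sequence_lengths)

print(uses_padded_io([75, 75, 75], 75))  # False -> non-padded path
print(uses_padded_io([63, 70, 75], 75))  # True  -> padded path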

applied-machinelearning commented 4 years ago

Ahh this one does sound related :+1: