mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 31 , 12, 2048] #3088

Closed andrenatal closed 4 years ago

andrenatal commented 4 years ago

For support and discussions, please use our Discourse forums.

If you've found a bug, or have a feature request, then please create an issue with the following information:

set -xe

apt-get install -y python3-venv libopus0

python3 -m venv /tmp/venv

source /tmp/venv/bin/activate

pip install -U setuptools wheel pip

pip install .

pip uninstall -y tensorflow

pip install tensorflow-gpu==1.14

mkdir -p ../keep/summaries

data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"

python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr \
  --use_allow_growth \
  --log_level 0


I'm getting the following error when training on my pt-BR 8 kHz dataset. I have tried downgrading and upgrading CUDA, cuDNN, the NVIDIA drivers, and Ubuntu (16 and 18), and the error persists. I have tried two datasets with different clip lengths, 6 s and 15 s; both contain 8 kHz audio.
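
To rule out stray clips, the sample rate and duration claims above can be checked directly against the training CSV. This is a minimal sketch using only the Python standard library; it assumes the usual DeepSpeech CSV columns (wav_filename, wav_filesize, transcript) and uncompressed PCM WAV files. The function names and the 15 s limit are illustrative, not part of DeepSpeech.

```python
import csv
import wave


def describe_wav(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnframes() / float(rate)


def find_mismatched_clips(csv_path, expected_rate=8000, max_seconds=15.0):
    """List (filename, rate, duration) for clips that break the assumptions.

    expected_rate and max_seconds are hypothetical defaults chosen to match
    the 8 kHz / up to 15 s description in this report.
    """
    bad = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            rate, seconds = describe_wav(row["wav_filename"])
            if rate != expected_rate or seconds > max_seconds:
                bad.append((row["wav_filename"], rate, seconds))
    return bad
```

An empty list from find_mismatched_clips() would suggest the CSV really does contain only 8 kHz clips within the length limit.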

andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts

W0618 12:30:10.324707 139639980619584 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3
2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y
2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y
2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y
2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N
2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
D Session opened.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Epoch 0 | Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303
2020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
Traceback (most recent call last):
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
     [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
     [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
     [[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
    absl.app.run(main)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
    train()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set
    feed_dict=feed_dict)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
     [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
     [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
    absl.app.run(main)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
    train()
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(*args)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

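The "model config" in the error message is easier to read once it is decoded into named fields. The following is a rough sketch, not part of DeepSpeech or TensorFlow: parse_model_config() is a hypothetical helper, and the enum names follow the cuDNN definitions of cudnnRNNMode_t, cudnnRNNInputMode_t and cudnnDirectionMode_t.

```python
import re

# Enum values as defined by cuDNN (prefixes such as CUDNN_ dropped for brevity).
RNN_MODES = {0: "RNN_RELU", 1: "RNN_TANH", 2: "LSTM", 3: "GRU"}
INPUT_MODES = {0: "LINEAR_INPUT", 1: "SKIP_INPUT"}
DIRECTION_MODES = {0: "UNIDIRECTIONAL", 1: "BIDIRECTIONAL"}

# Matches the two bracketed groups that TensorFlow prints in the error message.
CONFIG_RE = re.compile(
    r"\[rnn_mode, rnn_input_mode, rnn_direction_mode\]:"
    r"\s*(\d+),\s*(\d+),\s*(\d+)\s*,\s*"
    r"\[([a-z_, ]+)\]:\s*\[([\d, ]+)\]"
)


def parse_model_config(message):
    """Decode the model config of a 'Failed to call ThenRnnForward' message.

    Returns a dict of named fields, or None if the message has no config.
    """
    match = CONFIG_RE.search(message)
    if match is None:
        return None
    mode, input_mode, direction = (int(match.group(i)) for i in (1, 2, 3))
    names = [name.strip() for name in match.group(4).split(",")]
    values = [int(value) for value in match.group(5).split(",")]
    config = dict(zip(names, values))
    config["rnn_mode"] = RNN_MODES.get(mode, mode)
    config["rnn_input_mode"] = INPUT_MODES.get(input_mode, input_mode)
    config["rnn_direction_mode"] = DIRECTION_MODES.get(direction, direction)
    return config
```

Applied to the message above, this reads as a single-layer unidirectional LSTM with 2048 units, a batch size of 12 and a maximum sequence length of 63.
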
lissyx commented 4 years ago

Still no repro with this reverted after 20 tries, I guess we are on track now.

lissyx commented 4 years ago

revert-24297a4cb9120351643f7ac3916e7398236ccc0d.patch.txt

@applied-machinelearning reverting this is a bit non-trivial; here is the diff for it, if you are willing to rebuild TensorFlow GPU completely (might take some hours depending on your hardware)

applied-machinelearning commented 4 years ago

revert-24297a4cb9120351643f7ac3916e7398236ccc0d.patch.txt

@applied-machinelearning reverting this is a bit non-trivial; here is the diff for it, if you are willing to rebuild TensorFlow GPU completely (might take some hours depending on your hardware)

Do you have a Docker build file for that lying around, or did you do it on bare metal?

lissyx commented 4 years ago

revert-24297a4cb9120351643f7ac3916e7398236ccc0d.patch.txt

@applied-machinelearning reverting this is a bit non-trivial; here is the diff for it, if you are willing to rebuild TensorFlow GPU completely (might take some hours depending on your hardware)

Do you have a Docker build file for that lying around, or did you do it on bare metal?

I did it on bare metal. I'll continue tomorrow to see if we can act from the Python side instead of rebuilding.

lissyx commented 4 years ago

Looks like it's not something we have leverage on from Python code. I'll run a few checks to see if there is anything obvious, and if not, we'll follow up with a TensorFlow issue.

lissyx commented 4 years ago

Another weird behavior:

So maybe it needs more digging into that patch itself.

applied-machinelearning commented 4 years ago

Interestingly enough from the pull request it seems they also had intermittent test failures, but it seems they were not addressed.

With a stock TF 1.15, the error message indicates we are running cudnnRNNForwardTrainingEx() when it breaks. The cuDNN docs indicate it is the version of the function for padded data: https://docs.nvidia.com/deeplearning/sdk/cudnn-archived/cudnn_765/cudnn-api/index.html#cudnnRNNForwardTrainingEx

We don't get CUDNN_STATUS_BAD_PARAM back, so it seems to accept the parameters listed there and does not blow up immediately on those. We get CUDNN_STATUS_EXECUTION_FAILED.

It also lists some conditions for the data layout:

This routine is the extended version of the cudnnRNNForwardTraining() function. The cudnnRNNForwardTrainingEx() allows the user to use unpacked (padded) layout for input x and output y.

In the unpacked layout, each sequence in the mini-batch is considered to be of fixed length, specified by maxSeqLength in its corresponding RNNDataDescriptor. Each fixed-length sequence, for example, the nth sequence in the mini-batch, is composed of a valid segment specified by the seqLengthArray[n] in its corresponding RNNDataDescriptor; and a padding segment to make the combined sequence length equal to maxSeqLength.
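
As a rough illustration of the unpacked (padded) layout described in that excerpt, here is a small pure-Python sketch. pad_batch() is a hypothetical helper, not a cuDNN or TensorFlow API; it only mirrors the roles of maxSeqLength and seqLengthArray.

```python
def pad_batch(sequences, pad_value=0.0):
    """Pad variable-length sequences to one common length.

    Returns (padded, seq_lengths, max_seq_length):
      - seq_lengths plays the role of seqLengthArray (valid segment per sequence),
      - max_seq_length plays the role of maxSeqLength,
      - every row of padded is its valid segment followed by a padding segment.
    """
    if not sequences:
        return [], [], 0
    seq_lengths = [len(seq) for seq in sequences]
    max_seq_length = max(seq_lengths)
    padded = [
        list(seq) + [pad_value] * (max_seq_length - len(seq))
        for seq in sequences
    ]
    return padded, seq_lengths, max_seq_length
```

For a mini-batch with lengths 3 and 1, every row ends up with length 3, and the second row carries two padding entries after its single valid step.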

And also for the order within the mini-batch in a special case:

With the unpacked layout, both sequence major (meaning, time major) and batch major are supported. For backward compatibility, the packed sequence major layout is supported. However, similar to the non-extended function cudnnRNNForwardTraining(), the sequences in the mini-batch need to be sorted in descending order according to length.

My interpretation of the last piece would be: you can stuff packed/unpadded sequences into cudnnRNNForwardTrainingEx(), although it is meant for unpacked/padded data, on the premise that the sequences are sorted in descending order according to length.
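
To make that ordering constraint concrete, here is a hedged pure-Python sketch of a packed, sequence-major layout. Both functions are hypothetical illustrations and not what cuDNN or TensorFlow do internally; they only show why packing without padding relies on descending lengths.

```python
def is_sorted_descending(seq_lengths):
    """True if every length is >= the one that follows it."""
    return all(a >= b for a, b in zip(seq_lengths, seq_lengths[1:]))


def pack_time_major(sequences):
    """Pack sequences time step by time step, without padding.

    Returns (packed, batch_sizes), where batch_sizes[t] is the number of
    sequences still active at time step t. Because lengths are descending,
    the active sequences at each step are always a prefix of the batch.
    """
    lengths = [len(seq) for seq in sequences]
    if not is_sorted_descending(lengths):
        raise ValueError("sequences must be sorted by descending length")
    packed = []
    batch_sizes = []
    for t in range(lengths[0] if lengths else 0):
        step = [seq[t] for seq in sequences if len(seq) > t]
        packed.extend(step)
        batch_sizes.append(len(step))
    return packed, batch_sizes
```

With lengths 3, 2 and 1 the per-step batch sizes shrink as 3, 2, 1; an unsorted batch is rejected instead of being packed incorrectly.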

Which function did your test blow up with when "forcing ShouldUsePaddedIO to return false": cudnnRNNForwardTrainingEx() or cudnnRNNForwardTraining(), so the extended or the non-extended version?

Unfortunately it's quite a pain to get stuff printed at some interesting places ... the Docker images you made (issue3088_7.6.5.3 etc.) do output a lot of extra internal data in their logs, but I find it hard to interpret.

lissyx commented 4 years ago

Which function did your test blow up with when "forcing ShouldUsePaddedIO to return false": cudnnRNNForwardTrainingEx() or cudnnRNNForwardTraining(), so the extended or the non-extended version?

It looks very much the same:

2020-07-22 10:20:21.132203: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 10:20:21.132242: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]

This is with forcing it this way:

diff --git a/tensorflow/core/kernels/cudnn_rnn_ops.cc b/tensorflow/core/kernels/cudnn_rnn_ops.cc
index 4a27394f28..c3aa386ea7 100644
--- a/tensorflow/core/kernels/cudnn_rnn_ops.cc
+++ b/tensorflow/core/kernels/cudnn_rnn_ops.cc
@@ -1463,8 +1463,8 @@ class CudnnRNNForwardOp<GPUDevice, T> : public CudnnRNNKernelCommon {
                                   context, model_types(), time_major, &input,
                                   &input_h, &input_c, &params,
                                   &sequence_lengths, num_proj, &model_shapes));
-      use_padded_io =
-          ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
+      use_padded_io = false;
+          // ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
     } else {
       OP_REQUIRES_OK(context,
                      ExtractForwardInput(context, model_types(), time_major,
@@ -1863,8 +1863,8 @@ class CudnnRNNBackwardOp<GPUDevice, T> : public CudnnRNNKernelCommon {
                                   context, model_types(), time_major, &input,
                                   &input_h, &input_c, &params,
                                   &sequence_lengths, num_proj, &model_shapes));
-      use_padded_io =
-          ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
+      use_padded_io = false;
+          // ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
     } else {
       OP_REQUIRES_OK(context,
                      ExtractForwardInput(context, model_types(), time_major,

lissyx commented 4 years ago

Not what I would have expected, but forcing true for use_padded_io does indeed ... work? I have four retries that are working. It's ... weird.

lissyx commented 4 years ago

Not what I would have expected, but forcing true for use_padded_io does indeed ... work? I have four retries that are working. It's ... weird.

Or it's consistent: our data requires padding, so it has to go through padded IO.

lissyx commented 4 years ago

I STARTING Optimization                        
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 11:09:21.092508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                          
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74                                                            
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                 
ShouldUsePaddedIO time_major=1                                                               
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                        
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                          
ShouldUsePaddedIO time_major=1                                                 
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74             
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                          
Epoch 0 |   Training | Elapsed Time: 0:00:05 | Steps: 1 | Loss: 189.949219                                                                                                                                                                                                                                                                                                                                               
--------------------------------------------------------------------------------                                                                                                                            
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74                
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true
ShouldUsePaddedIO time_major=1                                                                                                                                                                              
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=true all_max_seq_length=false                           
2020-07-22 11:09:30.528222: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                          
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 11:09:30.528291: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1522 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 
lissyx commented 4 years ago

I also captured the debug output I added, this time on a non-repro case:


I STARTING Optimization                                                                                                                                                                                                                                                                                                                                                                                                  
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 11:30:10.205219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
Epoch 0 |   Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 189.949173                                                                                                                                                                                                                                                                                                                                               
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                     
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                          
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 107.969971                                                                                                                                                                                                                                                                                                                                               
Epoch 2 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                              
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
Epoch 2 |   Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 73.916595                                                                                                                                                                                                                                                                                                                                                
I FINISHED optimization in 0:00:03.484708                                                                                                                                                                   
D Session closed.                                                                                                                                                                                           
lissyx commented 4 years ago

Those two logs seem to fit your analysis, @applied-machinelearning: it crashes when max_seq_length=75 is not the first value we push.
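
To check this reading against other runs, the per-call `max_seq_length` values can be pulled out of the debug output in call order. The following is a minimal Python sketch, not part of DeepSpeech or TensorFlow: the helper names `max_seq_lengths_per_call` and `first_is_largest` are hypothetical, and the "first value must be the largest" rule is only the hypothesis discussed in this thread, not a confirmed cuDNN requirement.

```python
import re
from typing import List

# Match only on "time_major=" so that lines whose leading "S" was wrapped
# onto the previous line ("houldUsePaddedIO time_major=1") still count.
CALL_START = re.compile(r"time_major=")
MAX_LEN = re.compile(r"model_shapes\.max_seq_length=(\d+)")


def max_seq_lengths_per_call(log_text: str) -> List[int]:
    """Return one max_seq_length per ShouldUsePaddedIO call, in call order."""
    values: List[int] = []
    expecting = False
    for line in log_text.splitlines():
        if CALL_START.search(line):
            expecting = True  # a new ShouldUsePaddedIO call begins here
            continue
        match = MAX_LEN.search(line)
        if expecting and match:
            values.append(int(match.group(1)))
            expecting = False  # ignore repeated prints within the same call
    return values


def first_is_largest(values: List[int]) -> bool:
    """True when no later call uses a larger max_seq_length than the first.

    This encodes the hypothesis from this thread only.
    """
    return not values or values[0] == max(values)
```

Running `first_is_largest(max_seq_lengths_per_call(log))` on a captured log flags the runs in which a larger `max_seq_length` shows up after a smaller first one, which is the pattern suspected here.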

applied-machinelearning commented 4 years ago

Would it help to not short-circuit the loop in ShouldUsePaddedIO() and let it print every sequence size in that mini-batch, in order, before returning, so the picture is completely clear (or print the whole seq_array at the start)?

It also doesn't seem to need the mini-batch to be completely sorted: in the non-repro case, [75, 74, 74, 75, 74, 74] works as well. So it looks like at least the first max_sequence_length should be the largest, or one of the largest, for the mini-batch?
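
A minimal Python sketch of that suggestion, for illustration only. It models the decision the debug output above appears to reflect (rv is false only when time_major=1 and every entry equals max_seq_length) but logs every entry instead of stopping at the first mismatch. The function `should_use_padded_io` and its return rule are assumptions inferred from the logs in this thread, not TensorFlow's actual source.

```python
from typing import List, Tuple


def should_use_padded_io(seq_array: List[int], max_seq_length: int,
                         time_major: bool) -> Tuple[bool, List[str]]:
    """Hypothetical model of ShouldUsePaddedIO that never short-circuits.

    Returns the decision and the debug lines it would print.
    """
    lines = [f"ShouldUsePaddedIO time_major={int(time_major)}"]
    all_max_seq_length = True
    for i, seq_len in enumerate(seq_array):
        lines.append(f"ShouldUsePaddedIO seq_array[{i}]={seq_len}")
        if seq_len != max_seq_length:
            # Record the mismatch but keep looping so every entry is logged.
            all_max_seq_length = False
    # Assumed rule: padded IO is skipped only for time-major input in which
    # all sequences already have the maximum length.
    rv = not (time_major and all_max_seq_length)
    lines.append(
        f"ShouldUsePaddedIO rv={str(rv).lower()} "
        f"all_max_seq_length={str(all_max_seq_length).lower()}"
    )
    return rv, lines
```

For example, `should_use_padded_io([74, 75], 75, True)` reports both entries before returning, so the order of sequence lengths inside the mini-batch is visible in one place.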

lissyx commented 4 years ago

> Would it help to not short-circuit the loop in ShouldUsePaddedIO() and let it print every sequence size in that mini-batch, in order, before returning, so the picture is completely clear (or print the whole seq_array at the start)?

Those latest logs were not produced by forcing any return value from ShouldUsePaddedIO.

lissyx commented 4 years ago

Here is some output with the whole seq_array printed:


I STARTING Optimization                                                                            
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 13:18:12.007877: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp    
ShouldUsePaddedIO time_major=1                                                                                                                                                                              
ShouldUsePaddedIO seq_array[0]=74           
ShouldUsePaddedIO seq_array[1]=75                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=true all_max_seq_length=false
CudnnRNNBackwardOp                                                           
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false
CudnnRNNForwardOp        
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74                          
ShouldUsePaddedIO seq_array[1]=74              
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNBackwardOp                                                                                                                                                                                          
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74                                                                 
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74                                                             
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                        
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                          
Epoch 0 |   Training | Elapsed Time: 0:00:05 | Steps: 1 | Loss: 189.949219                                                                                                                                                                                                                                                                                                                                               
--------------------------------------------------------------------------------                                                                                                                            
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp                                                                                                                                                                                           
ShouldUsePaddedIO time_major=1                                    
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                           
ShouldUsePaddedIO seq_array[1]=74                  
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74 
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                          
CudnnRNNForwardOp            
ShouldUsePaddedIO time_major=1                                                                                                                                                                              
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=true all_max_seq_length=false                           
2020-07-22 13:18:20.837671: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                          
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
lissyx commented 4 years ago

Hm, it's not that clear. Here is a log after reversing the order in which we read the CSV file. As you can see, 75 is now the first one, and yet it fails:


I STARTING Optimization                                                                                                                                                                                                                                                                                                                                                                                                  
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 15:01:00.634318: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp                                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
CudnnRNNBackwardOp                                                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
CudnnRNNForwardOp                                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=75                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
CudnnRNNBackwardOp                                                                                                                                                                                                                                                                                                                                                                                                       
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=75                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
Epoch 0 |   Training | Elapsed Time: 0:00:05 | Steps: 1 | Loss: 190.842316                                                                                                                                                                                                                                                                                                                                               
--------------------------------------------------------------------------------                                                                                                                                                                                                                                                                                                                                         
Epoch 1 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp                                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=75                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=true all_max_seq_length=false                                                                                                                                                                                                                                                                                                                                                                       
2020-07-22 15:01:09.398352: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                                                                                                                                                                                                                                       
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 15:01:09.398438: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]                                                
CudnnRNNForwardOp                                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO time_major=1                                                                                                                                                                                                                                                                                                                                                                                           
ShouldUsePaddedIO seq_array[0]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO seq_array[1]=74                                                                                                                                                                                                                                                                                                                                                                                        
ShouldUsePaddedIO [0]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO [1]: seq_array[i]=74                                                                                                                                                                                                                                                                                                                                                                                   
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74                                                                                                                                                                                                                                                                                                                                                                    
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                                                                                                                                                                                                                                       
applied-machinelearning commented 4 years ago

Hm, it's not that clear. Here is a log after reversing the ordering when we read the CSV file. As you can see, 75 is now the first one and yet it fails:

Are you sure ? The mini-batch sequence array still seems [74, 75] and not [75, 74] ?

lissyx commented 4 years ago

Hm, it's not that clear. Here is a log after reversing the ordering when we read the CSV file. As you can see, 75 is now the first one and yet it fails:

Are you sure ? The mini-batch sequence array still seems [74, 75] and not [75, 74] ?

Are you referring to those lines?

ShouldUsePaddedIO seq_array[0]=74 ShouldUsePaddedIO seq_array[1]=75

applied-machinelearning commented 4 years ago

Yes.

lissyx commented 4 years ago

Well, I'm not 100% sure, because we also have logs where the seq_array has this ordering and it succeeds.

lissyx commented 4 years ago

Reversing the order, and it still explodes:


I STARTING Optimization                                                                                                                                                                                                                                                                                                                                                                                                  
Epoch 0 |   Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 15:59:45.113520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
generate_values <deepspeech_training.util.sample_collections.CSV object at 0x7f63846338d0>
yield generate_values 0 csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a14d0>
yield generate_values 1 csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a15d0>
yield generate_values 2 csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1150>
yield generate_values 3 csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1550>
yield generate_values 4 csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1750>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
yield generate_values 5 csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1190>
CudnnRNNForwardOp       
ShouldUsePaddedIO time_major=1                                                                    
ShouldUsePaddedIO seq_array[0]=75
ShouldUsePaddedIO seq_array[1]=75                                                                  
ShouldUsePaddedIO [0]: seq_array[i]=75                                                       
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO [1]: seq_array[i]=75                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=false all_max_seq_length=true                                           
CudnnRNNBackwardOp                                                                                                                                                                                          
ShouldUsePaddedIO time_major=1                                                 
ShouldUsePaddedIO seq_array[0]=75                                                                                                                                                                           
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=75                                                                                                                                                                      
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75             
ShouldUsePaddedIO [1]: seq_array[i]=75                                                                                                                                                                      
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=false all_max_seq_length=true                                                                                                                                                          
Epoch 0 |   Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 189.891312
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
CudnnRNNForwardOp                                     
ShouldUsePaddedIO time_major=1                                                                                                                                                                              
ShouldUsePaddedIO seq_array[0]=75
ShouldUsePaddedIO seq_array[1]=74                                                                                                                                                                           
ShouldUsePaddedIO [0]: seq_array[i]=75
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=75                                                                                                                                                       
ShouldUsePaddedIO rv=true all_max_seq_length=false                           
2020-07-22 15:59:48.820700: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED                                                                                                          
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 15:59:48.820756: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048] 
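
The debug prints above suggest how the padded-I/O decision is made: the result is false only when every entry of seq_array equals max_seq_length. A minimal Python sketch of that decision, reconstructed only from the log output and not from the TensorFlow source, is shown here. The function name mirrors the log prefix, and the role of time_major is an assumption, since every logged call has time_major=1.

```python
def should_use_padded_io(seq_lengths, max_seq_length, time_major):
    """Mirror the decision visible in the ShouldUsePaddedIO debug prints."""
    # True only if no sequence in the mini-batch is shorter than the maximum.
    all_max_seq_length = all(n == max_seq_length for n in seq_lengths)
    # Assumption: padded I/O is skipped only for time-major, full-length batches.
    return not (time_major and all_max_seq_length)


# The two forward calls from the log: the first succeeds, the second fails.
print(should_use_padded_io([75, 75], 75, True))
print(should_use_padded_io([75, 74], 75, True))
```

In the failing step the batch mixes lengths 75 and 74, so it is the first call in this log that takes the padded-I/O path.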
lissyx commented 4 years ago

@applied-machinelearning I don't know what you think, but this defies all the assumptions I can make based on the cudnn API doc and what we observe. Something is missing to explain what triggers the issue, and so far, hacking the ordering does not seem to be the real trigger here, but I can't figure it out. With just "CUDNN_STATUS_EXECUTION_FAILED" as feedback and no really usable debug information because of the closedness of CUDA, I don't see how we can investigate more without wasting our time on that.

applied-machinelearning commented 4 years ago

I think all of this should be enough to file a bug and directly ping the committer of the bisected commit and of the TF var_length_sequence stuff, as they can also double check the CUDNN code and have more insight into all the (data) requirements.

Ah I see you just did that, apart from mentioning the Nvidia committer (could be worthwhile to get some attention from the relevant people faster).

lissyx commented 4 years ago

I think it all of this should be enough to issue a bug and directly ping the committer from the bisected commit and the TF var_length_sequence stuff. As they also can double check the CUDNN code and have more insight in all the (data) requirements.

Ah I see you just did that, apart from mentioning the Nvidia committer (could be worthwhile to get some attention from the relevant people faster).

Indeed, I was preparing extra info and pinged this person as well. Let's hope they can quickly check on their side and come back to us.

applied-machinelearning commented 4 years ago

Great and thanks again for all your effort so far !

lissyx commented 4 years ago

@applied-machinelearning So, we've got some feedback from the nvidia dev, and it seems TF_CUDNN_RESET_RND_GEN_STATE=1 does help here. I'm unsure of the implications, especially in terms of performance, but maybe you can give that a try on your full dataset, this could help us assert:

* it is indeed related to the issue
* have an idea of the perf impact

applied-machinelearning commented 4 years ago

@applied-machinelearning So, we've got some feedback from the nvidia dev, and it seems TF_CUDNN_RESET_RND_GEN_STATE=1 does help here. I'm unsure of the implications, especially in term of performances, but maybe you can give that a try on your full dataset, this could help us assert:

* it is indeed related to the issue
* have an idea of the perf impact

Was away from keyboard this weekend, running tests now. The short tests work with that ENV variable set, now running the longer one. Edit: The long test also works.

lissyx commented 4 years ago

@applied-machinelearning So, we've got some feedback from the nvidia dev, and it seems TF_CUDNN_RESET_RND_GEN_STATE=1 does help here. I'm unsure of the implications, especially in term of performances, but maybe you can give that a try on your full dataset, this could help us assert:

* it is indeed related to the issue
* have an idea of the perf impact

Was away from keyboard this weekend, running tests now. The short tests work with that ENV variable set, now running the longer one. Edit: The long test also works.

I'm presently running one or two training epochs with TF_CUDNN_RESET_RND_GEN_STATE=0 / TF_CUDNN_RESET_RND_GEN_STATE=1 to assess the impact.

lissyx commented 4 years ago

So I could not spot any huge difference: only a 20-second slowdown per epoch.
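
A comparison like this can be scripted. The sketch below times one command under each value of the variable. `time_with_flag` is a hypothetical helper, and the command shown is only a stand-in for whatever training invocation is being measured.

```python
import os
import subprocess
import sys
import time

FLAG = "TF_CUDNN_RESET_RND_GEN_STATE"


def time_with_flag(cmd, value):
    """Run cmd with FLAG set to value and return the wall-clock seconds."""
    env = dict(os.environ)
    env[FLAG] = str(value)
    start = time.perf_counter()
    subprocess.run(cmd, env=env, check=True)
    return time.perf_counter() - start


# Placeholder command; replace it with the real training command line.
cmd = [sys.executable, "-c", "pass"]
for value in (0, 1):
    print(value, round(time_with_flag(cmd, value), 2))
```

A single run per setting is noisy, so a difference as small as the one reported would need a few repetitions before reading much into it.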

applied-machinelearning commented 4 years ago

I think you are thinking about setting this environment var from the DS code as a workaround for not getting a TF 1.15.4 release ?

(I think it's not very wise keeping TF 1.15 that broken in the first place, it wastes a lot of resource everywhere from people having their training go bust and perhaps trying to debug that again (for all projects and people still using TF 1.15 with LSTM), while it is a straight and simple fix, so it would be a nice "reward" for digging in this and fixing this thing which was uncaught for so many releases), but that is my not so humble opinion about this.

Back to the environment var: If I remember correctly from looking at the code, it influences some kind of "dropout" and, as an extra, busts the cache (which is what makes things work for us), but I don't know what influence changing that specific dropout behavior has on training the model. It would be nice if the TF / Nvidia guys could comment on that, before we perhaps degrade DS training by missing any side effects.

lissyx commented 4 years ago

I think you are thinking about setting this environment var from the DS code as a workaround for not getting a TF 1.15.4 release ?

At least know if it's a good thing to debug people with that or if we are creating underlying issues.

(I think it's not very wise keeping TF 1.15 that broken in the first place, it wastes a lot of resource everywhere from people having their training go bust and perhaps trying to debug that again (for all projects and people still using TF 1.15 with LSTM), while it is a straight and simple fix, so it would be a nice "reward" for digging in this and fixing this thing which was uncaught for so many releases), but that is my not so humble opinion about this.

Sure, but it's not in our hands nor in the hands of people who will review the PR, there's a policy and they might have their hands tied.

Back to the environment var: If remember correctly from looking at the code, it influenced some kind of "dropout" and as extra busted the cache (which causes things to work for us), but I don't know what the influence of changing that specific dropout behavior has on training the model. Would be nice if the TF / Nvidia guys can give some comment on that, before we perhaps DS degrade training by missing any side effects.

Exactly.

applied-machinelearning commented 4 years ago

I think you are thinking about setting this environment var from the DS code as a workaround for not getting a TF 1.15.4 release ? At least know if it's a good thing to debug people with that or if we are creating underlying issues.

(I think it's not very wise keeping TF 1.15 that broken in the first place, it wastes a lot of resource everywhere from people having their training go bust and perhaps trying to debug that again (for all projects and people still using TF 1.15 with LSTM), while it is a straight and simple fix, so it would be a nice "reward" for digging in this and fixing this thing which was uncaught for so many releases), but that is my not so humble opinion about this.

Sure, but it's not in our hands nor in the hands of people who will review the PR, there's a policy and they might have their hands tied.

That's true, perhaps it's my Dutch heritage that policies are nice when they make sense ;) I'm also fascinated by how little help you get in implementing the requested test, which essentially blocks the patch; most open source communities I have encountered so far are happy when you fix or even pinpoint a (long-standing) bug.

Back to the environment var: If remember correctly from looking at the code, it influenced some kind of "dropout" and as extra busted the cache (which causes things to work for us), but I don't know what the influence of changing that specific dropout behavior has on training the model. Would be nice if the TF / Nvidia guys can give some comment on that, before we perhaps DS degrade training by missing any side effects.

Exactly.

By the way, I'm wondering: do you know how often we still use the cached version in your larger dataset test ? The difference of 20 seconds is so small, that either:

lissyx commented 4 years ago

That's true, perhaps my dutch heritage that policies are nice when they make sense ;)

Well, even fixing ruy computation on just-released r2.2 was not taken and only merged on master

lissyx commented 4 years ago

I'm also fascinated by the little help you get to get the requested test implemented, essentially blocking the patch, most opensource communities I have encountered so far are happy when you fix or even pinpoint (a long standing) bug.

Well, I can understand why they want that, I guess in their position I'd do the same. Looks like things are moving now, I hope this can go into a 1.15.4 or, in the worst case, we need a statement on the consequences of the flag.

lissyx commented 4 years ago

The fix landed upstream: https://github.com/tensorflow/tensorflow/pull/41832

lissyx commented 4 years ago

We still have no feedback whether a 1.15.4 can be issued for that.

applied-machinelearning commented 4 years ago

Perhaps we should try to stage it as a multi-stage rocket:

  1. First get the patch applied to the r1.15 tensorflow upstream branch; since it was filed as a bug against that branch, that seems reasonable, and as a bonus it applies cleanly.
  2. Then try to get a release for that branch.
  3. If we don't get a release, we could try to get it applied to mozilla-tensorflow.
  4. And perhaps even provide a prebuilt Docker base image for training based on the Dockerfile.build.tmpl file and publish that on Docker Hub ?

lissyx commented 4 years ago
  • First get the patch applied to the r1.15 tensorflow upstream branch, since it was filled as a bug against that, that seems reasonable and as a bonus it applies clean.

  • Then try to get a release for that branch.

(1) and (2) go together, it won't get picked on r1.15 if they don't intend to ship 1.15.4

If we don't get a release, we could try to get it applied to mozilla-tensorflow.

What for? Supporting tensorflow wheel builds is a huge task, we stopped doing that as soon as we could.

And perhaps even provide an prebuild docker base image for training based on the Dockerfile.build.tmpl file and publish that on docker hub ?

Same, that requires us to build and support a TensorFlow wheel, which is a lot of work.

applied-machinelearning commented 4 years ago
  • First get the patch applied to the r1.15 tensorflow upstream branch, since it was filled as a bug against that, that seems reasonable and as a bonus it applies clean.
  • Then try to get a release for that branch.

(1) and (2) goes together, it won't get picked on r1.15 if they don't intend to ship 1.15.4

If I look at https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some commits (not direct bug fixes) after 1.15.3 without an immediate release. And even some very recent commits.

If we don't get a release, we could try to get it applied to mozilla-tensorflow.

What for? Supporting tensorflow wheel builds is a huge tasks, we stopped doing that as soon as we can

And perhaps even provide an prebuild docker base image for training based on the Dockerfile.build.tmpl file and publish that on docker hub ?

Same, that requires us to build and support TensorFlow wheel, which is a lot of work.

Depends a bit on what you provide. For the 2.x branches I do agree, but since there is very little (relevant) movement on the 1.15 branch, that doesn't require much (or even any) rebuilding since nothing changes. And the question is whether you should build for every target. If it's only the most common one, x86 with the Python version from the Ubuntu CUDA dev image, it is all fairly limited but provides for the common training case.

lissyx commented 4 years ago

If i look at: https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some (non direct bug fix) commits after 1.15.3 without an immediate release. And even some very recent commits.

Then maybe they are considering a 1.15.4 ?

Depends a bit on what you provide. For the 2.x branches I do agree, but since there is were little (relevant) movement on the 1.15 branch that doesn't require very much (or even any) rebuilding since nothing changes. And the question is if you should build for every target. If it's the most common, x86 and only the python version from the ubuntu cuda dev image, it is all fairly limited put provides for the common training case.

You are highly underestimating:

Just building r1.15 for the purpose of those debugging steps took several local hacks. Re-using TensorFlow's CI Docker stuff also required a non-trivial amount of work.

andrenatal commented 4 years ago

I confirm that the flag addressed my issues and that I managed to train and get a fully functioning model.

lissyx commented 4 years ago

There has been quite a lot of activity on the r1.15 branch of TensorFlow, I think we can safely hope for a 1.15.4 that ships with the fix now (current upstream r1.15 has merged the fix). I'll close this issue when 1.15.4 ships.

DanBmh commented 4 years ago

Still not working for me with up to date master and newly created docker container. But as mentioned somewhere above, running export TF_CUDNN_RESET_RND_GEN_STATE=1 solved my problem.

lissyx commented 4 years ago

Still not working for me with up to date master and newly created docker container.

Can you triple check if you run 1.15.4 ?

But as mentioned somewhere above, running export TF_CUDNN_RESET_RND_GEN_STATE=1 solved my problem.

Maybe there are some other bugs. As you can see, it was already quite painful to investigate even with a small repro dataset. I'm unfortunately not in a position to have the time to investigate like that anymore for the foreseeable future.

DanBmh commented 4 years ago

Can you triple check if you run 1.15.4 ?

Running python3 -c 'import tensorflow as tf; print(tf.__version__)' gives me exactly 1.15.4.

I'm unfortunately not in the position to have the time to investigate like that anymore for the forseeable future.

No problem for me, the solution is easy, so I just will add the extra flag everywhere.


Not sure this helps, but for me the error always gets thrown in the validation phase; the first training epoch finishes without errors. This also happens if I switch the train and dev datasets. So I don't think the problem lies in the dataset here.

lissyx commented 4 years ago

I think @applied-machinelearning mentioned something like that on the upstream issue ?

applied-machinelearning commented 4 years ago

Yeah it is still on my todo list, but I also still have seen the error at least once. I think you can still have a cache hit while other stuff in the descriptor still differs (from memory .. , I thought rnn_mode was a likely candidate).

I think the pattern for this is when you have the same sequence lengths etc. in both the train and dev sets. That should be easy to test (just use the same CSV, with the same ordering, for both the train and dev datasets), but I haven't gotten around to actually doing it. I hope to get to testing this tomorrow or this weekend.

Still wondering if the whole caching idea doesn't do more harm than good. It seems error prone, and if you need to check every element, the cost of checking each time seems non-negligible (as your test seemed to indicate, where you didn't find that much difference in training times with or without the TF_CUDNN_RESET_RND_GEN_STATE env var).

Unfortunately there was no reaction from the nvidia guy, so it seems a new report needs to be opened. I will do that after testing.

But perhaps it is still a good idea to set the environment var from the deepspeech training code anyway ? I don't think there will be a TensorFlow (1.15.5) release any time soon, and most certainly not before a probable deepspeech 1.0 release.
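
A minimal sketch of what setting the variable from Python code could look like. This is not the DeepSpeech implementation; placing it before the TensorFlow import rests on the assumption that the variable must already be in the environment when TensorFlow creates its cuDNN RNN kernels.

```python
import os

# Assumption: the variable has to be in the environment before TensorFlow
# creates its cuDNN RNN kernels, so it is set before any TensorFlow import.
# setdefault keeps an explicit user choice (for example "0") untouched.
os.environ.setdefault("TF_CUDNN_RESET_RND_GEN_STATE", "1")

# import tensorflow as tf  # would follow here in a real training script
print(os.environ["TF_CUDNN_RESET_RND_GEN_STATE"])
```

Using `setdefault` rather than a plain assignment leaves room for turning the workaround off from the shell when comparing behavior.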

applied-machinelearning commented 4 years ago

Hmm, unfortunately I can't reproduce it with what I thought could trigger it (running training and validation on the same CSVs, sorted by wav_size). :(
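
For anyone who wants to repeat that experiment, a sorted CSV can be prepared roughly as follows. This is a sketch, assuming the usual DeepSpeech columns wav_filename, wav_filesize and transcript; the file names and sizes are made up, and `sort_by_wav_size` is a hypothetical helper.

```python
import csv
import io

FIELDS = ["wav_filename", "wav_filesize", "transcript"]


def sort_by_wav_size(rows):
    """Return the rows ordered by their integer wav_filesize column."""
    return sorted(rows, key=lambda row: int(row["wav_filesize"]))


# Tiny in-memory stand-in for a real CSV file.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "b.wav,2400,second\n"
    "a.wav,1200,first\n"
)
rows = sort_by_wav_size(csv.DictReader(sample))

# Write the sorted rows back out; the same output would be passed as both
# the train and the dev file to keep lengths and ordering identical.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```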

piraka9011 commented 4 years ago

It was very interesting following this thread! Learned a lot! Wanted to confirm that the suggested fix works: System Specs: Ubuntu 18.04, Nvidia Driver 410.104, Cuda 10.0, CUDNN 7.6.5, Ryzen 3700x, Nvidia GTX 1080

Added export TF_CUDNN_RESET_RND_GEN_STATE=1 and made the number of training samples divisible by my training batch size.

I didn't notice any significant loss in performance.

Edit: I am using the tensorflow/tensorflow:1.15.4-gpu-py3 Docker image as well