mozilla / DeepSpeech

DeepSpeech is an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.
Mozilla Public License 2.0

Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 31 , 12, 2048] #3088

Closed andrenatal closed 4 years ago

andrenatal commented 4 years ago

For support and discussions, please use our Discourse forums.

If you've found a bug, or have a feature request, then please create an issue with the following information:

set -xe

apt-get install -y python3-venv libopus0

python3 -m venv /tmp/venv
source /tmp/venv/bin/activate

pip install -U setuptools wheel pip
pip install .
pip uninstall -y tensorflow
pip install tensorflow-gpu==1.14

mkdir -p ../keep/summaries

data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"

python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr \
  --use_allow_growth \
  --log_level 0


I'm getting the following error when using my pt-BR 8 kHz dataset to train. I have tried downgrading and upgrading CUDA, cuDNN, the Nvidia drivers and Ubuntu (16 and 18), and the error persists. I have tried with datasets of two different characteristics, 6 s and 15 s in length; both contain audio at 8 kHz.

andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts

W0618 12:30:10.324707 139639980619584 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:

WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dt ype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a f uture version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype i s deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. 
Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts. 2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:05:00.0 2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:06:00.0 2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:09:00.0 2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0 2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0 2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0 2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0 2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0 2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0 2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3 2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix: 2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186] 0 1 2 3 2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0: N Y Y Y 2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1: Y N Y Y 2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2: Y Y N Y 2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3: Y Y Y N 2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created 
TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute ca pability: 6.1) 2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute ca pability: 6.1) D Session opened. I Could not find best validating checkpoint. I Could not find most recent checkpoint. I Initializing all variables. 2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7 I STARTING Optimization Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0 Epoch 0 | Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303 2 020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.param s_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), w orkspace.size(), reserve_space.opaque(), reserve_space.size())' 2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_uni ts, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] Traceback (most recent call last): File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call return fn(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn target_list, run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun run_metadata) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. 
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]] (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]] [[tower_2/CTCLoss/_147]] 1 successful operations. 2 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "DeepSpeech.py", line 12, in ds_train.run_script() File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script absl.app.run(main) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run _run_main(main, args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main train() File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train trainloss, = run_set('train', epoch, train_init_op) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set feed_dict=feed_dict) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run run_metadata_ptr) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run feed_dict_tensor, options, run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run run_metadata) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call raise type(e)(node_def, op, message) tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found. (0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]] (1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048] [[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]] [[tower_2/CTCLoss/_147]] 1 successful operations. 2 derived errors ignored.

Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1': File "DeepSpeech.py", line 12, in ds_train.run_script() File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script absl.app.run(main) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run _run_main(main, args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main sys.exit(main(argv)) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main train()

File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_andloss logits, = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn sequence_lengths=seq_length) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in call outputs = super(Layer, self).call(inputs, *args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in call outputs = call_fn(cast_inputs, *args, *kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper return converted_call(f, options, args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call return _call_unconverted(f, args, kwargs, options) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted return f(args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call training) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward seed=self._seed) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn outputs, output_h, outputc, , _ = gen_cudnn_rnn_ops.cudnn_rnnv3(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3 time_major=time_major, name=name) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = 
tf_stack.extract_stack()

DanBmh commented 4 years ago

I have also been getting similar errors lately. In my case it often occurs at the end of an epoch: training works normally for a few epochs before I get the error. Mine has slightly different numbers than yours:

Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 1101, 30, 2048] 
     [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
     [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]

Reducing the batch size delayed the error until later in the training, so this may be a workaround you can try.

andrenatal commented 4 years ago

I have tried reducing the batch size, but to no avail.



kdavis-mozilla commented 4 years ago

@andrenatal What version of cuDNN are you using? TensorFlow 1.15 currently depends on CUDA 10.0 and cuDNN v7.6.
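
For reference, a quick way to confirm which TensorFlow and cuDNN builds a training process actually loads is to query them at runtime. This is only an illustrative sketch (not from the original thread), and it assumes libcudnn.so.7 is resolvable by the dynamic loader:

import ctypes

import tensorflow as tf

# cudnnGetVersion() returns the linked cuDNN version as an integer,
# e.g. 7605 for cuDNN 7.6.5.
libcudnn = ctypes.CDLL("libcudnn.so.7")
print("tensorflow:", tf.__version__)
print("cudnn:", libcudnn.cudnnGetVersion())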

andrenatal commented 4 years ago

I tried all the versions that @reuben suggested, including cuDNN 7.6.

lissyx commented 4 years ago

@andrenatal I know you already tested a lot of things, but this forum entry is interesting: https://forums.developer.nvidia.com/t/gpu-crashes-when-running-machine-learning-models/108252

Can you give it a spin with Python 3.7?

Shilpil commented 4 years ago

We tried running it with Python 3.7 but we faced the same error.

lissyx commented 4 years ago

We tried running it with Python 3.7 but we faced the same error.

Then I'm sorry, but the only way to get something actionable is to bisect the dataset to identify the offending files and debug from there.

applied-machinelearning commented 4 years ago

@lissyx As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; all had no effect.

But because, when it fails, it always fails consistently on the same step and thus the same batch, I tried to isolate things. I now have a small subset of my large dataset that always fails at epoch 27 with batch size 32; it's under 1500 samples and thus manageable in size.

I made some discoveries though:

So it seems that the combination (and probably the order) of certain samples in a batch consistently blows up with cuDNN (while in any other combination or order they don't).

I think the dataset subset is small enough to share with you (around 20 MB of samples), if that could help you determine why it actually blows up. I can also provide the Docker build script, run script, logging, and the patches I applied to the v0.7.4 tree (only the printing of the files in the batches and the replacement of the sort with the shuffle).
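
To illustrate the kind of patch described above (a standalone sketch, not the actual DeepSpeech v0.7.4 code): assuming the training samples are rows of a DeepSpeech-style CSV (wav_filename, wav_filesize, transcript), the idea is to shuffle the rows instead of sorting them by wav_filesize, and to print which files end up in each batch:

import csv
import random

def load_samples(csv_path, batch_size=32, shuffle=True, seed=42):
    # Load the rows of a DeepSpeech-style CSV (wav_filename, wav_filesize, transcript).
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if shuffle:
        # The workaround: random order instead of the default sort by file size.
        random.Random(seed).shuffle(rows)
    else:
        rows.sort(key=lambda r: int(r["wav_filesize"]))
    # Debugging aid: print which files land in which batch, so a failing
    # step can be mapped back to a concrete set of samples.
    for step, start in enumerate(range(0, len(rows), batch_size)):
        batch = rows[start:start + batch_size]
        print("step", step, [r["wav_filename"] for r in batch])
    return rows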

lissyx commented 4 years ago

I think the dataset subset is small enough to share with you (around 20 MB of samples), if that could help you determine why it actually blows up.

If it's a bug in TensorFlow / cuDNN, it's hardly something we can help with. I'm already lacking time for a lot of other urgent matters, and it seems you have more background and knowledge on the issue than I do ...

lissyx commented 4 years ago
* So I tried with the sorting in the sample loading replaced by a random.shuffle(), and training with cuDNN now doesn't blow up, even with the whole dataset (about 280000 samples).

It would still be interesting if you could share the order when it works, when it fails, and where it fails.

applied-machinelearning commented 4 years ago

I think the dataset subset is small enough to share with you (around 20 MB of samples), if that could help you determine why it actually blows up.

If it's a bug in TensorFlow / cuDNN, it's hardly something we can help with. I'm already lacking time for a lot of other urgent matters, and it seems you have more background and knowledge on the issue than I do ...

I merely reduced the problem space; I don't have more knowledge of the TensorFlow / DeepSpeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).

Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be? (I ask since the whole chain of CUDA 10, TensorFlow 1.15, etc. is probably unsupported by Nvidia as well, so we probably won't get any support from that side either, and there are now several people reporting issues with training on current DeepSpeech in this thread ...)

applied-machinelearning commented 4 years ago
* So I tried with the sorting in the sample loading replaced by a random.shuffle(), and training with cuDNN now doesn't blow up, even with the whole dataset (about 280000 samples).

It would still be interesting if you could share the order when it works, when it fails, and where it fails.

What would you like to have shared: only the CSV, or also the samples? (I think the problem is somewhere in the samples and not the transcripts, but of course I could be wrong.)

lissyx commented 4 years ago

Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?

@reuben Had a look at that, he knows better.

What would you like to have shared: only the CSV, or also the samples? (I think the problem is somewhere in the samples and not the transcripts, but of course I could be wrong.)

I think you would need to share the audio + CSV.

I merely reduced the problem space; I don't have more knowledge of the TensorFlow / DeepSpeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).

Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you covid-19).

reuben commented 4 years ago

Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?

Lots

applied-machinelearning commented 4 years ago

Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?

Lots

That is unfortunate.

Another question: I saw that the inference side of DeepSpeech now seems to work on TensorFlow 2.x; how much work would the training side be?

@reuben Had a look at that, he knows better.

What would you like to have shared: only the CSV, or also the samples? (I think the problem is somewhere in the samples and not the transcripts, but of course I could be wrong.)

I think you would need to share the audio + CSV.

OK, will do.

I merely reduced the problem space; I don't have more knowledge of the TensorFlow / DeepSpeech internals. And it would be nice if people could confirm (so it can be semi-worked around by not sorting).

Sure, but given the current workload, I really cannot promise having time to reproduce that: I am still lagging behind on a lot of other super-urgent matters, sadly (thank you covid-19).

OK, I will do some more experiments then and try to pinpoint it further: find out whether only the batch content matters, or also the state the graph/weights are in from the previous steps. If only the batch content matters, I will test what happens if you shuffle only that.

applied-machinelearning commented 4 years ago

@lissyx @reuben

I got the results of my extended testing, based on a minimal dataset of 3x 32 samples; since I use a batch size of 32, that is 3 steps. I named the batches A, B and C, and as a whole they are ordered by wav_filesize.

I have done runs with all sorts of combinations of these batches (concatenated in the order given in the name of the CSV file). If a batch name is suffixed with "s", that batch itself is still ordered by wav_filesize; if suffixed with "r", that batch is randomly shuffled. The runs do 3 epochs.

In the tar.gz file I included:

As a summary of the results:

train_debug_Ar_Br_Cr.csv, blows up in step 1, which is batch B
train_debug_Ar_Br_Cs.csv, blows up in step 1, which is batch B
train_debug_Ar_Bs_Cs.csv, blows up in step 1, which is batch B
train_debug_As_Br_Cs.csv, blows up in step 1, which is batch B
train_debug_As_Bs_Cs.csv, blows up in step 1, which is batch B
train_debug_As_Cs_Bs.csv, blows up in step 2, which is batch B
train_debug_As_Cs.csv, OK
train_debug_Bs_Cs.csv, OK
train_debug_Cs_As_Bs.csv, blows up in step 2, which is batch B
train_debug_Cs_As.csv, OK
train_debug_Cs_Bs_As.csv, blows up in step 1, which is batch B
train_debug_Cs_Bs.csv, blows up in step 1, which is batch B
train_debug_interbatch_random: all variants OK

My interpretation of these results:

  1. If it blows up, it is always at a step with batch B.
  2. It always blows up with the contents of this batch B, unless batch B is the very first step.
  3. The order of the files within batch B doesn't matter.
  4. It happens independently of the previous batches/steps (with the exception of B being the first batch).
  5. All inter-batch randomized variants run fine.

But what is so special about the content of batch B that it blows up with CUDNN ...

(Before you ask: it is not only this batch B; there are multiple such batches in my large dataset. This is just an example with the shortest samples.) Attachment: deepspeech_v0.7.4_cudnn_debug.tar.gz
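
For readers who want to reproduce this kind of bisection, here is a sketch of how the sorted three-batch combinations above could be generated. The per-batch file names (batch_A_sorted.csv, etc.) are hypothetical; the shuffled and two-batch variants can be produced the same way:

import csv
from itertools import permutations

def write_combo(out_path, batch_files):
    # Concatenate per-batch CSVs (each holding exactly one batch of rows)
    # into a single training CSV, preserving the given batch order.
    header, rows = None, []
    for path in batch_files:
        with open(path, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

batches = {"As": "batch_A_sorted.csv", "Bs": "batch_B_sorted.csv", "Cs": "batch_C_sorted.csv"}
for order in permutations(batches):
    write_combo("train_debug_" + "_".join(order) + ".csv",
                [batches[name] for name in order])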

lissyx commented 4 years ago

Nice @applied-machinelearning. Do you think you could reduce batch B even further, to a smaller set of files? If we can find out which file(s) trigger the behavior, it might be easier to investigate / check.

applied-machinelearning commented 4 years ago

Nice @applied-machinelearning. Do you think you could reduce batch B even further, to a smaller set of files? If we can find out which file(s) trigger the behavior, it might be easier to investigate / check.

I could try reducing the training batch size and see if I can find even smaller batches that fail. From previous tests I think it will end at either 2 or 4 (but not 1); I will give it a try tomorrow.

lissyx commented 4 years ago

As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; all had no effect.

So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)

That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?

applied-machinelearning commented 4 years ago

@lissyx @reuben

OK I have done some more runs:

I ran train_debug_As_Bs_Cs.csv with batch sizes 1 and 2:

Batch size 1 trains fine. Batch size 2 blows up on the step with files:

B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav
B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav

So I made some new csv files with:

batch A: two files from the original batch A
batch B: two files B/98_2923 and B/154_4738 from batch B
batch C: two files from the original batch C

And I made some variant of that:

train_debug_mini_As_Bs_Cs.csv
train_debug_mini_Bs_As_Cs.csv
train_debug_mini_Bs_As_Cs_B_swapped.csv
train_debug_mini_As_Bs_Cs_B_swapped.csv
train_debug_mini_As_Bs_Cs_B_mixed_A.csv
train_debug_mini_As_Bs_Cs_B_mixed_C.csv
train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv

The results of that:

With batch size 1, these all work out fine (as expected).
With batch size 2:
train_debug_mini_As_Bs_Cs.csv
    blows up in step 1, which is batch B.

train_debug_mini_As_Bs_Cs_B_swapped.csv
    blows up in step 1, which is batch B, so swapping the order within B doesn't make a difference.

train_debug_mini_Bs_As_Cs.csv
    works fine, B is the first step 0.
    as expected as the first step seems to be a special case.

train_debug_mini_Bs_As_Cs_B_swapped.csv
    works fine, B is the first step 0, so swapping the order in B doesn't make a difference.
    as expected as the first step seems to be a special case.

train_debug_mini_As_Bs_Cs_B_mixed_A.csv
    blows up in step 1, which is:
        A/155_4757
        B/154_4738

train_debug_mini_As_Bs_Cs_B_mixed_C.csv
    blows up in step 1, which is:
        B/98_2923
        C/169_5271

train_debug_mini_As_Bs_Cs_B_mixed_C_2.csv
    blows up in step 1, which is:
        C/169_5271
        B/98_2923

train_debug_mini_As_Bs_Cs_B_swapped_C_mixed.csv
    blows up in step 2, which is:
        B/98_2923
        C/169_5271

    while it did complete step 1, which is:
        B/154_4738
        C/175_5429

My interpretation of this all:

So it is a bit odd; I'm starting to wonder if this is some edge case where we hit a math operation that blows up. But the two files from B have slightly different file sizes, and both blow up in combination with other files that also have slightly different sizes (from A and C).

So I'm a bit lost now. You have more insight into how things get processed; hopefully you have some more ideas based on that.

CSVs and logs are attached (the sample files from the previous post can be used): train_debug_mini.tar.gz
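
A quick sanity check on the two suspicious files could look like the sketch below (illustrative only, using the standard-library wave and audioop modules): it dumps basic properties to look for anything unusual such as an unexpected sample rate, zero-length audio or clipping.

import audioop
import wave

def describe(path):
    # Print basic properties of a WAV file.
    with wave.open(path, "rb") as w:
        frames = w.readframes(w.getnframes())
        duration = w.getnframes() / float(w.getframerate())
        print(path)
        print("  rate:", w.getframerate(), "Hz, channels:", w.getnchannels(),
              ", sample width:", w.getsampwidth(), "bytes")
        print("  duration:", round(duration, 3), "s, peak amplitude:",
              audioop.max(frames, w.getsampwidth()))

for f in ["B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav",
          "B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav"]:
    describe(f)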

applied-machinelearning commented 4 years ago

As I am also affected by this, I tried everything from Python versions, different Docker builds, and different host drivers to checking my dataset for evident errors; all had no effect.

So I see you are basically reusing the TensorFlow official Docker image and you got inspiration from https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train :)

I think it was an Italian DS/CV repo I drew inspiration from, but they probably took it from the French one ;). Previously I also tried a Docker build with an ubuntu18.04-cuda10 image as a base, with tensorflow-gpu 1.15.3.

That's good, that should make it easier for us to try and reproduce locally. Can you share more details on your actual underlying system and hardware, in case it might be related?

The host is an AMD Ryzen system with 32 GB of RAM and a GTX 1070 with 8 GB of memory, running Debian. The host Nvidia driver is now 440.100 (but I have tried several others, still the same problems). If you need more specifics, please indicate what info you need.

Thanks for looking into it !

lissyx commented 4 years ago


Thanks, running Sid as well here, so I'm on similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.

applied-machinelearning commented 4 years ago

Thanks, running Sid as well here, so I'm on similar setup, except I have 2x (faster, more memory) GPUs. I hope it will still allow me to repro.

I'm running Buster on that machine. When I woke up this morning it dawned on me that I had forgotten to post the hyperparameter stuff, so attached is the script I used in the Docker container to run the tests. The feature cache, checkpoint dir, etc. all get cleaned up before the run.

run_deepspeech_var_batchsize.sh.tar.gz

I hope you can reproduce and spot something !

lissyx commented 4 years ago


Looks like clean.sh is missing, and I get FATAL Flags parsing error: flag --alphabet_config_path=./data/lm/plaintext_alpha.txt: The file pointed to by --alphabet_config_path must exist and be readable. I don't want to sound rude, but could you assemble a dump-proof Docker setup or script to minimally reproduce the issue? There is already enough complexity and there are enough variables interacting; I really need to be 1000% sure I'm reproducing your exact steps to assert whether I can reproduce the issue :/

lissyx commented 4 years ago

I'm not even able to get CUDA working so far in the dockerfile :/

lissyx commented 4 years ago

I'm not even able to get CUDA working so far in the dockerfile :/

Seems to be the same old weird nvidia/cuda/docker bug, after ldconfig it works:

tf-docker ~ > sudo ldconfig
tf-docker ~ > nvidia-smi 
Thu Jul  9 10:16:44 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:21:00.0 Off |                  N/A |
|  0%   34C    P8     1W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce RTX 208...  On   | 00000000:4B:00.0 Off |                  N/A |
|  0%   35C    P8    20W / 250W |      0MiB / 11019MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
tf-docker ~ > python -c "import tensorflow as tf; tf.test.is_gpu_available()"
2020-07-09 10:16:48.233166: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-07-09 10:16:48.264242: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2900325000 Hz
2020-07-09 10:16:48.271101: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d55f00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-07-09 10:16:48.271144: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-07-09 10:16:48.272884: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-07-09 10:16:54.029647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.046529: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.047194: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5d58840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-07-09 10:16:54.047218: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-09 10:16:54.047253: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2020-07-09 10:16:54.047656: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.048468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:21:00.0
2020-07-09 10:16:54.048551: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.049324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:4b:00.0
2020-07-09 10:16:54.049585: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-09 10:16:54.057643: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-07-09 10:16:54.061562: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-07-09 10:16:54.066658: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-07-09 10:16:54.077684: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-07-09 10:16:54.081287: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-07-09 10:16:54.107985: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-07-09 10:16:54.108254: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.109206: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.110043: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.110885: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.111644: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1
2020-07-09 10:16:54.111707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-07-09 10:16:54.113783: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-09 10:16:54.113802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 
2020-07-09 10:16:54.113811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N N 
2020-07-09 10:16:54.113821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   N N 
2020-07-09 10:16:54.113979: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.114808: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.115627: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.116444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10311 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:21:00.0, compute capability: 7.5)
2020-07-09 10:16:54.117023: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-09 10:16:54.117508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:1 with 10311 MB memory) -> physical GPU (device: 1, name: GeForce RTX 2080 Ti, pci bus id: 0000:4b:00.0, compute capability: 7.5)

lissyx commented 4 years ago

@applied-machinelearning Good news, I repro your issue.

lissyx commented 4 years ago

@applied-machinelearning Not only do I repro it, but apt update && apt upgrade changes the issue: first it was exploding at epoch 1, now at epoch 2.

lissyx commented 4 years ago

Several people report a similar issue with NVIDIA drivers above a certain version: https://github.com/tensorflow/tensorflow/issues/35950#issuecomment-577427083, and 431.36 would be a working one.

lissyx commented 4 years ago

https://forums.developer.nvidia.com/t/cudnn-lstm-is-broken-above-driver-431-60-unexpected-event-status-1-cuda/108800

lissyx commented 4 years ago

Fun: gpu_options=tfv1.GPUOptions(per_process_gpu_memory_fraction=0.05) triggers the issue at the very beginning.
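
For context, a minimal sketch of how that option is passed to a TF 1.x session; the tfv1 alias matches the snippet above, but this is not the DeepSpeech training code itself:

import tensorflow.compat.v1 as tfv1

# Cap the process at roughly 5% of GPU memory, as in the experiment above.
gpu_options = tfv1.GPUOptions(per_process_gpu_memory_fraction=0.05)
config = tfv1.ConfigProto(gpu_options=gpu_options)

with tfv1.Session(config=config) as session:
    print(session.run(tfv1.constant("session created with capped GPU memory")))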

applied-machinelearning commented 4 years ago

Looks like clean.sh is missing, and I get FATAL Flags parsing error: flag --alphabet_config_path=./data/lm/plaintext_alpha.txt: The file pointed to by --alphabet_config_path must exist and be readable. I don't want to sound rude, but could you assemble a dump-proof Docker setup or script to minimally reproduce the issue? There is already enough complexity and there are enough variables interacting; I really need to be 1000% sure I'm reproducing your exact steps to assert whether I can reproduce the issue :/

Sorry for that, didn't expect you to run it literally.

Several people report a similar issue with NVIDIA drivers above a certain version: tensorflow/tensorflow#35950 (comment), and 431.36 would be a working one.

Thanks for figuring this out; it didn't come up with my google-fu.

Hmm, I will see if I can give that driver a shot this evening, although I can't find 431.36 in the download archive at https://www.nvidia.com/en-us/drivers/unix/linux-amd64-display-archive/

This one seems to be the closest:

Version: 430.64 Operating System: Linux 64-bit Release Date: November 5, 2019

And it probably means downgrading the kernel as well to something semi-ancient :( (edit: hmm from the description it should compile with kernel 5.4, not too ancient)

lissyx commented 4 years ago

Version: 430.64 Operating System: Linux 64-bit Release Date: November 5, 2019

And it probably means downgrading the kernel as well to something semi-ancient :(

On Buster you might have more chances to succeed compared to me on Sid.

applied-machinelearning commented 4 years ago

On Buster you might have more chances to succeed compared to me on Sid.

Kernels should be fairly independent of the rest of the system.

Are you going to address this with Nvidia, or do you know the best way to do so? (It seems the problem itself has been noted for quite some time without a fix appearing in newer drivers.)

lissyx commented 4 years ago

On Buster you might have more chances to succeed compared to me on Sid.

Kernels should be fairly independent of the rest of the system.

Are you going to address this with Nvidia, or do you know the best way to do so? (It seems the problem itself has been noted for quite some time without a fix appearing in newer drivers.)

I have no idea.

lissyx commented 4 years ago

There are some hints in some of the reports that it might be related to the ordering of sequence_length. I'd like to get a better grasp of that and confirm it, so maybe we could at least have some tooling / workaround to help with it.
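
To look at that, one could approximate the per-sample sequence_length from the clip duration and print it per batch. The sketch below is only illustrative: it assumes a 20 ms feature window step (DeepSpeech's default --feature_win_step, unless overridden) and reuses one of the debug CSVs from the experiments above, with the usual wav_filename column.

import csv
import wave

WIN_STEP_MS = 20  # assumed feature window step; adjust to the value actually used

def seq_length(wav_path):
    # Approximate number of feature frames for one clip.
    with wave.open(wav_path, "rb") as w:
        duration_ms = 1000.0 * w.getnframes() / w.getframerate()
    return int(duration_ms // WIN_STEP_MS)

batch_size = 2
with open("train_debug_mini_As_Bs_Cs.csv", newline="") as f:
    rows = list(csv.DictReader(f))

for step, start in enumerate(range(0, len(rows), batch_size)):
    lengths = [seq_length(r["wav_filename"]) for r in rows[start:start + batch_size]]
    print("step", step, "sequence_lengths:", lengths)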

@applied-machinelearning For fun: at some point, some combination of dataset, driver and TensorFlow version on our codebase would trigger a power surge on my hardware at home, and it was too much for my PSU, which kept shutting down :/

lissyx commented 4 years ago

@applied-machinelearning While it's not a workaround I like, it seems to help moving forward: changing to TensorFlow 1.14 gets me through the small example.

Is it something you could test on full / repro dataset on your side ?

lissyx commented 4 years ago

FROM tensorflow/tensorflow:1.14.0-gpu-py3

applied-machinelearning commented 4 years ago

Sure will test that before changing the driver.

lissyx commented 4 years ago

Sure will test that before changing the driver.

Like, I'm not sure whether it's just a side effect of a different TensorFlow version scheduling things differently (as you said, that was a point that matters), or whether it's because it depends on cuDNN 7.4 instead of 7.6 and behaves differently on that point.

applied-machinelearning commented 4 years ago

Hmm a bit busy and tired this evening, so I will postpone most testing till tomorrow, but I have done some tests with tensorflow/tensorflow:1.14.0-gpu-py3 and the 440.100 driver (the one I used with the failing tf1.15 image tests as well).

I've done all tests except the full-dataset one (so the 1500 samples, the 3x32 batches and the 3x2 batches), and all succeed with the tf-1.14 image, so I think you are correct. It's still debatable whether it's TF or cuDNN, but if I had to bet, I would bet on the different cuDNN version.

Will test the driver downgrade tomorrow and after that a run on the full dataset.

lissyx commented 4 years ago


OK, good to know we're making progress. I'm trying to check how the sequence_length variations are related.

applied-machinelearning commented 4 years ago

I ran the test with different drivers, preliminary results (will do a long test after this):

Nvidia host driver   Docker base image                      Short tests   Long test
440.100              tensorflow/tensorflow:1.14.0-gpu-py3   worked
440.100              tensorflow/tensorflow:1.15.2-gpu-py3   failed
430.64               tensorflow/tensorflow:1.14.0-gpu-py3   worked
430.64               tensorflow/tensorflow:1.15.2-gpu-py3   failed
450.57               tensorflow/tensorflow:1.14.0-gpu-py3   worked        worked
450.57               tensorflow/tensorflow:1.15.2-gpu-py3   failed        failed

440.100 was the driver I was using originally. 430.64 is the driver downloadable just below the 431.36 that was reported as working on the TF forum (Nvidia's versioning may differ, so it might not actually be below 431.36, but it was my best guess). 450.57 is the latest stable driver, released yesterday.

So from this I would conclude that the host driver version doesn't matter. And I haven't been able to prove that the TF 1.14 image doesn't work :)

Will start a long test now with the TF14 image.

lissyx commented 4 years ago

I just verified and I repro with cudnn v7.6.1 as well. I think I should try and rebuild tf 1.15.2 docker with cudnn 7.6, 7.5 and 7.4 to assert here.

applied-machinelearning commented 4 years ago

I updated the table above; I think I'm convinced enough to say that the TF 1.14 image doesn't have the problem. Hope you succeed in pinning it down to a particular cuDNN version.

lissyx commented 4 years ago

Ok, it required a bit of hacking, but I leveraged TensorFlow's CI build scripts to produce some 1.15.2 CUDA-enabled Python 3.6 wheels with different cudnn7 linkage. Currently I have 7.4 done and 7.5 in progress, soon finished. Next steps are:

lissyx commented 4 years ago

So, TensorFlow r1.15.2 CUDA 10.0, with host driver 440.100:

lissyx commented 4 years ago

So, TensorFlow r1.15.2 CUDA 10.0, with host driver 440.100:

* libcudnn 7.6.5.32: fail

* libcudnn 7.5.1.10: fail

* libcudnn 7.4.2.24: success

To build the TensorFlow wheels:

lissyx commented 4 years ago

To repro the issue:

lissyx commented 4 years ago

@applied-machinelearning It would be awesome if you could cross-check on your side; by varying just the cuDNN version we limit the risk that the issue is just being masked by a different TensorFlow version.