Still no repro with this reverted after 20 tries, I guess we are on track now.
revert-24297a4cb9120351643f7ac3916e7398236ccc0d.patch.txt
@applied-machinelearning the revert of this is a bit non-trivial; here is the diff for it, if you are willing to rebuild tensorflow-gpu completely (it might take some hours depending on your hardware)
Do you have a docker build file for that lying around, or did you do it on bare metal?
I did it on bare metal. I'll continue tomorrow to see if we can act from the python side instead of rebuilding.
Looks like it's not something we have leverage on from python code. I'll run a few checks to see if there is anything obvious and, if not, we'll follow up with a tensorflow issue.
Another weird behavior: forcing ShouldUsePaddedIO to return false breaks the training. So maybe it needs more digging into that patch itself.
Interestingly enough, from the pull request it seems they also had intermittent test failures, but they were apparently not addressed.
With a stock TF 1.15 the error message indicates we are running cudnnRNNForwardTrainingEx() when it breaks. The cudnn docs indicate it is the version of the function for padded data: https://docs.nvidia.com/deeplearning/sdk/cudnn-archived/cudnn_765/cudnn-api/index.html#cudnnRNNForwardTrainingEx
We don't get CUDNN_STATUS_BAD_PARAM back, so it seems to accept the parameters listed there and not blow up immediately on those. We get CUDNN_STATUS_EXECUTION_FAILED instead.
It also lists some conditions for the data layout:
This routine is the extended version of the cudnnRNNForwardTraining() function. The cudnnRNNForwardTrainingEx() allows the user to use unpacked (padded) layout for input x and output y.
In the unpacked layout, each sequence in the mini-batch is considered to be of fixed length, specified by maxSeqLength in its corresponding RNNDataDescriptor. Each fixed-length sequence, for example, the nth sequence in the mini-batch, is composed of a valid segment specified by the seqLengthArray[n] in its corresponding RNNDataDescriptor; and a padding segment to make the combined sequence length equal to maxSeqLength.
And also for the order within the mini-batch in a special case:
With the unpacked layout, both sequence major (meaning, time major) and batch major are supported. For backward compatibility, the packed sequence major layout is supported. However, similar to the non-extended function cudnnRNNForwardTraining(), the sequences in the mini-batch need to be sorted in descending order according to length.
My interpretation of the last piece would be: you can stuff packed/unpadded sequences into cudnnRNNForwardTrainingEx(), although it is meant for unpacked/padded data, on the premise that the sequences are sorted in descending order according to length.
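To make that interpretation concrete, here is a minimal, hypothetical Python sketch (variable names invented; this is not DeepSpeech or TensorFlow code) of what "sorted in descending order according to length" would mean for a mini-batch before it is handed to the packed-layout path:

import numpy as np

def sort_batch_desc(features, seq_lengths):
    # features: list of per-sample arrays with shape [time, n_input]
    # seq_lengths: per-sample lengths, in the same order as `features`
    order = np.argsort(seq_lengths)[::-1]        # indices, longest sequence first
    feats = [features[i] for i in order]
    lengths = np.asarray(seq_lengths)[order]
    return feats, lengths, order                 # `order` lets labels be permuted too

if __name__ == "__main__":
    feats = [np.zeros((74, 26)), np.zeros((75, 26))]
    _, lengths, order = sort_batch_desc(feats, [74, 75])
    print(lengths, order)                        # [75 74] [1 0]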
Which function did your test blow up with when "forcing ShouldUsePaddedIO to return false": cudnnRNNForwardTrainingEx() or cudnnRNNForwardTraining(), i.e. the extended or the non-extended version?
Unfortunately it's quite a pain to get stuff printed at some interesting places ... the log output of the docker images you made (issue3088_7.6.5.3 etc.) does output a lot of extra internal data, but I find it hard to interpret.
It looks like very much the same:
2020-07-22 10:20:21.132203: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 10:20:21.132242: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
This is with forcing it this way:
diff --git a/tensorflow/core/kernels/cudnn_rnn_ops.cc b/tensorflow/core/kernels/cudnn_rnn_ops.cc
index 4a27394f28..c3aa386ea7 100644
--- a/tensorflow/core/kernels/cudnn_rnn_ops.cc
+++ b/tensorflow/core/kernels/cudnn_rnn_ops.cc
@@ -1463,8 +1463,8 @@ class CudnnRNNForwardOp<GPUDevice, T> : public CudnnRNNKernelCommon {
context, model_types(), time_major, &input,
&input_h, &input_c, &params,
&sequence_lengths, num_proj, &model_shapes));
- use_padded_io =
- ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
+ use_padded_io = false;
+ // ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
} else {
OP_REQUIRES_OK(context,
ExtractForwardInput(context, model_types(), time_major,
@@ -1863,8 +1863,8 @@ class CudnnRNNBackwardOp<GPUDevice, T> : public CudnnRNNKernelCommon {
context, model_types(), time_major, &input,
&input_h, &input_c, &params,
&sequence_lengths, num_proj, &model_shapes));
- use_padded_io =
- ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
+ use_padded_io = false;
+ // ShouldUsePaddedIO(sequence_lengths, model_shapes, time_major);
} else {
OP_REQUIRES_OK(context,
ExtractForwardInput(context, model_types(), time_major,
Not what I would have expected, but forcing use_padded_io to true does indeed ... work? I have four retries that are all working. It's ... weird.
Or it's consistent: our data requires padding, so it needs to be padded.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 11:09:21.092508: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 0 | Training | Elapsed Time: 0:00:05 | Steps: 1 | Loss: 189.949219
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
2020-07-22 11:09:30.528222: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 11:09:30.528291: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1522 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
I also captured the debug output I added, on a non-repro case:
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 11:30:10.205219: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 0 | Training | Elapsed Time: 0:00:01 | Steps: 1 | Loss: 189.949173
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 107.969971
Epoch 2 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 2 | Training | Elapsed Time: 0:00:00 | Steps: 1 | Loss: 73.916595
I FINISHED optimization in 0:00:03.484708
D Session closed.
Those two logs seem to fit your analysis @applied-machinelearning: it crashes when max_seq_length=75 is not the first value we push.
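For reference, here is a tiny Python model of the decision the ShouldUsePaddedIO debug output above appears to show; this is an inference from the printed values (padded I/O whenever some sequence is shorter than max_seq_length, for time-major data), not the actual TensorFlow source:

def should_use_padded_io(seq_lengths, max_seq_length, time_major=True):
    # Inferred rule: take the non-padded (packed) path only when every sequence
    # in the mini-batch fills max_seq_length and the data is time-major.
    all_max = all(length == max_seq_length for length in seq_lengths)
    return not (all_max and time_major)

# Matches the rv=false / rv=true lines in the logs above:
assert should_use_padded_io([74, 74], 74) is False
assert should_use_padded_io([74, 75], 75) is True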
Could it be handy not to short-circuit the loop in ShouldUsePaddedIO() and let it print the whole list of sequence sizes and their order in that mini-batch before returning, to get it completely clear (or print the whole seq_array at the start)?
And it doesn't seem to need to be completely sorted within the mini-batch either: in the non-repro case, [75, 74, 74, 75, 74, 74] seems to work as well. So it looks like at least the first max_seq_length should be the largest, or one of the largest, for the mini-batch?
Those latest logs were not produced by forcing any return value of ShouldUsePaddedIO.
Here is some:
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 13:18:12.007877: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
CudnnRNNBackwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNBackwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 0 | Training | Elapsed Time: 0:00:05 | Steps: 1 | Loss: 189.949219
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
2020-07-22 13:18:20.837671: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
Hm, it's not that clear. Here is a log after reversing the ordering when we read the CSV file. As you can see, 75 is now the first one and yet it fails:
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 15:01:00.634318: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNBackwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
CudnnRNNBackwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
Epoch 0 | Training | Elapsed Time: 0:00:05 | Steps: 1 | Loss: 190.842316
--------------------------------------------------------------------------------
Epoch 1 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
[('csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav', 48366, 'en zijn huisje verlaten was'), ('csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav', 48366, 'dat vertrokken mondje hij'), ('csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav', 48368, 'was zo woest dat'), ('csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav', 48520, 'hij gaf geen antwoord'), ('csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav', 48524, 'en in de tien minuten die de lift'), ('csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav', 48524, 'informeerde hij of die')]
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
2020-07-22 15:01:09.398352: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 15:01:09.398438: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=74
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=74
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=74
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=74
ShouldUsePaddedIO rv=false all_max_seq_length=true
Hm, it's not that clear. Here is a log after reversing the ordering when we read the CSV file. As you can see, 75 is now the first one and yet it fails:
Are you sure? The mini-batch sequence array still seems to be [74, 75] and not [75, 74]?
Are you referring to those lines?
ShouldUsePaddedIO seq_array[0]=74 ShouldUsePaddedIO seq_array[1]=75
Yes.
Well, I'm not 100% sure, because we also have logs where there is this seq_array ordering and it succeeds.
Reversing the order, and it still explodes:
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-07-22 15:59:45.113520: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
generate_values <deepspeech_training.util.sample_collections.CSV object at 0x7f63846338d0>
yield generate_values 0 csvs/../data/C/175_5429_67ed7914b9a3bac4e46dd42a5721a95f-e31a33c85ca8249476596c1ff7fc2f67.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a14d0>
yield generate_values 1 csvs/../data/C/169_5271_3210ac3e97626f9c1515cb019e5fa36e-dd839274af12610f137398ddd01f85f8.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a15d0>
yield generate_values 2 csvs/../data/B/154_4738_2f841fb1af523c579414e0358ab16295-6aea9aa95b1bdbfd80703754cd8a180c.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1150>
yield generate_values 3 csvs/../data/B/98_2923_a387275540ba5f2159c37eaee3e4e9a0-651926517a6241fd9bb5942777b1f0ff.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1550>
yield generate_values 4 csvs/../data/A/163_5029_3498779ce37873475394654801cc3888-8fddd9522baf442463171802a7e57489.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1750>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
yield generate_values 5 csvs/../data/A/155_4757_9bc6d6f754547a09bbcf70e42d8e2a27-b112945da6818223ab8e1daf80313a62.wav <deepspeech_training.util.sample_collections.LabeledSample object at 0x7f67465a1190>
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=75
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=75
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO [1]: seq_array[i]=75
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=false all_max_seq_length=true
CudnnRNNBackwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=75
ShouldUsePaddedIO seq_array[1]=75
ShouldUsePaddedIO [0]: seq_array[i]=75
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO [1]: seq_array[i]=75
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=false all_max_seq_length=true
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 1 | Loss: 189.891312
batch_fn <_VariantDataset shapes: <unknown>, types: tf.string> 2 <_VariantDataset shapes: (?, 26), types: tf.float32> <_VariantDataset shapes: (), types: tf.int32>
CudnnRNNForwardOp
ShouldUsePaddedIO time_major=1
ShouldUsePaddedIO seq_array[0]=75
ShouldUsePaddedIO seq_array[1]=74
ShouldUsePaddedIO [0]: seq_array[i]=75
ShouldUsePaddedIO [0]: model_shapes.max_seq_length=75
ShouldUsePaddedIO [1]: seq_array[i]=74
ShouldUsePaddedIO [1]: model_shapes.max_seq_length=75
ShouldUsePaddedIO rv=true all_max_seq_length=false
2020-07-22 15:59:48.820700: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED
in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-07-22 15:59:48.820756: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1527 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 75, 2, 2048]
@applied-machinelearning I don't know what you think, but this defies all the assumptions I can make based on the cudnn API doc and what we observe. Something is missing to explain what triggers the issue; so far, hacking the ordering does not seem to really be the trigger, but I can't figure it out. And with just "CUDNN_STATUS_EXECUTION_FAILED" as feedback and no really usable debug information because of the closedness of CUDA, I don't see how we can investigate more without wasting our time on that.
I think all of this should be enough to file a bug and directly ping the committer of the bisected commit and the TF var_length_sequence stuff, as they can also double-check the cuDNN code and have more insight into all the (data) requirements.
Ah, I see you just did that, apart from mentioning the Nvidia committer (which could be worthwhile to get attention from the relevant people faster).
Indeed, I was preparing extra info and pinged this person as well. Let's hope they can quickly check on their side and come back to us.
Great and thanks again for all your effort so far !
@applied-machinelearning So, we've got some feedback from the nvidia dev, and it seems TF_CUDNN_RESET_RND_GEN_STATE=1 does help here. I'm unsure of the implications, especially in terms of performance, but maybe you can give that a try on your full dataset; this could help us assert:
- it is indeed related to the issue
- have an idea of the perf impact
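For anyone wanting to try the flag without touching any code, here is a quick, hypothetical Python sketch that sets it only for the training process; the DeepSpeech.py flags below are placeholders loosely taken from the repro command later in this issue, not a recommended configuration:

import os
import subprocess

# Copy the current environment and add the workaround flag for this run only.
env = dict(os.environ, TF_CUDNN_RESET_RND_GEN_STATE="1")

subprocess.run(
    ["python3", "-u", "DeepSpeech.py",
     "--train_files", "train.csv",   # placeholder paths/flags
     "--train_cudnn"],
    env=env,
    check=True,
)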
I was away from keyboard this weekend, running tests now. The short tests work with that ENV variable set; now running the longer one. Edit: The long test also works.
I'm presently running one or two training epochs with TF_CUDNN_RESET_RND_GEN_STATE=0 / TF_CUDNN_RESET_RND_GEN_STATE=1 to assert the impact.
So I could not spot any huge difference: only a 20-second-per-epoch slowdown.
I think you are thinking about setting this environment var from the DS code as a workaround for not getting a TF 1.15.4 release?
At least to know if it's a good thing to debug people with that, or if we are creating underlying issues.
(I think it's not very wise to keep TF 1.15 that broken in the first place: it wastes a lot of resources everywhere, with people having their training go bust and perhaps trying to debug it again (for all projects and people still using TF 1.15 with LSTM), while it is a straightforward and simple fix, so it would be a nice "reward" for digging into this and fixing a thing that went uncaught for so many releases), but that is my not-so-humble opinion about this.
Sure, but it's not in our hands nor in the hands of people who will review the PR, there's a policy and they might have their hands tied.
Back to the environment var: if I remember correctly from looking at the code, it influences some kind of "dropout" and, as a side effect, busts the cache (which is what makes things work for us), but I don't know what influence changing that specific dropout behavior has on training the model. It would be nice if the TF / Nvidia guys could comment on that before we perhaps degrade DS training by missing some side effects.
Exactly.
That's true; perhaps it's my Dutch heritage that policies are nice when they make sense ;) I'm also fascinated by how little help you get to get the requested test implemented, essentially blocking the patch; most open-source communities I have encountered so far are happy when you fix or even pinpoint a (long-standing) bug.
By the way, I'm wondering: do you know how often we still use the cached version on your larger dataset test? The difference of 20 seconds is so small that either:
That's true; perhaps it's my Dutch heritage that policies are nice when they make sense ;)
Well, even fixing the ruy computation on the just-released r2.2 was not taken and was only merged on master.
I'm also fascinated by how little help you get to get the requested test implemented, essentially blocking the patch; most open-source communities I have encountered so far are happy when you fix or even pinpoint a (long-standing) bug.
Well, I can understand why they want that; I guess in their position I'd do the same. Looks like things are moving now. I hope this can go into a 1.15.4 or, in the worst case, we need a statement on the consequences of the flag.
The fix landed upstream: https://github.com/tensorflow/tensorflow/pull/41832
We still have no feedback on whether a 1.15.4 can be issued for that.
Perhaps we should try to stage it as a multi-stage rocket:
First get the patch applied to the r1.15 tensorflow upstream branch; since it was filed as a bug against that, that seems reasonable, and as a bonus it applies cleanly.
Then try to get a release for that branch.
(1) and (2) go together; it won't get picked up on r1.15 if they don't intend to ship 1.15.4.
If we don't get a release, we could try to get it applied to mozilla-tensorflow.
What for? Supporting tensorflow wheel builds is a huge task; we stopped doing that as soon as we could.
And perhaps even provide a prebuilt docker base image for training, based on the Dockerfile.build.tmpl file, and publish that on docker hub?
Same, that requires us to build and support the TensorFlow wheel, which is a lot of work.
If I look at https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some (non-direct bug fix) commits after 1.15.3 without an immediate release. And even some very recent commits.
Same, that requires us to build and support the TensorFlow wheel, which is a lot of work.
Depends a bit on what you provide. For the 2.x branches I do agree, but since there is very little (relevant) movement on the 1.15 branch, it doesn't require much (or even any) rebuilding, since nothing changes. And the question is whether you should build for every target. If it's only the most common one, x86 with only the python version from the ubuntu cuda dev image, it is all fairly limited but provides for the common training case.
If I look at https://github.com/tensorflow/tensorflow/commits/r1.15 I do see some (non-direct bug fix) commits after 1.15.3 without an immediate release. And even some very recent commits.
Then maybe they are considering a 1.15.4 ?
You are highly underestimating:
Just building r1.15 for the purpose of those debugging steps took several local hacks. Re-using TensorFlow's CI Docker stuff also required a non trivial amount of work.
I confirm that the flag addressed my issues and allowed me to train and get a fully functioning model.
There has been quite a lot of activity on the r1.15 branch of TensorFlow; I think we can safely hope for a 1.15.4 that ships with the fix now (current upstream r1.15 has merged the fix). I'll close this issue when 1.15.4 ships.
Still not working for me with up-to-date master and a newly created docker container.
But as mentioned somewhere above, running export TF_CUDNN_RESET_RND_GEN_STATE=1 solved my problem.
Can you triple check that you are running 1.15.4?
Maybe there are some other bugs. As you can see, it was quite painful to investigate already, even with a small repro dataset. I'm unfortunately not in a position to have the time to investigate like that anymore for the foreseeable future.
Running python3 -c 'import tensorflow as tf; print(tf.__version__)' gives me exactly 1.15.4.
No problem for me, the solution is easy, so I will just add the extra flag everywhere.
Not sure if this helps, but for me the error always gets thrown in the validation phase; the first training epoch finishes without errors. This also happens if I switch the train and dev datasets, so I don't think the problem lies in the dataset here.
I think @applied-machinelearning mentioned something like that on the upstream issue?
Yeah, it is still on my todo list, but I have also still seen the error at least once. I think you can still have a cache hit while other stuff in the descriptor differs (from memory, I thought rnn_mode was a likely candidate).
I think the pattern for this is when you have the same sequence lengths etc. in both the train and dev set. It should be easily testable (just use the same csv, and keep the ordering the same, for both the train and dev datasets), but I haven't gotten around to actually doing it. I hope to get to testing this tomorrow or this weekend.
Still wondering if the whole caching idea doesn't do more harm than good. It seems error-prone, and if you need to check every element, the cost of checking each time seems non-negligible (as your test seemed to indicate, where you didn't find that much difference in training times with or without the TF_CUDNN_RESET_RND_GEN_STATE env var).
Unfortunately there was no reaction from the nvidia guy; it seems a new report is needed. I will open one after testing.
But perhaps it is still a good idea to implement setting the environment var from the deepspeech training code anyway? As I don't think there will be a Tensorflow (1.15.5) release any time soon, and most certainly not before a probable deepspeech 1.0 release.
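A minimal sketch of that idea, assuming the variable just needs to be in the process environment before TensorFlow initializes its cuDNN RNN state; where exactly this would live in the DeepSpeech training code (e.g. near the top of the training entry point) is an assumption, not existing DS code:

import os

# Respect an explicit user choice; otherwise default to the workaround.
os.environ.setdefault("TF_CUDNN_RESET_RND_GEN_STATE", "1")

import tensorflow as tf  # imported only after the variable is set

print("TF", tf.__version__,
      "TF_CUDNN_RESET_RND_GEN_STATE =", os.environ["TF_CUDNN_RESET_RND_GEN_STATE"])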
Hmm, unfortunately I can't reproduce with what I thought could trigger it (running training and validation on the same CSVs, sorted by wav_size). :(
It was very interesting following this thread! Learned a lot! I wanted to confirm that the suggested fix works. System specs: Ubuntu 18.04, Nvidia Driver 410.104, CUDA 10.0, cuDNN 7.6.5, Ryzen 3700x, Nvidia GTX 1080.
Added export TF_CUDNN_RESET_RND_GEN_STATE=1 and made my training batch size divisible by the number of training samples.
I didn't notice any significant loss in performance.
Edit: I am using the tensorflow/tensorflow:1.15.4-gpu-py3 Docker image as well.
For support and discussions, please use our Discourse forums.
If you've found a bug, or have a feature request, then please create an issue with the following information:
set -xe
apt-get install -y python3-venv libopus0
python3 -m venv /tmp/venv
source /tmp/venv/bin/activate
pip install -U setuptools wheel pip
pip install .
pip uninstall -y tensorflow
pip install tensorflow-gpu==1.14
mkdir -p ../keep/summaries
data="${SHARED_DIR}/data"
fis="${data}/LDC/fisher"
swb="${data}/LDC/LDC97S62/swb"
lbs="${data}/OpenSLR/LibriSpeech/librivox"
cv="${data}/mozilla/CommonVoice/en_1087h_2019-06-12/clips"
npr="${data}/NPR/WAMU/sets/v0.3"
python -u DeepSpeech.py \
  --train_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/treino_filtered_alphabet.csv \
  --dev_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/dev_filtered_alphabet.csv \
  --test_files /home/andre/projects/corpora/20200404084521_msprompts_90_6s/deepspeech/teste_filtered_alphabet.csv \
  --train_batch_size 12 \
  --dev_batch_size 24 \
  --test_batch_size 24 \
  --scorer ~/projects/corpora/deepspeech-pretrained-ptbr/kenlm.scorer \
  --alphabet_config_path ~/projects/corpora/deepspeech-pretrained-ptbr/alphabet.txt \
  --train_cudnn \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.40 \
  --epochs 150 \
  --noearly_stop \
  --audio_sample_rate 8000 \
  --save_checkpoint_dir ~/projects/corpora/deepspeech-fulltrain-ptbr \
  --use_allow_growth \
  --log_level 0
andre@andrednn:~/projects/DeepSpeech$ bash .compute_msprompts
tf.compat.v1.data.get_output_types(iterator)
. W0618 12:30:10.218584 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:347: Iterator.output_types (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_types(iterator)
. WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_shapes(iterator)
. W0618 12:30:10.218781 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:348: Iterator.output_shapes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_shapes(iterator)
. WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_classes(iterator)
. W0618 12:30:10.218892 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/data/ops/iterator_ops.py:350: Iterator.output_classes (from tensorflow.python.data.ops.iterator_ops) is deprecated and will be removed in a future version. Instructions for updating: Usetf.compat.v1.data.get_output_classes(iterator)
. WARNING:tensorflow: The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:W0618 12:30:10.324707 139639980619584 lazy_loader.py:50] The TensorFlow contrib module will not be included in TensorFlow 2.0. For more information, please see:
WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326326 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:342: calling GlorotUniform.init (from tensorflow.python.ops.init_ops) with dt ype is deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a f uture version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor W0618 12:30:10.326584 139639980619584 deprecation.py:506] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py:345: calling Constant.init (from tensorflow.python.ops.init_ops) with dtype i s deprecated and will be removed in a future version. Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where W0618 12:30:10.401312 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py:246: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version. Instructions for updating: Use tf.where in 2.0, which has the same broadcast rule as np.where WARNING:tensorflow:From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version. Instructions for updating: Use Variable.read_value. 
Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
W0618 12:30:11.297271 139639980619584 deprecation.py:323] From /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/training/slot_creator.py:193: Variable.initialized_value (from tensorflow.python.ops.variables) is deprecated and will be removed in a future version.
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automatically both in eager and graph (inside tf.defun) contexts.
2020-06-18 12:30:11.458650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:05:00.0
2020-06-18 12:30:11.459790: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:06:00.0
2020-06-18 12:30:11.460897: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:09:00.0
2020-06-18 12:30:11.462003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 pciBusID: 0000:0a:00.0
2020-06-18 12:30:11.462041: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-06-18 12:30:11.462071: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-06-18 12:30:11.462085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-06-18 12:30:11.462097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-06-18 12:30:11.462109: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-06-18 12:30:11.462121: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-06-18 12:30:11.462133: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-06-18 12:30:11.470539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0, 1, 2, 3
2020-06-18 12:30:11.470679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-18 12:30:11.470694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0 1 2 3
2020-06-18 12:30:11.470699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N Y Y Y
2020-06-18 12:30:11.470703: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 1:   Y N Y Y
2020-06-18 12:30:11.470707: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 2:   Y Y N Y
2020-06-18 12:30:11.470710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 3:   Y Y Y N
2020-06-18 12:30:11.476196: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10478 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
2020-06-18 12:30:11.477355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 10481 MB memory) -> physical GPU (device: 1, name: GeForce GTX 1080 Ti, pci bus id: 0000:06:00.0, compute capability: 6.1)
2020-06-18 12:30:11.478490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 10481 MB memory) -> physical GPU (device: 2, name: GeForce GTX 1080 Ti, pci bus id: 0000:09:00.0, compute capability: 6.1)
2020-06-18 12:30:11.479608: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 10481 MB memory) -> physical GPU (device: 3, name: GeForce GTX 1080 Ti, pci bus id: 0000:0a:00.0, compute capability: 6.1)
D Session opened.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
2020-06-18 12:30:12.233482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
2020-06-18 12:30:14.672316: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
Epoch 0 | Training | Elapsed Time: 0:00:16 | Steps: 33 | Loss: 18.239303
2020-06-18 12:30:30.589204: E tensorflow/stream_executor/dnn.cc:588] CUDNN_STATUS_EXECUTION_FAILED in tensorflow/stream_executor/cuda/cuda_dnn.cc(1778): 'cudnnRNNForwardTrainingEx( cudnn.handle(), rnn_desc.handle(), input_desc.data_handle(), input_data.opaque(), input_h_desc.handle(), input_h_data.opaque(), input_c_desc.handle(), input_c_data.opaque(), rnn_desc.params_handle(), params.opaque(), output_desc.data_handle(), output_data->opaque(), output_h_desc.handle(), output_h_data->opaque(), output_c_desc.handle(), output_c_data->opaque(), nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, nullptr, workspace.opaque(), workspace.size(), reserve_space.opaque(), reserve_space.size())'
2020-06-18 12:30:30.589243: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at cudnn_rnn_ops.cc:1517 : Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
Traceback (most recent call last):
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[{{node tower_0/cudnn_lstm/CudnnRNNV3_1}}]]
[[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
train()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 608, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 568, in run_set
feed_dict=feed_dict)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Internal: Failed to call ThenRnnForward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 63, 12, 2048]
[[node tower_0/cudnn_lstm/CudnnRNNV3_1 (defined at /home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_2/CTCLoss/_147]]
1 successful operations.
2 derived errors ignored.
Original stack trace for 'tower_0/cudnn_lstm/CudnnRNNV3_1':
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 968, in run_script
absl.app.run(main)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 940, in main
train()
File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 487, in train gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 313, in get_tower_results avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 240, in calculate_mean_edit_distance_andloss logits, = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 191, in create_model output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/deepspeech_training/train.py", line 129, in rnn_impl_cudnn_rnn sequence_lengths=seq_length) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/layers/base.py", line 548, in call outputs = super(Layer, self).call(inputs, *args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in call outputs = call_fn(cast_inputs, *args, *kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper return converted_call(f, options, args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call return _call_unconverted(f, args, kwargs, options) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted return f(args, kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call training) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward seed=self._seed) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn outputs, output_h, outputc, , _ = gen_cudnn_rnn_ops.cudnn_rnnv3(*args) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3 time_major=time_major, name=name) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func return func(args, **kwargs) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op attrs, op_def, compute_device) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal op_def=op_def) File "/home/andre/projects/DeepSpeech/venv/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in init self._traceback = 
tf_stack.extract_stack()
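For what it's worth, here is a minimal standalone sketch (my own, not taken from the training code) that exercises the same tf.contrib.cudnn_rnn.CudnnLSTM → CudnnRNNV3 path shown in the stack trace, using the shapes from the failing model config above (1 layer, 2048 input size / units, unidirectional, max_seq_length 63, batch size 12). The random data and unsorted variable lengths are only stand-ins for a real batch, and I have not confirmed it reproduces the failure; it just isolates the op the traceback goes through. Assumes stock TF 1.15 on a CUDA GPU.

```python
# Hypothetical reproducer sketch, not part of DeepSpeech: drives CudnnLSTM with
# variable sequence lengths so the CudnnRNNV3 op (cudnnRNNForwardTrainingEx) is used.
import numpy as np
import tensorflow as tf  # TF 1.15, graph mode

max_seq_length, batch_size, input_size, num_units = 63, 12, 2048, 2048

# Time-major input [max_time, batch, features], the layer's default layout.
inputs = tf.placeholder(tf.float32, [max_seq_length, batch_size, input_size])
seq_lengths = tf.placeholder(tf.int32, [batch_size])

lstm = tf.contrib.cudnn_rnn.CudnnLSTM(num_layers=1, num_units=num_units,
                                      direction='unidirectional')
# Passing sequence_lengths is what routes through CudnnRNNV3 instead of the
# fixed-length path.
outputs, _ = lstm(inputs, sequence_lengths=seq_lengths, training=True)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        feed = {
            inputs: np.random.randn(max_seq_length, batch_size,
                                    input_size).astype(np.float32),
            # Unsorted, variable lengths, like a padded training batch.
            seq_lengths: np.random.randint(1, max_seq_length + 1,
                                           size=batch_size).astype(np.int32),
        }
        sess.run(outputs, feed_dict=feed)
```

Since the failure is intermittent, a loop like this would presumably have to run for a while (and with length distributions closer to the real data) before anything shows up, but it keeps the surface area small compared to a full training run.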