albertz opened 4 years ago
For reference, with debug_add_check_numerics_ops = True:
InvalidArgumentError: optimize/gradients/lstm0_bw/rec/NativeLstm2_grad/GradOfNativeLstm2:0 : Tensor had NaN values
[[node add_check_numerics_ops/optimize/gradients/lstm0_bw/rec/NativeLstm2_grad/GradOfNativeLstm2_check_numerics (defined at /work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Or:
InvalidArgumentError: optimize/gradients/output/rec/att/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3:1 : Tensor had NaN values
[[node add_check_numerics_ops/optimize/gradients/output/rec/att/TensorArrayStack/TensorArrayGatherV3_grad/TensorArrayGrad/TensorArrayGradV3_check_numerics (defined at /work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Or:
InvalidArgumentError: optimize/gradients/output/rec/subnet_base/input_layers_moved_out/target_embed_ta_unstack/target_embed_ta_unstack/target_embed_ta_unstack_grad/TensorArrayGrad/TensorArrayGradV3:1 : Tensor had NaN values
Or:
InvalidArgumentError: optimize/gradients/conv1/convolution_grad/Conv2DBackpropFilter:0 : Tensor had NaN values
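(For context, and not the RETURNN implementation: this option roughly corresponds to tf.compat.v1.add_check_numerics_ops() in plain TF 1.x, which attaches a CheckNumerics op to every float tensor in the graph and raises exactly this kind of InvalidArgumentError. A minimal sketch on a toy graph:)

```python
import numpy as np
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()  # TF 1.x graph-mode semantics

x = tf.compat.v1.placeholder(tf.float32, (None,))
y = tf.math.log(x)  # produces nan for negative inputs

# Attach a CheckNumerics op to every floating-point tensor in the graph.
# Running it raises InvalidArgumentError ("... Tensor had NaN values")
# as soon as any checked tensor contains inf/nan.
check_op = tf.compat.v1.add_check_numerics_ops()

with tf.compat.v1.Session() as sess:
    sess.run([y, check_op], feed_dict={x: np.array([-1.0], dtype=np.float32)})
```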
I just noticed that when I run python3 tests/test_TFEngine.py (i.e. all the tests in there), I sometimes get inf/nan. This is with TF 1.15.2, CUDA 10.0, GTX 2070. It is not deterministic at all; often it all runs through fine. Also, when it occurs for some test (e.g. test_attention_train_then_search) and I then run that test specifically (via python3 tests/test_TFEngine.py test_attention_train_then_search), it does not occur anymore. It seems important to run the whole test suite. Maybe the problem is only triggered that way, when there was already certain (other) usage of the GPU before.
@critias @pavelgolik maybe you can also try to run that? At least it is some way to test this.
python3 tests/test_TFEngine.py test_rec_subnet_train_t3d_simple can reproduce it (sometimes, actually quite rarely, maybe in 2% of the cases, so run it maybe 100 times to get the error), and that is even with TF 1.14, CUDA 10.1. It might be much rarer than with TF 1.15. Not really sure... (This is not at i6 but some private env of my own.)
Edit: If you edit the test and set "num_epochs": 200 in that test, it becomes much more deterministic. I get an inf in >50% of the cases.
Some success! I have created a TF standalone test case here (latest version). This test case has some interesting behavior:
With tf.compat.v1.disable_v2_behavior(), CUDA 10.1, Python 3.7, private env, --reset_after_nsteps 100: Tensor had Inf values.
I reported the problem upstream: tensorflow #40228
Some update: Whenever CheckNumerics catches something TensorArray-related, it might be a false positive. The TensorArray uses a dummy "flow" tensor (for gradient management). The value of this flow tensor is completely irrelevant; it is only used to track the dependency order of other operations w.r.t. the TensorArray. That is why TF does not even initialize this tensor on GPU (see code), and thus the flow tensor can be inf or nan (or anything). This does not matter at all, as the value is ignored (but CheckNumerics might not ignore it).
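To illustrate (a minimal sketch, assuming TF 1.x graph mode; this is just the public TensorArray API, not the RETURNN code): the flow tensor is exposed as TensorArray.flow, a scalar float whose value carries no information:

```python
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()

ta = tf.TensorArray(dtype=tf.float32, size=2)
ta = ta.write(0, tf.constant([1.0, 2.0]))
ta = ta.write(1, tf.constant([3.0, 4.0]))

# The "flow" is a scalar float tensor; its value carries no information.
# It only exists so that reads/writes on the TensorArray are ordered correctly
# (and so the gradient machinery can track dependencies).
print(ta.flow)        # a scalar float32 tensor
stacked = ta.stack()  # depends on the flow, but not on its value
```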
So this means the test case might not actually reproduce the problem. Also, the check_numerics output of RETURNN definitely reports the wrong tensors. So we need some further debugging on the original failing/unstable RETURNN setup... Or further extend the test case. Actually, the strange hiccups (loss going up and down) I have seen in the test case should not happen.
@critias @pavelgolik @manish-kumar-garg maybe you can try with TF 2?
Note: I was able to reproduce the problem again, with a slightly adapted test case (current version). This test case now removes the check_numerics calls (which get false positives for TensorArray), but is able to reproduce getting nan after a few steps:
step 19, loss: 0.94387317
step 20, loss: 0.93824387
step 21, loss: 0.9327911
step 22, loss: 0.92746085
ERR, loss invalid: nan
step 23, loss: nan
ERR, loss invalid: nan
step 24, loss: nan
It was crucial in the test to use LSTMBlockCell. That is interesting, because the inner matmul inside LSTMBlockCell uses the same implementation that we also use for our native LSTM cell. Maybe this way of calling matmul is somehow not correct.
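For reference, a minimal sketch of how LSTMBlockCell is typically used (assuming TF 1.x, where it lives under tf.contrib.rnn; the sizes here are made up and not taken from the actual test case):

```python
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()

n_batch, n_time, n_in, n_hidden = 8, 20, 10, 32   # hypothetical sizes
x = tf.compat.v1.placeholder(tf.float32, (n_batch, n_time, n_in))

# The fused-kernel LSTM cell (creates LSTMBlockCell / LSTMBlockCellGrad ops).
cell = tf.contrib.rnn.LSTMBlockCell(num_units=n_hidden)
outputs, state = tf.compat.v1.nn.dynamic_rnn(cell, x, dtype=tf.float32)

# Build the backward pass, which is where the LSTMBlockCellGrad ops appear.
loss = tf.reduce_sum(outputs)
grads = tf.gradients(loss, tf.compat.v1.trainable_variables())
```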
Also very interesting: the line energy = tf.where(energy_mask, energy, float("-inf") * tf.ones_like(energy)) is also important to trigger the problem. Without it, the problem does not happen. Random hypothesis: this line puts some inf/nan into GPU memory, which is later released (but not reset), and matmul somehow makes use of that uninitialized GPU memory and now picks up this nan...
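For context, in isolation this masking pattern looks roughly like the following (sizes are made up, not from the test case):

```python
import tensorflow as tf

tf.compat.v1.disable_v2_behavior()

n_batch, n_time = 8, 20                            # hypothetical sizes
energy = tf.random.normal([n_batch, n_time])       # attention energies
seq_lens = tf.constant([20, 15, 10, 20, 5, 20, 18, 12])
energy_mask = tf.sequence_mask(seq_lens, maxlen=n_time)

# Positions beyond the sequence length get -inf, so softmax assigns them weight 0.
energy = tf.where(energy_mask, energy, float("-inf") * tf.ones_like(energy))
att_weights = tf.nn.softmax(energy)
```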
Also, this only happens with TF 1.15 now. Specifically: TF 1.15.0, CUDA 10.1, GeForce GTX 1080 Ti, i6 env. It does not happen with TF 1.14.0 or TF 2.2.0.
For reference, with LSTMBlockCell, I also got this inf detection once (but not sure whether it really originates from there):
(1) Invalid argument: gradients/s/LSTMBlockCell_1_grad/LSTMBlockCellGrad:1 : Tensor had Inf values
[[node add_check_numerics_ops/gradients/s/LSTMBlockCell_1_grad/LSTMBlockCellGrad_check_numerics_1 (defined at /work/tools/asr/python/3.8.0_tf_1.15-generic+cuda10.1/lib/python3.8/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
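A quick way to check which output the :1 refers to (a sketch assuming TF 1.x, where the private op_def_registry module still provides get_registered_ops(); importing tf.contrib.rnn should ensure the LSTMBlock ops are registered):

```python
import tensorflow as tf
import tensorflow.contrib.rnn  # noqa: F401 -- registers the LSTMBlock ops if needed
from tensorflow.python.framework import op_def_registry

op_def = op_def_registry.get_registered_ops()["LSTMBlockCellGrad"]
# Output index 1 should be the "dicfo" tensor referenced in the error above.
print([arg.name for arg in op_def.output_arg])
```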
Some code links for reference:
- Op definitions (REGISTER_OP) of LSTMBlockCell etc. here (v1.14.0). E.g. LSTMBlockCellGrad:1 corresponds to .Output("dicfo: T").
- Op kernel registrations (REGISTER_KERNEL_BUILDER) of LSTMBlockCell etc. here (v1.14.0). LSTMBlockCellOp calls functor::LSTMBlockCellFprop<Device, T, USE_CUBLAS>. LSTMBlockCellGradOp does not use matmul; it wraps functor::LSTMBlockCellBprop<Device, T, USE_CUBLAS>. It also implements LSTMBlockCellBprop<CPUDevice, T, false /* USE_CUBLAS */>, which calls LSTMBlockCellBpropWithEigen.
- LSTMBlockCellBprop<GPUDevice, T, true> etc. here (v1.14.0). LSTMBlockCellFpropWithCUDA calls TensorBlasGemm. LSTMBlockCellBprop<GPUDevice, T, true /* USE_CUBLAS */> just calls LSTMBlockCellBpropWithCUDA, and that launches the CUDA kernel lstm_gates_bprop.
- TensorBlasGemm, and TensorBlasGemm<Device, T, true /* USE_CUBLAS */> calls TensorCuBlasGemm.
- TensorCuBlasGemm calls ctx->op_device_context()->stream()->ThenBlasGemm.
- MatMul op kernel implementation here (v1.15.0). LaunchMatMul<GPUDevice, T, true /* USE_CUBLAS */> does stream->ThenBlasGemm or many other possible variants.

Hi, I have a problem when training the setup on librispeech:
pretrain epoch 11, step 628, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9795918166637421, error:decision 0.0, error:output/output_prob 0.9795918166637421, loss nan, max_size:classes 55, max_size:data 1623, mem_usage:GPU:0 3.8GB, num_seqs 12, 0.858 sec/step, elapsed 0:08:40, exp. remaining 0:05:55, complete 59.41%
pretrain epoch 11, step 629, cost:ctc nan, cost:output/output_prob nan, error:ctc 0.9792060726322233, error:decision 0.0, error:output/output_prob 0.9792060726322233, loss nan, max_size:classes 59, max_size:data 1709, mem_usage:GPU:0 3.8GB, num_seqs 11, 0.888 sec/step, elapsed 0:08:41, exp. remaining 0:05:54, complete 59.49%
But if I do the same on CPU, there are no problems.
I used TF 1.15, CUDA 10.0, Python 3.6 on my own computer.
I attach the log and the log for the run on CPU.
@valerie94 Can you try TF 1.14 and TF 2.2 and report whether you see the same problem?
@albertz This problem doesn't occur when I use TF 1.14 and TF 2.2. In my case it only happens with TF 1.15.
Thanks, good to know. So we can just recommend using TF 1.14 or TF >=2.2.
It would crash at an early step in the first epoch with a message like the ones quoted above.
Multiple people have independently reported this (@critias, @patrick-wilken and myself, @albertz).
Maybe related: #253
Upstream report: tensorflow #40228
Our cases: (/usr/local/cuda-10.1_new/lib64/libcublas.so.10, /usr/local/cudnn-10.1-v7.6/lib64/libcudnn.so.7.6.3)
It is not clear at this point whether this is something wrong in RETURNN, a bug in TF, a bug in CUDA, or whether some (unlucky but correct) property of TF 1.15 just leads to such instability more often.
It was mostly reported for RNNs, regardless of the LSTM implementation (NativeLstm2, LSTMBlock and StandardLSTM can all cause this). The same TF 1.15 has worked fine so far for Transformer models.
So maybe it is just some bad luck. But it seems to happen in many cases, so it would be good to find a workaround. Maybe it is also a bug (in TF or in RETURNN).