get Segmentation fault when training

etyhh commented 3 years ago

Hi, the error log is below:

Starting training.
Performing evaluation.
loss
Tensor("transducer/dense_1/BiasAdd:0", shape=(None, None, None, 3971), dtype=float32, device=/job:localhost/replica:0/task:0/device:GPU:0) Tensor("dist_inputs_4:0", shape=(None, None), dtype=int32) Tensor("Cast:0", shape=(None,), dtype=int32, device=/job:localhost/replica:0/task:0/device:GPU:0) Tensor("dist_inputs_3:0", shape=(None,), dtype=int32)
Fatal Python error: Segmentation fault

Thread 0x00007f6989132700 (most recent call first):
  File "/usr/lib64/python3.6/threading.py", line 295 in wait
  File "/usr/lib64/python3.6/threading.py", line 551 in wait
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 978 in run
  File "/usr/lib64/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib64/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007f6989933700 (most recent call first):
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654 in _create_c_op
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1817 in init
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3327 in _create_op_internal
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 595 in _create_op_internal
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 744 in _apply_op_helper
  File "", line 81 in warp_rnnt
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 348 in _call_unconverted
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 534 in converted_call
  File "/tmp/tmp7wdbpl1g.py", line 11 in tf__rnnt_loss
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 587 in converted_call
  File "/tmp/tmpbbrruc7p.py", line 30 in tf___loss_fn
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 585 in converted_call
  File "/tmp/tmp5y46mg16.py", line 25 in step_fn
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 998 in run
  File "/usr/lib64/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/usr/lib64/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007f6d24153740 (most recent call first):
  File "/usr/lib64/python3.6/threading.py", line 295 in wait
  File "/usr/lib64/python3.6/threading.py", line 551 in wait
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 165 in _call_for_each_replica
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 770 in _call_for_each_replica
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2290 in call_for_each_replica
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/distribute/distribute_lib.py", line 951 in run
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 346 in _call_unconverted
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 492 in converted_call
  File "/tmp/tmp5y46mg16.py", line 66 in tf__eval_step
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/autograph/impl/api.py", line 585 in converted_call
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 964 in wrapper
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 441 in wrapped_fn
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 981 in func_graph_from_py_func
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2667 in _create_graph_function
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2777 in _maybe_define_function
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2446 in _get_concrete_function_internal_garbage_collected
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 506 in _initialize
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 627 in _call
  File "/home/zhangqin/.local/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580 in call
  File "run_rnnt.py", line 434 in run_evaluate
  File "run_rnnt.py", line 312 in checkpoint_model
  File "run_rnnt.py", line 347 in run_training
  File "run_rnnt.py", line 547 in main
  File "/home/zhangqin/.local/lib/python3.6/site-packages/absl/app.py", line 251 in _run_main
  File "/home/zhangqin/.local/lib/python3.6/site-packages/absl/app.py", line 300 in run
  File "run_rnnt.py", line 588 in
Segmentation fault (core dumped)

The code that caused the segmentation fault:

    print(y_pred, y_true, spec_lengths, label_lengths)
    loss = rnnt_loss(y_pred, y_true, spec_lengths, label_lengths)
    print('l f')

Thanks
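The per-thread dump in the log above has the shape of what Python's standard faulthandler module prints on a fatal signal ("Fatal Python error", then one "most recent call first" stack per thread). For anyone trying to capture a similar report from their own script, here is a minimal sketch using only the standard library. The helper names enable_crash_tracebacks and dump_now are made up for illustration and are not part of run_rnnt.py.

```python
import faulthandler
import tempfile


def enable_crash_tracebacks(log_file):
    """Ask Python to dump every thread's stack to log_file on a fatal signal.

    log_file must be a real file (it needs a file descriptor) and must stay
    open for as long as the handler is installed.
    """
    faulthandler.enable(file=log_file, all_threads=True)
    return faulthandler.is_enabled()


def dump_now(log_file):
    """Write the current stacks on demand, in the same layout as a crash dump."""
    faulthandler.dump_traceback(file=log_file, all_threads=True)
    log_file.flush()


if __name__ == "__main__":
    with tempfile.TemporaryFile(mode="w+") as log:
        print("enabled:", enable_crash_tracebacks(log))
        dump_now(log)
        log.seek(0)
        print(log.read())
        # Uninstall the handler before the file is closed.
        faulthandler.disable()
```

The dump only shows the Python frames that were active when the native code crashed; it does not say why the native op failed.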

ryantang1993 commented 3 years ago

I also met the same problem. Have you found a solution?

etyhh commented 3 years ago

> I also met the same problem. Have you found a solution?

I found that in the code below, y_pred is a 'tensorflow.python.framework.ops.Tensor':

    loss = rnnt_loss(y_pred, y_true, spec_lengths, label_lengths)

When I changed the RNN to a DNN, y_pred became a 'tensorflow.python.framework.ops.EagerTensor' and the segmentation fault disappeared. I'm now working on keeping the RNN while still getting an EagerTensor.

noahchalifour commented 3 years ago

@etyhh What version of TensorFlow are you using?

etyhh commented 3 years ago

> @etyhh What version of TensorFlow are you using?

tensorflow-gpu==2.2.0

ryantang1993 commented 3 years ago

When I put print(tf.executing_eagerly()) before loss = rnnt_loss(y_pred, y_true, spec_lengths, label_lengths), I got False. That is to say, execution is no longer eager by the time the loss function is called.

etyhh commented 3 years ago

> When I set print(tf.executing_eagerly()) before loss = rnnt_loss(y_pred, y_true, spec_lengths, label_lengths), got False, that is to say, the eager mode changed in the loss function.

I tried adding tf.config.experimental_run_functions_eagerly(True) at the beginning of run_rnnt.py and loss.py. Before loss = rnnt_loss(), print(tf.executing_eagerly()) now returns True, but print(type(y_pred)) still returns 'tensorflow.python.framework.ops.Tensor'.
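The two checks used in this comment (tf.executing_eagerly() and type(y_pred)) can be tried in isolation. The sketch below assumes a TensorFlow 2.x install and only illustrates how the checks behave outside and inside a tf.function. It does not involve rnnt_loss or a distribution strategy, and it does not reproduce the crash. The names report, observations and traced_fn are made up for this example.

```python
import tensorflow as tf

# Collected as (label, tf.executing_eagerly(), tensor class name).
observations = []


def report(label, x):
    """Record whether we are executing eagerly and what kind of tensor x is."""
    observations.append((label, tf.executing_eagerly(), type(x).__name__))


@tf.function
def traced_fn(x):
    # This Python body runs while TensorFlow traces the function into a graph,
    # so x is a symbolic graph tensor here, not an eager one.
    report("inside tf.function", x)
    return x * 2.0


x = tf.constant([1.0, 2.0])
report("plain Python", x)   # eager context, concrete tensor
traced_fn(x)                # triggers tracing, which calls report() once

for row in observations:
    print(row)
```

The class name reported inside the traced function varies between TensorFlow versions, so it is worth printing rather than assuming.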

ChristopheZhao commented 3 years ago

I encountered the same error, and I assume it comes from rnnt_loss. I have tried a few things, but none of them worked. Has anyone fixed it?

etyhh commented 3 years ago

Changing tf.compat.v1.nn.rnn_cell.LSTMCell to tf.keras.layers.LSTMCell works for me, but tf.keras.layers.LSTMCell doesn't support projection.
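For readers unfamiliar with the term: in the v1 cell, projection is switched on with the num_proj argument, which maps the cell output to a smaller vector that is then used as the recurrent state. The following is a rough NumPy sketch of a single step of an LSTM with such a projection, without peepholes. The function name lstmp_step, the gate order and the weight layout are assumptions chosen for readability and do not match TensorFlow's internal layout.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstmp_step(x, r_prev, c_prev, W, U, b, W_proj):
    """One step of an LSTM whose output is linearly projected.

    Shapes (hypothetical layout):
      x: (batch, input_dim)      r_prev: (batch, proj_dim)
      c_prev: (batch, units)     W: (input_dim, 4 * units)
      U: (proj_dim, 4 * units)   b: (4 * units,)
      W_proj: (units, proj_dim)
    """
    # The recurrent input is the *projected* output from the previous step.
    z = x @ W + r_prev @ U + b
    i, f, g, o = np.split(z, 4, axis=1)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    m = sigmoid(o) * np.tanh(c)      # full-size output, (batch, units)
    r = m @ W_proj                   # projected output, (batch, proj_dim)
    return r, c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, input_dim, units, proj_dim = 2, 5, 8, 3
    x = rng.normal(size=(batch, input_dim))
    r0 = np.zeros((batch, proj_dim))
    c0 = np.zeros((batch, units))
    W = rng.normal(size=(input_dim, 4 * units))
    U = rng.normal(size=(proj_dim, 4 * units))
    b = np.zeros(4 * units)
    W_proj = rng.normal(size=(units, proj_dim))
    r1, c1 = lstmp_step(x, r0, c0, W, U, b, W_proj)
    print(r1.shape, c1.shape)
```

Note that placing a Dense layer after a Keras LSTM layer is not the same thing, because there the recurrent connection still uses the full-size output rather than the projected one.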

noahchalifour / rnnt-speech-recognition

get Segmentation fault when training #42