Open · vieting opened this issue 12 months ago
Ah, that's just in help_on_tf_exception, which is not critical (help_on_tf_exception is itself for debugging only, to print some additional information, and for some reason it fails). But it means there was another actual exception happening before. Can you post the full log?
Sure, the full log is here:
See also /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn.log to avoid the broken color codes here.
I created a script to reproduce the error:
vieting@cn-285:/work/asr4/vieting/tmp/20231108_tf213_sprint_op $ ./run_example.sh
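As a side note on the broken color codes: those are ANSI SGR escape sequences in the raw log. A small generic helper (not part of RETURNN, shown only as a convenience) strips them before pasting a log into an issue:

```python
import re

# ANSI SGR sequences look like ESC [ <numbers;> m, e.g. "\x1b[31;1m" (bold red) or "\x1b[0m" (reset).
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text: str) -> str:
    """Remove color codes so the log is plain text."""
    return ANSI_SGR.sub("", text)

print(strip_ansi("\x1b[31;1mEXCEPTION\x1b[0m \x1b[34mTraceback (most recent call last):\x1b[0m"))
# -> EXCEPTION Traceback (most recent call last):
```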
We encountered this bug and there is a patch for it. Daniel wanted to do a PR.
On Wed, Nov 8, 2023, 12:25, vieting wrote:
Sure, the full log is here:
RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-17-46 (UTC+0100), pid 1212279, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-04 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) (
in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is not set. Collecting TensorFlow device list... Local devices available to TensorFlow: 1/1: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 3855380559335333431 xla_global_id: -1 Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-18-11 (UTC+0100), pid 3325131, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-285 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is set to '2'. Collecting TensorFlow device list... Local devices available to TensorFlow: 1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 7046766875533982763 xla_global_id: -1 2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10089005056 locality { bus_id: 1 links { } } incarnation: 14158601620701111509 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5" xla_global_id: 416903419 Using gpu device 2: NVIDIA GeForce RTX 2080 Ti Hostname 'cn-285', GPU 2, GPU-dev-name 'NVIDIA GeForce RTX 2080 Ti', GPU-memory 9.4GB Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown Learning-rate-control: file learning_rates.swb.ctc does not exist yet Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ... layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32 layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32 layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32 DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input. This will be disallowed with behavior_version 8. 
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32 layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed This will be disallowed with behavior_version 6. layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channelconv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32 layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channelconv_l:channel'(750),F|F'conv_1:channel'(32)] float32 layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channelconv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32 layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channelconv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32 layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channelconv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32 layer /'conv_merged': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channelconv_l:channel//2)*conv_3:channel'(24000)] float32 layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_1': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32 layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32 layer /'conformer_1_conv_mod_depthwise_conv': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 DEPRECATION WARNING: batch_norm masked_time should be specified explicitly This will be disallowed with behavior_version 12. layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32 layer /'conformer_1_mhsa_mod_self_attention': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer 
/'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32 Network layer topology: extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'} used data keys: ['data', 'seq_tag'] layers: layer batch_norm 'conformer_1_conv_mod_bn' #: 512 layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512 layer copy 'conformer_1_conv_mod_dropout' #: 512 layer gating 'conformer_1_conv_mod_glu' #: 512 layer layer_norm 'conformer_1_conv_mod_ln' #: 512 layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024 layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512 layer combine 'conformer_1_conv_mod_res_add' #: 512 layer activation 'conformer_1_conv_mod_swish' #: 512 layer copy 'conformer_1_ffmod_1_dropout' #: 512 layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_1_half_res_add' #: 512 layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_1_ln' #: 512 layer copy 'conformer_1_ffmod_2_dropout' #: 512 layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_2_half_res_add' #: 512 layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_2_ln' #: 512 layer linear 'conformer_1_mhsa_mod_att_linear' #: 512 layer copy 'conformer_1_mhsa_mod_dropout' #: 512 layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512 layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64 layer combine 'conformer_1_mhsa_mod_res_add' #: 512 layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512 layer layer_norm 'conformer_1_output' #: 512 layer conv 'conv_1' #: 32 layer pool 'conv_1_pool' #: 32 layer conv 'conv_2' #: 64 layer conv 'conv_3' #: 64 layer merge_dims 'conv_merged' #: 24000 layer split_dims 'conv_source' #: 1 layer source 'data' #: 1 layer copy 'encoder' #: 512 layer subnetwork 'features' #: 750 layer conv 'features/conv_h' #: 150 layer eval 'features/conv_h_act' #: 150 layer variable 'features/conv_h_filter' #: 150 layer split_dims 'features/conv_h_split' #: 1 layer conv 'features/conv_l' #: 5 layer layer_norm 'features/conv_l_act' #: 750 layer eval 'features/conv_l_act_no_norm' #: 750 layer merge_dims 'features/conv_l_merge' #: 750 layer copy 'features/output' #: 750 layer copy 'input_dropout' #: 512 layer linear 'input_linear' #: 512 layer softmax 'output' #: 88 layer eval 'specaug' #: 750 net params #: 18473980 net trainable params: [<tf.Variable 'conformer_1_conv_mod_bn/batch_norm/conformer_1_conv_mod_bn_conformer_1_conv_mod_bn_output_beta:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_bn/batch_norm/conformer_1_conv_mod_bn_conformer_1_conv_mod_bn_output_gamma:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_depthwise_conv/W:0' shape=(32, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_depthwise_conv/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_1/W:0' shape=(512, 
1024) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_1/b:0' shape=(1024,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_2/W:0' shape=(512, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_2/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_dropout_linear/W:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_dropout_linear/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_linear_swish/W:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_linear_swish/b:0' shape=(2048,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_dropout_linear/W:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_dropout_linear/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_linear_swish/W:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_linear_swish/b:0' shape=(2048,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_att_linear/W:0' shape=(512, 512) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_relpos_encoding/encoding_matrix:0' shape=(65, 64) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_self_attention/QKV:0' shape=(512, 1536) dtype=float32>, <tf.Variable 'conformer_1_output/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_output/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conv_1/W:0' shape=(3, 3, 1, 32) dtype=float32>, <tf.Variable 'conv_1/bias:0' shape=(32,) dtype=float32>, <tf.Variable 'conv_2/W:0' shape=(3, 3, 32, 64) dtype=float32>, <tf.Variable 'conv_2/bias:0' shape=(64,) dtype=float32>, <tf.Variable 'conv_3/W:0' shape=(3, 3, 64, 64) dtype=float32>, <tf.Variable 'conv_3/bias:0' shape=(64,) dtype=float32>, <tf.Variable 'features/conv_h_filter/conv_h_filter:0' shape=(128, 1, 150) dtype=float32>, <tf.Variable 'features/conv_l/W:0' shape=(40, 1, 1, 5) dtype=float32>, <tf.Variable 'features/conv_l_act/bias:0' shape=(750,) dtype=float32>, <tf.Variable 'features/conv_l_act/scale:0' shape=(750,) dtype=float32>, <tf.Variable 'input_linear/W:0' shape=(24000, 512) dtype=float32>, <tf.Variable 'output/W:0' shape=(512, 88) dtype=float32>, <tf.Variable 'output/b:0' shape=(88,) dtype=float32>] start training at epoch 1 using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128 learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None pretrain: None start epoch 1 with learning rate 1.325e-05 ... 
TF: log_dir: output/models/train-2023-11-08-11-18-11 Create optimizer <class 'returnn.tf.updater.NadamOptimizer'> with options {'epsilon': 1e-08, 'learning_rate': <tf.Variable 'learning_rate:0' shape=() dtype=float32>}. Initialize optimizer (default) with slots ['m', 'v']. These additional variable were created by the optimizer: [<tf.Variable 'optimize/gradients/conformer_1_conv_mod_bn/batch_norm/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_bn/batch_norm/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_depthwise_conv/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(32, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_depthwise_conv/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_1/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 1024) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_1/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(1024,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_2/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_2/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_dropout_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_dropout_linear/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_linear_swish/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_linear_swish/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_dropout_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_dropout_linear/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_linear_swish/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_linear_swish/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 
'optimize/gradients/conformer_1_ffmod_2_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_att_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_relpos_encoding/Gather_grad/Reshape_accum_grad/var_accum_grad:0' shape=(65, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_self_attention/dot/MatMul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512, 1536) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_output/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_output/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_1/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 1, 32) dtype=float32>, <tf.Variable 'optimize/gradients/conv_1/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(32,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_2/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 32, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conv_2/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(64,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_3/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 64, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conv_3/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(64,) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_h/convolution/ExpandDims_1_grad/Reshape_accum_grad/var_accum_grad:0' shape=(128, 1, 150) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l/convolution_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(40, 1, 1, 5) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l_act/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(750,) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l_act/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(750,) dtype=float32>, <tf.Variable 'optimize/gradients/input_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(24000, 512) dtype=float32>, <tf.Variable 'optimize/gradients/output/linear/dot/MatMul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512, 88) dtype=float32>, <tf.Variable 'optimize/gradients/output/linear/add_bias_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(88,) dtype=float32>, <tf.Variable 'optimize/apply_grads/accum_grad_multiple_step/beta1_power:0' shape=() dtype=float32>, <tf.Variable 'optimize/apply_grads/accum_grad_multiple_step/beta2_power:0' shape=() dtype=float32>]. 
SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--.python-control-enabled=true', '--.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--.pymod-name=returnn.sprint.control', '--.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--.configuration.channel=output-channel', '--.real-time-factor.channel=output-channel', '--.system-info.channel=output-channel', '--.time.channel=output-channel', '--.version.channel=output-channel', '--.log.channel=output-channel', '--.warning.channel=output-channel,', 'stderr', '--.error.channel=output-channel,', 'stderr', '--.statistics.channel=output-channel', '--.progress.channel=output-channel', '--.dot.channel=nil', '--.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--.model-combination.acoustic-model.state-tying.type=lookup', '--.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--.model-combination.acoustic-model.allophones.add-all=yes', '--.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--.model-combination.acoustic-model.hmm.states-per-phone=1', '--.model-combination.acoustic-model.hmm.state-repetitions=1', '--.model-combination.acoustic-model.hmm.across-word-model=yes', '--.model-combination.acoustic-model.hmm.early-recombination=no', '--.model-combination.acoustic-model.tdp.scale=1.0', '--.model-combination.acoustic-model.tdp..loop=0.0', '--.model-combination.acoustic-model.tdp..forward=0.0', '--.model-combination.acoustic-model.tdp..skip=infinity', '--.model-combination.acoustic-model.tdp..exit=0.0', '--.model-combination.acoustic-model.tdp.silence.loop=0.0', '--.model-combination.acoustic-model.tdp.silence.forward=0.0', '--.model-combination.acoustic-model.tdp.silence.skip=infinity', '--.model-combination.acoustic-model.tdp.silence.exit=0.0', '--.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--.model-combination.acoustic-model.phonology.history-length=0', '--.model-combination.acoustic-model.phonology.future-length=0', '--.transducer-builder-filter-out-invalid-allophones=yes', '--.fix-allophone-context-at-word-boundaries=yes', '--.allophone-state-graph-builder.topology=ctc', '--.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--.encoding=UTF-8', '--.output-channel.file=$(LOGFILE)', '--.output-channel.compressed=no', '--.output-channel.append=no', '--.output-channel.unbuffered=no', '--.LOGFILE=nn-trainer.loss.log', '--.TASK=1'] SprintSubprocessInstance: starting, pid 3325822 SprintSubprocessInstance: Sprint child process (['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', 
'--.python-control-enabled=true', '--.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--.pymod-name=returnn.sprint.control', '--.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--.configuration.channel=output-channel', '--.real-time-factor.channel=output-channel', '--.system-info.channel=output-channel', '--.time.channel=output-channel', '--.version.channel=output-channel', '--.log.channel=output-channel', '--.warning.channel=output-channel,', 'stderr', '--.error.channel=output-channel,', 'stderr', '--.statistics.channel=output-channel', '--.progress.channel=output-channel', '--.dot.channel=nil', '--.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--.model-combination.acoustic-model.state-tying.type=lookup', '--.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--.model-combination.acoustic-model.allophones.add-all=yes', '--.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--.model-combination.acoustic-model.hmm.states-per-phone=1', '--.model-combination.acoustic-model.hmm.state-repetitions=1', '--.model-combination.acoustic-model.hmm.across-word-model=yes', '--.model-combination.acoustic-model.hmm.early-recombination=no', '--.model-combination.acoustic-model.tdp.scale=1.0', '--.model-combination.acoustic-model.tdp..loop=0.0', '--.model-combination.acoustic-model.tdp..forward=0.0', '--.model-combination.acoustic-model.tdp..skip=infinity', '--.model-combination.acoustic-model.tdp..exit=0.0', '--.model-combination.acoustic-model.tdp.silence.loop=0.0', '--.model-combination.acoustic-model.tdp.silence.forward=0.0', '--.model-combination.acoustic-model.tdp.silence.skip=infinity', '--.model-combination.acoustic-model.tdp.silence.exit=0.0', '--.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--.model-combination.acoustic-model.phonology.history-length=0', '--.model-combination.acoustic-model.phonology.future-length=0', '--.transducer-builder-filter-out-invalid-allophones=yes', '--.fix-allophone-context-at-word-boundaries=yes', '--.allophone-state-graph-builder.topology=ctc', '--.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--.encoding=UTF-8', '--.output-channel.file=$(LOGFILE)', '--.output-channel.compressed=no', '--.output-channel.append=no', '--.output-channel.unbuffered=no', '--.LOGFILE=nn-trainer.loss.log', '--.TASK=1']) caused an exception. TensorFlow exception: Graph execution error: Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in
main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in init self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in init self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = 
self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child ret = self._read()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read return Unpickler(p).load()
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in call ret = func(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch instance = self._get_instance(i)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance self._maybe_create_new_instance()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in init self.init()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init self._start_child()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]] (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child ret = self._read()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read return Unpickler(p).load()
EOFError: Ran out of input
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in call ret = func(*args)
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch instance = self._get_instance(i)
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance self._maybe_create_new_instance()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in init self.init()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init self._start_child()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored.
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in
main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in init self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func return func(*args, kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler return fn(*args, *kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler return dispatch_target(args, kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func return py_func_common(func, inp, Tout, stateful, name=name) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common return _internal_py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func result = gen_script_ops.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in pyfunc , _, _op, _outputs = _op_def_library._apply_op_helper( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal ret = Operation.from_node_def( Exception UnknownError() in step 0. (pid 3325131) Failing op: <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> We tried to fetch the op inputs ([<tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>]) but got another exception: target_op <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>, ops [<tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder>] �[31;1mEXCEPTION�[0m �[34mTraceback (most recent call last):�[0m �[34;1mFile�[0m �[36m"/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/�[0m�[36;1msession.py�[0m�[36m"�[0m, �[34mline�[0m �[35m1379�[0m, �[34min�[0m BaseSession._do_call �[34mline:�[0m �[34mreturn�[0m fn�[34m(�[0m�[34m*�[0margs�[34m)�[0m �[34mlocals:�[0m fn �[34;1m=�[0m �[34m
�[0m �[34m<�[0mfunction BaseSession�[34m.�[0m_do_run�[34m.�[0m�[34m<�[0mlocals�[34m>�[0m�[34m.�[0m_run_fn at 0x7f2192d77d30�[34m>�[0m args �[34;1m=�[0m �[34m �[0m �[34m(�[0m�[34m{�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2422de3eb0�[34m>�[0m�[34m:�[0m array�[34m(�[0m�[34m[�[0m�[34m[�[0m�[34m[�[0m�[34m-�[0m0�[34m.�[0m05505638�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m09610788�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m05115783�[34m]�[0m�[34m,�[0m �[34m.�[0m�[34m.�[0m�[34m.�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m[�[0m�[34m-�[0m0�[34m.�[0m00226238�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m01049833�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m00...
�[34;1mFile�[0m �[36m"/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/�[0m�[36;1msession.py�[0m�[36m"�[0m, �[34mline�[0m �[35m1362�[0m, �[34min�[0m BaseSession._do_run.
._run_fn �[34mline:�[0m �[34mreturn�[0m self�[34m.�[0m_call_tf_sessionrun�[34m(�[0moptions�[34m,�[0m feed_dict�[34m,�[0m fetch_list�[34m,�[0m target_list�[34m,�[0m run_metadata�[34m)�[0m �[34mlocals:�[0m self �[34;1m=�[0m �[34m �[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0msession�[34m.�[0mSession object at 0x7f2571096ac0�[34m>�[0m self�[34;1m.�[0m_call_tf_sessionrun �[34;1m=�[0m �[34m �[0m �[34m<�[0mbound method BaseSession�[34m.�[0m_call_tf_sessionrun of �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0msession�[34m.�[0mSession object at 0x7f2571096ac0�[34m>�[0m�[34m>�[0m options �[34;1m=�[0m �[34m �[0m �[34mNone�[0m feed_dict �[34;1m=�[0m �[34m �[0m �[34m{�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2422de3eb0�[34m>�[0m�[34m:�[0m array�[34m(�[0m�[34m[�[0m�[34m[�[0m�[34m[�[0m�[34m-�[0m0�[34m.�[0m05505638�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m09610788�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m05115783�[34m]�[0m�[34m,�[0m �[34m.�[0m�[34m.�[0m�[34m.�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m[�[0m�[34m-�[0m0�[34m.�[0m00226238�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m01049833�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m001... fetch_list �[34;1m=�[0m �[34m<local>�[0m �[34m[�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f24250d81b0�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2423f96cf0�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2423b01830�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Ou... target_list �[34;1m=�[0m �[34m<local>�[0m �[34m[�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Operation object at 0x7f24080fa970�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Operation object at 0x7f24080fa930�[34m>�[0m�[34m]�[0m run_metadata �[34;1m=�[0m �[34m<local>�[0m �[34mNone�[0m
�[34;1mFile�[0m �[36m"/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/�[0m�[36;1msession.py�[0m�[36m"�[0m, �[34mline�[0m �[35m1455�[0m, �[34min�[0m BaseSession._call_tf_sessionrun �[34mline:�[0m �[34mreturn�[0m tf_session�[34m.�[0mTF_SessionRun_wrapper�[34m(�[0mself�[34m.�[0m_session�[34m,�[0m options�[34m,�[0m feed_dict�[34m,�[0m fetch_list�[34m,�[0m target_list�[34m,�[0m run_metadata�[34m)�[0m �[34mlocals:�[0m tf_session �[34;1m=�[0m �[34m
�[0m �[34m<�[0mmodule �[36m'tensorflow.python.client.pywrap_tf_session'�[0m �[34mfrom�[0m �[36m'/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/pywrap_tf_session.py'�[0m�[34m>�[0m tf_session�[34;1m.�[0mTF_SessionRun_wrapper �[34;1m=�[0m �[34m �[0m �[34m<�[0mbuilt�[34m-�[0m�[34min�[0m method TF_SessionRun_wrapper of PyCapsule object at 0x7f2538137300�[34m>�[0m self �[34;1m=�[0m �[34m �[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0msession�[34m.�[0mSession object at 0x7f2571096ac0�[34m>�[0m self�[34;1m.�[0m_session �[34;1m=�[0m �[34m �[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Session object at 0x7f2423372a70�[34m>�[0m options �[34;1m=�[0m �[34m �[0m �[34mNone�[0m feed_dict �[34;1m=�[0m �[34m �[0m �[34m{�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2422de3eb0�[34m>�[0m�[34m:�[0m array�[34m(�[0m�[34m[�[0m�[34m[�[0m�[34m[�[0m�[34m-�[0m0�[34m.�[0m05505638�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m09610788�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m05115783�[34m]�[0m�[34m,�[0m �[34m.�[0m�[34m.�[0m�[34m.�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m,�[0m �[34m[�[0m 0�[34m.�[0m �[34m]�[0m�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m[�[0m�[34m-�[0m0�[34m.�[0m00226238�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m01049833�[34m]�[0m�[34m,�[0m �[34m[�[0m�[34m-�[0m0�[34m.�[0m001... fetch_list �[34;1m=�[0m �[34m<local>�[0m �[34m[�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f24250d81b0�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2423f96cf0�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Output object at 0x7f2423b01830�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Ou... target_list �[34;1m=�[0m �[34m<local>�[0m �[34m[�[0m�[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Operation object at 0x7f24080fa970�[34m>�[0m�[34m,�[0m �[34m<�[0mtensorflow�[34m.�[0mpython�[34m.�[0mclient�[34m.�[0m_pywrap_tf_session�[34m.�[0mTF_Operation object at 0x7f24080fa930�[34m>�[0m�[34m]�[0m run_metadata �[34;1m=�[0m �[34m<local>�[0m �[34mNone�[0m
UnknownError: 2 root error(s) found.
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
    return Unpickler(p).load()
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__
    self.init()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init
    self._start_child()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
  [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
(1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core
diff --git a/returnn/sprint/error_signals.py b/returnn/sprint/error_signals.py
index 735ac363..1c204e68 100644
--- a/returnn/sprint/error_signals.py
+++ b/returnn/sprint/error_signals.py
@@ -130,7 +130,7 @@ class SprintSubprocessInstance:
 
     def _start_child(self):
         assert self.child_pid is None
-        self.pipe_c2p = self._pipe_open()
+        self.pipe_c2p = self._pipe_open(buffered=True)
         self.pipe_p2c = self._pipe_open()
         args = self._build_sprint_args()
         print("SprintSubprocessInstance: exec", args, file=log.v5)
@@ -169,14 +169,14 @@ class SprintSubprocessInstance:
             raise Exception("SprintSubprocessInstance Sprint init failed")
 
     # noinspection PyMethodMayBeStatic
-    def _pipe_open(self):
+    def _pipe_open(self, buffered=False):
         readend, writeend = os.pipe()
         if hasattr(os, "set_inheritable"):
             # https://www.python.org/dev/peps/pep-0446/
             os.set_inheritable(readend, True)
             os.set_inheritable(writeend, True)
-        readend = os.fdopen(readend, "rb", 0)
-        writeend = os.fdopen(writeend, "wb", 0)
+        readend = os.fdopen(readend, "rb", -bool(buffered))  # -1 is default for buffered
+        writeend = os.fdopen(writeend, "wb", -bool(buffered))
         return readend, writeend
 
     @property
AFAIR, the problem occurs only when running in an apptainer environment. The pipe buffer does not contain all the data, and RETURNN crashes because the RASR automata are truncated / incomplete.
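For illustration, a minimal self-contained sketch of what the patch above changes: only the buffering mode that os.fdopen applies to the raw file descriptors from os.pipe(). The pipe_open helper below is hypothetical and just mirrors the patched _pipe_open; it is not RETURNN's actual code.

import os
import pickle

def pipe_open(buffered=False):
    # Wrap both ends of an inheritable os.pipe() in Python file objects.
    # buffering=0 means unbuffered raw I/O (binary mode only);
    # buffering=-1 lets Python pick a default block buffer size.
    read_fd, write_fd = os.pipe()
    os.set_inheritable(read_fd, True)
    os.set_inheritable(write_fd, True)
    buffering = -1 if buffered else 0
    return os.fdopen(read_fd, "rb", buffering), os.fdopen(write_fd, "wb", buffering)

# Toy round trip in one process; in the real setup the write end lives in the
# Sprint/RASR child, which must flush (or exit cleanly) before the parent can
# unpickle a complete message from the read end.
read_end, write_end = pipe_open(buffered=True)
pickle.dump({"msg": "init_ok"}, write_end)
write_end.flush()  # a buffered writer only hands data to the pipe on flush/close
write_end.close()
print(pickle.load(read_end))
read_end.close()

Whether buffered vs. unbuffered pipe I/O behaves differently under apptainer is exactly what the patch probes; the sketch only makes the two modes and the flush requirement explicit.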
So for reference, the actual error is this:
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
ret = self._read()
File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
return Unpickler(p).load()
EOFError: Ran out of input
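As background: "EOFError: Ran out of input" is what pickle.Unpickler.load() raises when the stream it reads from yields no data at all, i.e. here the Sprint child process apparently closed the pipe or died before writing its init reply. A minimal reproduction of just that error message (standalone, nothing RETURNN-specific):

import io
import pickle

# An empty stream stands in for a pipe whose writer exited without sending anything.
try:
    pickle.Unpickler(io.BytesIO(b"")).load()
except EOFError as exc:
    print("EOFError:", exc)  # -> EOFError: Ran out of input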
I just tested the proposed patch and it does not fix the issue for my example.
Can you link the full patch? It seems incomplete here.
Can you link the full patch? It seems incomplete here.
Sure, just edited the comment.
@vieting I pushed something which should fix this. Can you try?
(For reference, there was also an EOFError in #1363, but I think that was another problem.)
Note: I did not actually test my recent change, as I don't have any setup ready to try this. Please try it out and report if it works.
Just tested and I still get the error.
Log:
@albertz check /work/asr4/vieting/tmp/20231108_tf213_sprint_op/run_example.sh
if you want to test it yourself.
@christophmluscher @NeoLegends does this relate to RASR being compiled with TF 2.13? Do you recognize this error?
Is it maybe a problem that RASR was compiled with my old tf 2.8 image? I still use the same RASR binary with the new image. Loading the automata does not require TF, so I thought I could use the same RASR binary.
@vieting I pushed another small change. Can you try again?
I pushed another small change. Can you try again?
Unfortunately, this still does not fix my example.
Traceback (most recent call last):
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child
ret = self._read()
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
return util.read_pickled_object(p)
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes
I get the same error when using a tf 2.14 image and RASR compiled using that image.
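Judging from the traceback above, read_pickled_object expects a 4-byte size header before the pickled payload and hits EOF before even that header arrives, i.e. the child again wrote nothing at all. A minimal sketch of such length-prefixed pickle framing (the byte order and helper bodies below are assumptions for illustration, not RETURNN's actual util.basic code):

import io
import pickle
import struct

def write_pickled_object(f, obj):
    payload = pickle.dumps(obj)
    f.write(struct.pack(">i", len(payload)))  # 4-byte size header (byte order assumed here)
    f.write(payload)
    f.flush()

def read_pickled_object(f):
    size_raw = f.read(4)
    if len(size_raw) != 4:
        # The case in the traceback above: EOF before even the size header arrived,
        # i.e. the peer process never wrote anything (or died first).
        raise EOFError("expected to read 4 bytes but got EOF after %i bytes" % len(size_raw))
    (size,) = struct.unpack(">i", size_raw)
    return pickle.loads(f.read(size))

buf = io.BytesIO()
write_pickled_object(buf, {"msg": "init_ok"})
buf.seek(0)
print(read_pickled_object(buf))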
Is that the original stdout + stderr, or just the log?
It looks a bit like RASR does not start correctly at all. You should then see this on stdout, e.g.:
print("RETURNN SprintControl[pid %i] Python module load" % os.getpid())
And then:
print(
(
"RETURNN SprintControl[pid %i] init: "
"name=%r, sprint_unit=%r, version_number=%r, callback=%r, ref=%r, config=%r, kwargs=%r"
)
% (os.getpid(), name, sprint_unit, version_number, callback, reference, config, kwargs)
)
If you don't see that, then my recent fixes, and also Tina's patch, are not really related to your issue at all.
You should check the RASR log then. There should be some error from RASR, probably Python-related, maybe something like it could not load the module, or a missing import.
What I posted before was from the log. The following is copied from stdout and stderr (with the tf 2.14 image, also used for RASR compilation):
I created an apptainer image with tf 2.13 and tried to run a training with FastBaumWelchLoss. It crashes in step 0 because the get_sprint_automata_for_batch op is not found. The actual error is this: