rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

TF get_sprint_automata_for_batch: RASR segmentation fault in `Speech::CTCTopologyGraphBuilder::addLoopTransition` #1456

Open vieting opened 12 months ago

vieting commented 12 months ago

I created an Apptainer image with TF 2.13 and tried to run a training with FastBaumWelchLoss. It crashes in step 0 because the get_sprint_automata_for_batch op is not found.

```
EXCEPTION
Traceback (most recent call last):
  File ".../returnn/tf/network.py", line 4341, in help_on_tf_exception
    line: debug_fetch, fetch_helpers, op_copied = FetchHelper.copy_graph(
              debug_fetch, target_op=op, fetch_helper_tensors=list(op.inputs),
              stop_at_ts=stop_at_ts, verbose_stream=file,
          )
    locals: debug_fetch = ..., fetch_helpers = ..., op_copied = ..., FetchHelper = ...,
            FetchHelper.copy_graph = ..., target_op = op = ..., fetch_helper_tensors = ...,
            op.inputs = (...,), stop_at_ts = [...], file = ...
  File ".../returnn/tf/util/basic.py", line 7700, in FetchHelper.copy_graph
    line: assert target_op in ops, "target_op %r,\nops\n%s" % (target_op, pformat(ops))
    locals: target_op = ..., ops = [], pformat = ...
AssertionError: target_op ...,
ops
[]
```

The actual error is this:

```
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
    return Unpickler(p).load()
EOFError: Ran out of input
```
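The `EOFError: Ran out of input` here is the generic symptom of the Sprint (RASR) child process dying before it writes its pickled reply back over the pipe: the pipe closes, and the parent's `Unpickler(p).load()` hits end-of-stream. A minimal standalone sketch (hypothetical, not RETURNN code) reproduces the same symptom with a child that crashes via SIGSEGV, standing in for the segfault in `Speech::CTCTopologyGraphBuilder::addLoopTransition` from the issue title:

```python
import pickle
import signal
import subprocess
import sys

# Child process kills itself with SIGSEGV before writing anything to stdout,
# standing in for the crashing RASR nn-trainer subprocess.
child = subprocess.Popen(
    [sys.executable, "-c", "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"],
    stdout=subprocess.PIPE,
)
try:
    # Parent side, analogous to SprintSubprocessInstance._read(): blocks until
    # the pipe closes, then fails because no pickled data ever arrived.
    pickle.Unpickler(child.stdout).load()
except EOFError as exc:
    print("EOFError:", exc)  # -> EOFError: Ran out of input
child.wait()
# On POSIX, a negative returncode encodes the terminating signal.
print(child.returncode == -signal.SIGSEGV)
```

Checking `child.returncode == -signal.SIGSEGV` (i.e. -11 on Linux) is one way to confirm the child actually segfaulted rather than exiting with an ordinary startup error.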
albertz commented 12 months ago

Ah, that's just in help_on_tf_exception, which is not critical (help_on_tf_exception exists only for debugging, to print some additional information, and for some reason it fails here).

But it means another actual exception happened before that. Can you post the full log?

vieting commented 12 months ago

Sure, the full log is here:

``` RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-17-46 (UTC+0100), pid 1212279, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-04 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is not set. Collecting TensorFlow device list... Local devices available to TensorFlow: 1/1: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 3855380559335333431 xla_global_id: -1 Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-18-11 (UTC+0100), pid 3325131, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-285 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is set to '2'. Collecting TensorFlow device list... 
Local devices available to TensorFlow: 1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 7046766875533982763 xla_global_id: -1 2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10089005056 locality { bus_id: 1 links { } } incarnation: 14158601620701111509 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5" xla_global_id: 416903419 Using gpu device 2: NVIDIA GeForce RTX 2080 Ti Hostname 'cn-285', GPU 2, GPU-dev-name 'NVIDIA GeForce RTX 2080 Ti', GPU-memory 9.4GB Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown Learning-rate-control: file learning_rates.swb.ctc does not exist yet Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ... layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32 layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32 layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32 DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input. This will be disallowed with behavior_version 8. 
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32 layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed This will be disallowed with behavior_version 6. layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32 layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_1:channel'(32)] float32 layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32 layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32 layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32 layer /'conv_merged': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channel*conv_l:channel//2)*conv_3:channel'(24000)] float32 layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32 layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32 layer /'conformer_1_conv_mod_depthwise_conv': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 DEPRECATION WARNING: batch_norm masked_time should be specified explicitly This will be disallowed with behavior_version 12. layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32 layer /'conformer_1_mhsa_mod_self_attention': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_dropout': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32 Network layer topology: extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'} used data keys: ['data', 'seq_tag'] layers: layer batch_norm 'conformer_1_conv_mod_bn' #: 512 layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512 layer copy 
'conformer_1_conv_mod_dropout' #: 512 layer gating 'conformer_1_conv_mod_glu' #: 512 layer layer_norm 'conformer_1_conv_mod_ln' #: 512 layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024 layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512 layer combine 'conformer_1_conv_mod_res_add' #: 512 layer activation 'conformer_1_conv_mod_swish' #: 512 layer copy 'conformer_1_ffmod_1_dropout' #: 512 layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_1_half_res_add' #: 512 layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_1_ln' #: 512 layer copy 'conformer_1_ffmod_2_dropout' #: 512 layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_2_half_res_add' #: 512 layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_2_ln' #: 512 layer linear 'conformer_1_mhsa_mod_att_linear' #: 512 layer copy 'conformer_1_mhsa_mod_dropout' #: 512 layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512 layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64 layer combine 'conformer_1_mhsa_mod_res_add' #: 512 layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512 layer layer_norm 'conformer_1_output' #: 512 layer conv 'conv_1' #: 32 layer pool 'conv_1_pool' #: 32 layer conv 'conv_2' #: 64 layer conv 'conv_3' #: 64 layer merge_dims 'conv_merged' #: 24000 layer split_dims 'conv_source' #: 1 layer source 'data' #: 1 layer copy 'encoder' #: 512 layer subnetwork 'features' #: 750 layer conv 'features/conv_h' #: 150 layer eval 'features/conv_h_act' #: 150 layer variable 'features/conv_h_filter' #: 150 layer split_dims 'features/conv_h_split' #: 1 layer conv 'features/conv_l' #: 5 layer layer_norm 'features/conv_l_act' #: 750 layer eval 'features/conv_l_act_no_norm' #: 750 layer merge_dims 'features/conv_l_merge' #: 750 layer copy 'features/output' #: 750 layer copy 'input_dropout' #: 512 layer linear 'input_linear' 
#: 512 layer softmax 'output' #: 88 layer eval 'specaug' #: 750 net params #: 18473980 net trainable params: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ] start training at epoch 1 using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128 learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None pretrain: None start epoch 1 with learning rate 1.325e-05 ... TF: log_dir: output/models/train-2023-11-08-11-18-11 Create optimizer with options {'epsilon': 1e-08, 'learning_rate': }. Initialize optimizer (default) with slots ['m', 'v']. These additional variable were created by the optimizer: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]. 
SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', 
'--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1'] SprintSubprocessInstance: starting, pid 3325822 SprintSubprocessInstance: Sprint child process (['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', 
'--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', 
'--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']) caused an exception. TensorFlow exception: Graph execution error: Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
    return Unpickler(p).load()
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__
    self.init()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init
    self._start_child()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [... identical traceback to (0) ...]
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
  File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in <module>
    main()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
    return dispatch_target(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func
    return py_func_common(func, inp, Tout, stateful, name=name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common
    return _internal_py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func
    result = gen_script_ops.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
    ret = Operation.from_node_def(

Exception UnknownError() in step 0.
(pid 3325131) Failing op: We tried to fetch the op inputs ([]) but got another exception: target_op , ops []
EXCEPTION

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
    line: return fn(*args)
    locals:
      fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f2192d77d30>
      args = ({<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.00...
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.<locals>._run_fn
    line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      self = <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._call_tf_sessionrun = <bound method BaseSession._call_tf_sessionrun of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
      options = None
      feed_dict = {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetch_list = [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      target_list = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
    line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      tf_session = <module 'tensorflow.python.client.pywrap_tf_session' from '/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/pywrap_tf_session.py'>
      tf_session.TF_SessionRun_wrapper = <built-in method TF_SessionRun_wrapper of PyCapsule object at 0x7f2538137300>
      self = <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._session = <tensorflow.python.client._pywrap_tf_session.TF_Session object at 0x7f2423372a70>
      options = None
      feed_dict = {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetch_list = [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      target_list = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
      run_metadata = None
UnknownError: 2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [... same traceback as for root error (0) above ...]
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [... same traceback as above ...]
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

EXCEPTION

Traceback (most recent call last):
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 744, in Runner.run
    line: fetches_results = sess.run(fetches_dict, feed_dict=feed_dict, options=run_options)  # type: typing.Dict[str,typing.Union[numpy.ndarray,str]]
    locals:
      fetches_results =
      sess = <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      sess.run = <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
      fetches_dict = {'size:data:0': <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, 'loss': <tf.Tensor 'objective/add:0' shape=() dtype=float32>, 'cost:output': <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, 'loss_norm_..., len = 8
      feed_dict = {<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      options =
      run_options = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 969, in BaseSession.run
    line: result = self._run(None, fetches, feed_dict, options_ptr, run_metadata_ptr)
    locals:
      result =
      self = <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._run = <bound method BaseSession._run of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
      fetches = {'size:data:0': <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, 'loss': <tf.Tensor 'objective/add:0' shape=() dtype=float32>, 'cost:output': <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, 'loss_norm_..., len = 8
      feed_dict = {<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      options_ptr = None
      run_metadata_ptr = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1192, in BaseSession._run
    line: results = self._do_run(handle, final_targets, final_fetches, feed_dict_tensor, options, run_metadata)
    locals:
      results =
      self = <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._do_run = <bound method BaseSession._do_run of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
      handle = None
      final_targets = [<tf.Operation 'conformer_1_conv_mod_bn/batch_norm/cond/Merge_1' type=Merge>, <tf.Operation 'optim_and_step_incr' type=NoOp>]
      final_fetches = [<tf.Tensor 'objective/add:0' shape=() dtype=float32>, <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, <tf.Tensor 'objective/loss/loss_init/truediv:0' shape=() dtype=float32>, <tf.Tensor 'globals/mem_usage_deviceGPU0:0' shape=() dtype=in...
      feed_dict_tensor = {<Reference wrapping <tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049...
      options = None
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1372, in BaseSession._do_run
    line: return self._do_call(_run_fn, feeds, fetches, targets, options, run_metadata)
    locals:
      self = <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._do_call = <bound method BaseSession._do_call of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
      _run_fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f2192d77d30>
      feeds = {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetches = [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      targets = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
      options = None
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1398, in BaseSession._do_call
    line: raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
    locals:
      type = <class 'type'>
      e =
      node_def = name: "objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch" op: "PyFunc" input: "extern_data/placeholders/seq_tag/seq_tag" attr { key: "token" value { s: "pyfunc_0" } } attr { key: "Tout" value { list { type: DT_INT32 type: DT_FLOAT type: DT_INT...
      op = <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>
      message = 'Graph execution error:\n\nDetected at node \'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch\' defined at (most recent call last):\n File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in <..., len = 14876
UnknownError: Graph execution error:

Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  [... same RETURNN stack frames as in the "Original stack trace" above, from rnn.py down to tf_compat.v1.py_func ...]
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  [... same frames as above ...]
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [... same traceback as for root error (0) above ...]
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
    return Unpickler(p).load()
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return
func(*args, **kwargs) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance self._maybe_create_new_instance() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__ self.init() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init self._start_child() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed 
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler return dispatch_target(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func return py_func_common(func, inp, Tout, stateful, name=name) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common return _internal_py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func result = gen_script_ops.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func _, _, _op, _outputs = _op_def_library._apply_op_helper( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal ret = Operation.from_node_def( During handling of the above exception, another exception occurred: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4341, in help_on_tf_exception line: debug_fetch, fetch_helpers, op_copied = FetchHelper.copy_graph( 
debug_fetch, target_op=op, fetch_helper_tensors=list(op.inputs), stop_at_ts=stop_at_ts, verbose_stream=file, ) locals: debug_fetch =  <tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder> fetch_helpers =  op_copied =  FetchHelper =  <class 'returnn.tf.util.basic.FetchHelper'> FetchHelper.copy_graph =  <bound method FetchHelper.copy_graph of <class 'returnn.tf.util.basic.FetchHelper'>> target_op =  op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> fetch_helper_tensors =  list =  <class 'list'> op.inputs =  (<tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>,) stop_at_ts =  [<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>, <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, <tf.Tensor 'extern_data/placeholders/batch_dim:... verbose_stream =  file =  <returnn.log.Stream object at 0x7f25711ccdf0> File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/util/basic.py", line 7700, in FetchHelper.copy_graph line: assert target_op in ops, "target_op %r,\nops\n%s" % (target_op, pformat(ops)) locals: target_op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> ops =  [<tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder>] pformat =  <function pformat at 0x7f2575517c10> AssertionError: target_op , ops [] Step meta information: {'seq_idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], 'seq_tag': ['switchboard-1/sw02721B/sw2721B-ms98-a-0031', 'switchboard-1/sw02427A/sw2427A-ms98-a-0021', 'switchboard-1/sw02848B/sw2848B-ms98-a-0086', 'switchboard-1/sw04037A/sw4037A-ms98-a-0027', 
'switchboard-1/sw02370B/sw2370B-ms98-a-0117', 'switchboard-1/sw02145A/sw2145A-ms98-a-0107', 'switchboard-1/sw02484A/sw2484A-ms98-a-0077', 'switchboard-1/sw02768A/sw2768A-ms98-a-0064', 'switchboard-1/sw03312B/sw3312B-ms98-a-0041', 'switchboard-1/sw02344B/sw2344B-ms98-a-0023', 'switchboard-1/sw04248B/sw4248B-ms98-a-0017', 'switchboard-1/sw02762A/sw2762A-ms98-a-0059', 'switchboard-1/sw03146A/sw3146A-ms98-a-0047', 'switchboard-1/sw03032A/sw3032A-ms98-a-0065', 'switchboard-1/sw02288A/sw2288A-ms98-a-0080', 'switchboard-1/sw02751A/sw2751A-ms98-a-0066', 'switchboard-1/sw02369A/sw2369A-ms98-a-0118', 'switchboard-1/sw04169A/sw4169A-ms98-a-0059', 'switchboard-1/sw02227A/sw2227A-ms98-a-0016', 'switchboard-1/sw02061B/sw2061B-ms98-a-0170', 'switchboard-1/sw02862B/sw2862B-ms98-a-0033', 'switchboard-1/sw03116B/sw3116B-ms98-a-0065', 'switchboard-1/sw03517B/sw3517B-ms98-a-0038', 'switchboard-1/sw02360B/sw2360B-ms98-a-0086', 'switchboard-1/sw02510B/sw2510B-ms98-a-0061', 'switchboard-1/sw03919A/sw3919A-ms98-a-0017', 'switchboard-1/sw02965A/sw2965A-ms98-a-0045', 'switchboard-1/sw03154A/sw3154A-ms98-a-0073', 'switchboard-1/sw02299A/sw2299A-ms98-a-0005', 'switchboard-1/sw04572A/sw4572A-ms98-a-0026', 'switchboard-1/sw02682A/sw2682A-ms98-a-0022', 'switchboard-1/sw02808A/sw2808A-ms98-a-0014', 'switchboard-1/sw04526A/sw4526A-ms98-a-0026', 'switchboard-1/sw03180B/sw3180B-ms98-a-0010', 'switchboard-1/sw03227A/sw3227A-ms98-a-0029', 'switchboard-1/sw03891B/sw3891B-ms98-a-0008', 'switchboard-1/sw03882B/sw3882B-ms98-a-0041', 'switchboard-1/sw03102B/sw3102B-ms98-a-0027', 'switchboard-1/sw02454A/sw2454A-ms98-a-0029']} Feed dict: : int(39) : shape (39, 10208, 1), dtype float32, min/max -1.0/1.0, mean/stddev 0.0014351769/0.11459725, Tensor{'data', [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]} : shape (39,), dtype int32, min/max 4760/10208, ([ 4760 6246 6372 6861 7296 7499 7534 7622 7824 8031 8295 8431 8690 8675 8667 8886 9084 9199 9163 9156 9274 9262 9540 9668 9678 9719 9711 9902 9989 
10010 10020 10073 10006 10102 10131 10112 10130 10178 10208]) : type , Tensor{'seq_tag', [B?], dtype='string'} : bool(True) Save model under output/models/epoch.001.crash_0 Trainer not finalized, quitting. (pid 3325131) ```

See also /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn.log to avoid the broken color codes here.

I created a script to reproduce the error: `vieting@cn-285:/work/asr4/vieting/tmp/20231108_tf213_sprint_op $ ./run_example.sh`
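For context on the `EOFError: Ran out of input` in the traceback: RETURNN's `SprintSubprocessInstance._read` unpickles a reply from the Sprint/RASR child process over a pipe; when the child segfaults before writing anything, the parent hits end-of-stream immediately. A minimal sketch of that failure mode (simulating the crash with a hard exit; this is illustrative only, not RETURNN's actual code):

```python
import pickle
import subprocess
import sys

# Child "crashes" (hard exit, like a segfault) before writing any reply.
proc = subprocess.Popen(
    [sys.executable, "-c", "import os; os._exit(11)"],
    stdout=subprocess.PIPE,
)
proc.wait()

# Parent side: same pattern as Unpickler(p).load() in sprint/error_signals.py.
try:
    pickle.Unpickler(proc.stdout).load()
    err = None
except EOFError as exc:
    err = exc

print("parent got:", repr(err))
```

So the `EOFError` here is only a symptom of the dead child; the actual cause is the RASR segmentation fault in `Speech::CTCTopologyGraphBuilder::addLoopTransition`.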

Marvin84 commented 12 months ago

We encountered this bug and there is a patch for it. Daniel wanted to do a PR.

On Wed, Nov 8, 2023, 12:25 vieting @.***> wrote:

Sure, the full log is here:

RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-17-46 (UTC+0100), pid 1212279, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-04 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is not set. Collecting TensorFlow device list... Local devices available to TensorFlow: 1/1: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 3855380559335333431 xla_global_id: -1 Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown RETURNN starting up, version 1.20231107.125810+git.dbef0ca0, date/time 2023-11-08-12-18-11 (UTC+0100), pid 3325131, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-285 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is set to '2'. Collecting TensorFlow device list... 
Local devices available to TensorFlow: 1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 7046766875533982763 xla_global_id: -1 2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10089005056 locality { bus_id: 1 links { } } incarnation: 14158601620701111509 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:41:00.0, compute capability: 7.5" xla_global_id: 416903419 Using gpu device 2: NVIDIA GeForce RTX 2080 Ti Hostname 'cn-285', GPU 2, GPU-dev-name 'NVIDIA GeForce RTX 2080 Ti', GPU-memory 9.4GB Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown Learning-rate-control: file learning_rates.swb.ctc does not exist yet Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ... layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32 layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32 layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32 DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input. This will be disallowed with behavior_version 8. 
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32 layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed This will be disallowed with behavior_version 6. layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channelconv_l:channel'(750)] float32 layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channelconv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32 layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channelconv_l:channel'(750),F|F'conv_1:channel'(32)] float32 layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channelconv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32 layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channelconv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32 layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channelconv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32 layer /'conv_merged': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channelconv_l:channel//2)*conv_3:channel'(24000)] float32 layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32 layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32 layer /'conformer_1_conv_mod_depthwise_conv': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 DEPRECATION WARNING: batch_norm masked_time should be specified explicitly This will be disallowed with behavior_version 12. layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32 layer /'conformer_1_mhsa_mod_self_attention': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_dropout': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32 Network layer topology: extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'} used data keys: ['data', 'seq_tag'] layers: layer batch_norm 'conformer_1_conv_mod_bn' #: 512 layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512 layer copy 
'conformer_1_conv_mod_dropout' #: 512 layer gating 'conformer_1_conv_mod_glu' #: 512 layer layer_norm 'conformer_1_conv_mod_ln' #: 512 layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024 layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512 layer combine 'conformer_1_conv_mod_res_add' #: 512 layer activation 'conformer_1_conv_mod_swish' #: 512 layer copy 'conformer_1_ffmod_1_dropout' #: 512 layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_1_half_res_add' #: 512 layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_1_ln' #: 512 layer copy 'conformer_1_ffmod_2_dropout' #: 512 layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_2_half_res_add' #: 512 layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_2_ln' #: 512 layer linear 'conformer_1_mhsa_mod_att_linear' #: 512 layer copy 'conformer_1_mhsa_mod_dropout' #: 512 layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512 layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64 layer combine 'conformer_1_mhsa_mod_res_add' #: 512 layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512 layer layer_norm 'conformer_1_output' #: 512 layer conv 'conv_1' #: 32 layer pool 'conv_1_pool' #: 32 layer conv 'conv_2' #: 64 layer conv 'conv_3' #: 64 layer merge_dims 'conv_merged' #: 24000 layer split_dims 'conv_source' #: 1 layer source 'data' #: 1 layer copy 'encoder' #: 512 layer subnetwork 'features' #: 750 layer conv 'features/conv_h' #: 150 layer eval 'features/conv_h_act' #: 150 layer variable 'features/conv_h_filter' #: 150 layer split_dims 'features/conv_h_split' #: 1 layer conv 'features/conv_l' #: 5 layer layer_norm 'features/conv_l_act' #: 750 layer eval 'features/conv_l_act_no_norm' #: 750 layer merge_dims 'features/conv_l_merge' #: 750 layer copy 'features/output' #: 750 layer copy 'input_dropout' #: 512 layer linear 'input_linear' 
#: 512 layer softmax 'output' #: 88 layer eval 'specaug' #: 750 net params #: 18473980 net trainable params: [<tf.Variable 'conformer_1_conv_mod_bn/batch_norm/conformer_1_conv_mod_bn_conformer_1_conv_mod_bn_output_beta:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_bn/batch_norm/conformer_1_conv_mod_bn_conformer_1_conv_mod_bn_output_gamma:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_depthwise_conv/W:0' shape=(32, 1, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_depthwise_conv/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_1/W:0' shape=(512, 1024) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_1/b:0' shape=(1024,) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_2/W:0' shape=(512, 512) dtype=float32>, <tf.Variable 'conformer_1_conv_mod_pointwise_conv_2/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_dropout_linear/W:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_dropout_linear/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_linear_swish/W:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_linear_swish/b:0' shape=(2048,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_1_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_dropout_linear/W:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_dropout_linear/b:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_linear_swish/W:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_linear_swish/b:0' shape=(2048,) dtype=float32>, <tf.Variable 'conformer_1_ffmod_2_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 
'conformer_1_ffmod_2_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_att_linear/W:0' shape=(512, 512) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_ln/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_ln/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_relpos_encoding/encoding_matrix:0' shape=(65, 64) dtype=float32>, <tf.Variable 'conformer_1_mhsa_mod_self_attention/QKV:0' shape=(512, 1536) dtype=float32>, <tf.Variable 'conformer_1_output/bias:0' shape=(512,) dtype=float32>, <tf.Variable 'conformer_1_output/scale:0' shape=(512,) dtype=float32>, <tf.Variable 'conv_1/W:0' shape=(3, 3, 1, 32) dtype=float32>, <tf.Variable 'conv_1/bias:0' shape=(32,) dtype=float32>, <tf.Variable 'conv_2/W:0' shape=(3, 3, 32, 64) dtype=float32>, <tf.Variable 'conv_2/bias:0' shape=(64,) dtype=float32>, <tf.Variable 'conv_3/W:0' shape=(3, 3, 64, 64) dtype=float32>, <tf.Variable 'conv_3/bias:0' shape=(64,) dtype=float32>, <tf.Variable 'features/conv_h_filter/conv_h_filter:0' shape=(128, 1, 150) dtype=float32>, <tf.Variable 'features/conv_l/W:0' shape=(40, 1, 1, 5) dtype=float32>, <tf.Variable 'features/conv_l_act/bias:0' shape=(750,) dtype=float32>, <tf.Variable 'features/conv_l_act/scale:0' shape=(750,) dtype=float32>, <tf.Variable 'input_linear/W:0' shape=(24000, 512) dtype=float32>, <tf.Variable 'output/W:0' shape=(512, 88) dtype=float32>, <tf.Variable 'output/b:0' shape=(88,) dtype=float32>] start training at epoch 1 using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128 learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: 
EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None pretrain: None start epoch 1 with learning rate 1.325e-05 ... TF: log_dir: output/models/train-2023-11-08-11-18-11 Create optimizer <class 'returnn.tf.updater.NadamOptimizer'> with options {'epsilon': 1e-08, 'learning_rate': <tf.Variable 'learning_rate:0' shape=() dtype=float32>}. Initialize optimizer (default) with slots ['m', 'v']. These additional variable were created by the optimizer: [<tf.Variable 'optimize/gradients/conformer_1_conv_mod_bn/batch_norm/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_bn/batch_norm/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(1, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_depthwise_conv/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(32, 1, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_depthwise_conv/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_1/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 1024) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_1/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(1024,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_conv_mod_pointwise_conv_2/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 512) dtype=float32>, <tf.Variable 
'optimize/gradients/conformer_1_conv_mod_pointwise_conv_2/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_dropout_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_dropout_linear/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_linear_swish/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_linear_swish/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_1_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_dropout_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_dropout_linear/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_linear_swish/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 2048) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_linear_swish/b_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(2048,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_ffmod_2_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 
'optimize/gradients/conformer_1_mhsa_mod_att_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(512, 512) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_ln/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_ln/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_relpos_encoding/Gather_grad/Reshape_accum_grad/var_accum_grad:0' shape=(65, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_mhsa_mod_self_attention/dot/MatMul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512, 1536) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_output/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conformer_1_output/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_1/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 1, 32) dtype=float32>, <tf.Variable 'optimize/gradients/conv_1/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(32,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_2/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 32, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conv_2/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(64,) dtype=float32>, <tf.Variable 'optimize/gradients/conv_3/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(3, 3, 64, 64) dtype=float32>, <tf.Variable 'optimize/gradients/conv_3/bias_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(64,) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_h/convolution/ExpandDims_1_grad/Reshape_accum_grad/var_accum_grad:0' shape=(128, 1, 150) dtype=float32>, <tf.Variable 
'optimize/gradients/features/conv_l/convolution_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(40, 1, 1, 5) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l_act/add_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(750,) dtype=float32>, <tf.Variable 'optimize/gradients/features/conv_l_act/mul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(750,) dtype=float32>, <tf.Variable 'optimize/gradients/input_linear/W_gradient_sum/AddN_accum_grad/var_accum_grad:0' shape=(24000, 512) dtype=float32>, <tf.Variable 'optimize/gradients/output/linear/dot/MatMul_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(512, 88) dtype=float32>, <tf.Variable 'optimize/gradients/output/linear/add_bias_grad/tuple/control_dependency_1_accum_grad/var_accum_grad:0' shape=(88,) dtype=float32>, <tf.Variable 'optimize/apply_grads/accum_grad_multiple_step/beta1_power:0' shape=() dtype=float32>, <tf.Variable 'optimize/apply_grads/accum_grad_multiple_step/beta2_power:0' shape=() dtype=float32>]. 
SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--.python-control-enabled=true', '--.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--.pymod-name=returnn.sprint.control', '--.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--.configuration.channel=output-channel', '--.real-time-factor.channel=output-channel', '--.system-info.channel=output-channel', '--.time.channel=output-channel', '--.version.channel=output-channel', '--.log.channel=output-channel', '--.warning.channel=output-channel,', 'stderr', '--.error.channel=output-channel,', 'stderr', '--.statistics.channel=output-channel', '--.progress.channel=output-channel', '--.dot.channel=nil', '--.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--.model-combination.acoustic-model.state-tying.type=lookup', '--.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--.model-combination.acoustic-model.allophones.add-all=yes', '--.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--.model-combination.acoustic-model.hmm.states-per-phone=1', '--.model-combination.acoustic-model.hmm.state-repetitions=1', '--.model-combination.acoustic-model.hmm.across-word-model=yes', 
'--.model-combination.acoustic-model.hmm.early-recombination=no', '--.model-combination.acoustic-model.tdp.scale=1.0', '--.model-combination.acoustic-model.tdp..loop=0.0', '--.model-combination.acoustic-model.tdp..forward=0.0', '--.model-combination.acoustic-model.tdp..skip=infinity', '--.model-combination.acoustic-model.tdp..exit=0.0', '--.model-combination.acoustic-model.tdp.silence.loop=0.0', '--.model-combination.acoustic-model.tdp.silence.forward=0.0', '--.model-combination.acoustic-model.tdp.silence.skip=infinity', '--.model-combination.acoustic-model.tdp.silence.exit=0.0', '--.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--.model-combination.acoustic-model.phonology.history-length=0', '--.model-combination.acoustic-model.phonology.future-length=0', '--.transducer-builder-filter-out-invalid-allophones=yes', '--.fix-allophone-context-at-word-boundaries=yes', '--.allophone-state-graph-builder.topology=ctc', '--.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--.encoding=UTF-8', '--.output-channel.file=$(LOGFILE)', '--.output-channel.compressed=no', '--.output-channel.append=no', '--.output-channel.unbuffered=no', '--.LOGFILE=nn-trainer.loss.log', '--.TASK=1'] SprintSubprocessInstance: starting, pid 3325822 SprintSubprocessInstance: Sprint child process (['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--.python-control-enabled=true', '--.pymod-path=/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository', '--.pymod-name=returnn.sprint.control', '--.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--.configuration.channel=output-channel', '--.real-time-factor.channel=output-channel', '--.system-info.channel=output-channel', 
'--.time.channel=output-channel', '--.version.channel=output-channel', '--.log.channel=output-channel', '--.warning.channel=output-channel,', 'stderr', '--.error.channel=output-channel,', 'stderr', '--.statistics.channel=output-channel', '--.progress.channel=output-channel', '--.dot.channel=nil', '--.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--.model-combination.acoustic-model.state-tying.type=lookup', '--.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--.model-combination.acoustic-model.allophones.add-all=yes', '--.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--.model-combination.acoustic-model.hmm.states-per-phone=1', '--.model-combination.acoustic-model.hmm.state-repetitions=1', '--.model-combination.acoustic-model.hmm.across-word-model=yes', '--.model-combination.acoustic-model.hmm.early-recombination=no', '--.model-combination.acoustic-model.tdp.scale=1.0', '--.model-combination.acoustic-model.tdp..loop=0.0', '--.model-combination.acoustic-model.tdp..forward=0.0', '--.model-combination.acoustic-model.tdp..skip=infinity', '--.model-combination.acoustic-model.tdp..exit=0.0', '--.model-combination.acoustic-model.tdp.silence.loop=0.0', '--.model-combination.acoustic-model.tdp.silence.forward=0.0', '--.model-combination.acoustic-model.tdp.silence.skip=infinity', 
'--.model-combination.acoustic-model.tdp.silence.exit=0.0', '--.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--.model-combination.acoustic-model.phonology.history-length=0', '--.model-combination.acoustic-model.phonology.future-length=0', '--.transducer-builder-filter-out-invalid-allophones=yes', '--.fix-allophone-context-at-word-boundaries=yes', '--.allophone-state-graph-builder.topology=ctc', '--.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--.encoding=UTF-8', '--.output-channel.file=$(LOGFILE)', '--.output-channel.compressed=no', '--.output-channel.append=no', '--.output-channel.unbuffered=no', '--.LOGFILE=nn-trainer.loss.log', '--.TASK=1']) caused an exception. TensorFlow exception: Graph execution error:

Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in init self.loss = network.get_objective() File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File 
"/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in init self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in 
fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last):

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child ret = self._read()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read return Unpickler(p).load()

EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args)

File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch instance = self._get_instance(i)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance self._maybe_create_new_instance()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts))

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__ self.init()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init self._start_child()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed")

Exception: SprintSubprocessInstance Sprint init failed

[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]] (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last):

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child ret = self._read()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read return Unpickler(p).load()

EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args)

File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch instance = self._get_instance(i)

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance self._maybe_create_new_instance()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts))

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__ self.init()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init self._start_child()

File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed")

Exception: SprintSubprocessInstance Sprint init failed

[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored.

Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "/u/vieting/setups/swb/20230406_feat/work/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/rnn.py", line 11, in main() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 634, in main execute_main_task() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/main.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/updater.py", line 172, in init self.loss = network.get_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 
1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = 
tf_compat.v1.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler return dispatch_target(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func return py_func_common(func, inp, Tout, stateful, name=name) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common return _internal_py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func result = gen_script_ops.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func _, _, _op, _outputs = _op_def_library._apply_op_helper( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal ret = Operation.from_node_def(

Exception UnknownError() in step 0. (pid 3325131)
Failing op: <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>
We tried to fetch the op inputs ([<tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>]) but got another exception:
target_op <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>, ops [<tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder>]

EXCEPTION

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
    line: return fn(*args)
    locals:
      fn = <local> <function BaseSession._do_run.<locals>._run_fn at 0x7f2192d77d30>
      args = <local> ({<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638],
                        [-0.09610788],
                        [-0.05115783],
                        ...,
                        [ 0.        ],
                        [ 0.        ],
                        [ 0.        ]],

                        [[-0.00226238],
                         [-0.01049833],
                         [-0.00...

  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.<locals>._run_fn
    line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      self = <local> <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._call_tf_sessionrun = <local> <bound method BaseSession._call_tf_sessionrun of <tensorflow.python.client.session.Session object at 0x7f2571096ac0>>
      options = <local> None
      feed_dict = <local> {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638],
                             [-0.09610788],
                             [-0.05115783],
                             ...,
                             [ 0.        ],
                             [ 0.        ],
                             [ 0.        ]],

                             [[-0.00226238],
                              [-0.01049833],
                              [-0.001...
      fetch_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      target_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
      run_metadata = <local> None

  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
    line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      tf_session = <local> <module 'tensorflow.python.client.pywrap_tf_session' from '/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/pywrap_tf_session.py'>
      tf_session.TF_SessionRun_wrapper = <local> <built-in method TF_SessionRun_wrapper of PyCapsule object at 0x7f2538137300>
      self = <local> <tensorflow.python.client.session.Session object at 0x7f2571096ac0>
      self._session = <local> <tensorflow.python.client._pywrap_tf_session.TF_Session object at 0x7f2423372a70>
      options = <local> None
      feed_dict = <local> {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2422de3eb0>: array([[[-0.05505638],
                             [-0.09610788],
                             [-0.05115783],
                             ...,
                             [ 0.        ],
                             [ 0.        ],
                             [ 0.        ]],

                             [[-0.00226238],
                              [-0.01049833],
                              [-0.001...
      fetch_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f24250d81b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423f96cf0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f2423b01830>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      target_list = <local> [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa970>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f24080fa930>]
      run_metadata = <local> None

UnknownError: 2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):

  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
    return Unpickler(p).load()
EOFError: Ran out of input

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 511, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 417, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 405, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 80, in __init__
    self.init()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 302, in init
    self._start_child()
  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 169, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed

  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
  [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):

  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core

Marvin84 commented 12 months ago
```diff
diff --git a/returnn/sprint/error_signals.py b/returnn/sprint/error_signals.py
index 735ac363..1c204e68 100644
--- a/returnn/sprint/error_signals.py
+++ b/returnn/sprint/error_signals.py
@@ -130,7 +130,7 @@ class SprintSubprocessInstance:
 
     def _start_child(self):
         assert self.child_pid is None
-        self.pipe_c2p = self._pipe_open()
+        self.pipe_c2p = self._pipe_open(buffered=True)
         self.pipe_p2c = self._pipe_open()
         args = self._build_sprint_args()
         print("SprintSubprocessInstance: exec", args, file=log.v5)
@@ -169,14 +169,14 @@ class SprintSubprocessInstance:
             raise Exception("SprintSubprocessInstance Sprint init failed")
 
     # noinspection PyMethodMayBeStatic
-    def _pipe_open(self):
+    def _pipe_open(self, buffered=False):
         readend, writeend = os.pipe()
         if hasattr(os, "set_inheritable"):
             # https://www.python.org/dev/peps/pep-0446/
             os.set_inheritable(readend, True)
             os.set_inheritable(writeend, True)
-        readend = os.fdopen(readend, "rb", 0)
-        writeend = os.fdopen(writeend, "wb", 0)
+        readend = os.fdopen(readend, "rb", -bool(buffered))  # -1 is default for buffered
+        writeend = os.fdopen(writeend, "wb", -bool(buffered))
         return readend, writeend
 
     @property
```
Marvin84 commented 12 months ago

AFAIR, the problem occurs only when running in an apptainer environment. The buffer does not contain all the info, and RETURNN crashes because the RASR automata are truncated/incomplete.
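For context on what the patch toggles: the third argument of `os.fdopen` selects buffering (`0` = unbuffered, `-1` = default block buffering, which is what `-bool(buffered)` maps to). A minimal standalone sketch of the same `_pipe_open` idea, not the actual RETURNN code:

```python
import os


def pipe_open(buffered=False):
    """Open an os.pipe() pair as binary file objects.

    buffering=0 means fully unbuffered; -1 selects the interpreter's
    default block buffering (the -bool(buffered) trick from the patch).
    """
    readend, writeend = os.pipe()
    # Make the fds survive exec() in the child process (PEP 446).
    os.set_inheritable(readend, True)
    os.set_inheritable(writeend, True)
    buffering = -1 if buffered else 0
    return os.fdopen(readend, "rb", buffering), os.fdopen(writeend, "wb", buffering)


r, w = pipe_open(buffered=True)
w.write(b"automaton")
w.flush()  # a buffered write end holds data back until flushed or closed
w.close()
data = r.read()  # reads until EOF, since the write end is closed
r.close()
```

Note the trade-off: a buffered write end only hands data to the pipe on flush/close, so any code relying on it must flush explicitly, while the unbuffered variant pays one syscall per write.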

albertz commented 12 months ago

So for reference, the actual error is this:

Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):

  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 164, in _start_child
    ret = self._read()

  File "/work/asr4/vieting/setups/swb/work/20230406_feat/i6_core/tools/git/CloneGitRepositoryJob.Sc1EzS78fRSC/output/repository/returnn/sprint/error_signals.py", line 225, in _read
    return Unpickler(p).load()

EOFError: Ran out of input
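This `EOFError` is just the symptom on the RETURNN side: `Unpickler.load()` raises it whenever the stream ends before a complete pickle arrives, which is what happens when the RASR child dies (here: the segfault from the title) before writing its init reply. A minimal sketch of that mechanism (my reading of the failure mode, not code from the repo):

```python
import io
import pickle

# Stand-in for the read end of the child-to-parent pipe after the
# child crashed without writing anything.
empty_pipe_end = io.BytesIO(b"")

try:
    pickle.Unpickler(empty_pipe_end).load()
    raised = None
except EOFError as exc:
    raised = str(exc)  # CPython's pickle uses the message "Ran out of input"
```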
vieting commented 12 months ago

I just tested the proposed patch and it does not fix the issue for my example.

albertz commented 12 months ago

Can you link the full patch? It seems incomplete here.

Marvin84 commented 12 months ago

Can you link the full patch? It seems incomplete here.

Sure, just edited the comment.

albertz commented 12 months ago

@vieting I pushed sth which should fix this. Can you try?

albertz commented 12 months ago

(For reference, there was also an EOFError in #1363, but I think that was another problem.)

albertz commented 12 months ago

Note: I did not actually test my recent change, as I don't have any setup ready to try this. Please try it out and report if it works.

vieting commented 12 months ago

Just tested and I still get the error.

Log:

``` RETURNN starting up, version 1.20231108.124950+git.a3d1094d, date/time 2023-11-08-14-13-24 (UTC+0100), pid 352402, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.config'] Hostname: cn-283 TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is set to '4'. Collecting TensorFlow device list... Local devices available to TensorFlow: 1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 14088248937803725314 xla_global_id: -1 2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10089005056 locality { bus_id: 2 numa_node: 1 links { } } incarnation: 17654959729817767865 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:81:00.0, compute capability: 7.5" xla_global_id: 416903419 Using gpu device 4: NVIDIA GeForce RTX 2080 Ti Hostname 'cn-283', GPU 4, GPU-dev-name 'NVIDIA GeForce RTX 2080 Ti', GPU-memory 9.4GB Train data: input: 1 x 1 output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]} OggZipDataset, sequences: 249229, frames: unknown Dev data: OggZipDataset, sequences: 300, frames: unknown Learning-rate-control: file learning_rates.swb.ctc does not exist yet Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ... 
layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32 layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32 layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32 layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32 DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input. This will be disallowed with behavior_version 8. layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32 layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed This will be disallowed with behavior_version 6. 
layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32 layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32 layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_1:channel'(32)] float32 layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32 layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32 layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32 layer /'conv_merged': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channel*conv_l:channel//2)*conv_3:channel'(24000)] float32 layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 
layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32 layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32 layer /'conformer_1_conv_mod_depthwise_conv': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 DEPRECATION WARNING: batch_norm masked_time should be specified explicitly This will be disallowed with behavior_version 12. 
layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32 layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32 layer /'conformer_1_mhsa_mod_self_attention': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_ln': 
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32 Network layer topology: extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'} used data keys: ['data', 'seq_tag'] layers: layer batch_norm 'conformer_1_conv_mod_bn' #: 512 layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512 layer copy 'conformer_1_conv_mod_dropout' #: 512 layer gating 'conformer_1_conv_mod_glu' #: 512 layer layer_norm 'conformer_1_conv_mod_ln' #: 512 layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024 layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512 layer combine 'conformer_1_conv_mod_res_add' #: 512 layer activation 'conformer_1_conv_mod_swish' #: 512 layer copy 
'conformer_1_ffmod_1_dropout' #: 512 layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_1_half_res_add' #: 512 layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_1_ln' #: 512 layer copy 'conformer_1_ffmod_2_dropout' #: 512 layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512 layer eval 'conformer_1_ffmod_2_half_res_add' #: 512 layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048 layer layer_norm 'conformer_1_ffmod_2_ln' #: 512 layer linear 'conformer_1_mhsa_mod_att_linear' #: 512 layer copy 'conformer_1_mhsa_mod_dropout' #: 512 layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512 layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64 layer combine 'conformer_1_mhsa_mod_res_add' #: 512 layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512 layer layer_norm 'conformer_1_output' #: 512 layer conv 'conv_1' #: 32 layer pool 'conv_1_pool' #: 32 layer conv 'conv_2' #: 64 layer conv 'conv_3' #: 64 layer merge_dims 'conv_merged' #: 24000 layer split_dims 'conv_source' #: 1 layer source 'data' #: 1 layer copy 'encoder' #: 512 layer subnetwork 'features' #: 750 layer conv 'features/conv_h' #: 150 layer eval 'features/conv_h_act' #: 150 layer variable 'features/conv_h_filter' #: 150 layer split_dims 'features/conv_h_split' #: 1 layer conv 'features/conv_l' #: 5 layer layer_norm 'features/conv_l_act' #: 750 layer eval 'features/conv_l_act_no_norm' #: 750 layer merge_dims 'features/conv_l_merge' #: 750 layer copy 'features/output' #: 750 layer copy 'input_dropout' #: 512 layer linear 'input_linear' #: 512 layer softmax 'output' #: 88 layer eval 'specaug' #: 750 net params #: 18473980 net trainable params: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ] start training at epoch 1 using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128 learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, 
relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None pretrain: None start epoch 1 with learning rate 1.325e-05 ... TF: log_dir: output/models/train-2023-11-08-13-13-24 Create optimizer with options {'epsilon': 1e-08, 'learning_rate': }. Initialize optimizer (default) with slots ['m', 'v']. These additional variable were created by the optimizer: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]. SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', 
'--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', 
'--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']
SprintSubprocessInstance: starting, pid 353093
SprintSubprocessInstance: Sprint child process ([same argument list as above]) caused an exception.
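The `EOFError: Ran out of input` in the tracebacks that follow is the parent side of the Sprint subprocess handshake hitting an empty pipe: RASR dies during startup (here presumably the segfault in `Speech::CTCTopologyGraphBuilder::addLoopTransition` from the issue title) before it pickles its init reply, so the `Unpickler` reads EOF and RETURNN can only report "Sprint init failed". A minimal, hypothetical sketch of that pattern (names and protocol are illustrative, not RETURNN's actual `SprintSubprocessInstance` code):

```python
import pickle
import subprocess
import sys


def read_child_handshake(cmd):
    """Spawn `cmd` and unpickle its first message from stdout.

    An EOFError from the Unpickler means the child exited before
    completing its init handshake (e.g. it crashed during startup).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        return pickle.Unpickler(proc.stdout).load()
    except EOFError:
        # The pipe closed before any pickled message arrived.
        raise Exception("child init failed: no handshake message") from None
    finally:
        proc.stdout.close()
        proc.wait()


# A child that completes the handshake returns the pickled payload:
print(read_child_handshake(
    [sys.executable, "-c",
     "import pickle, sys; sys.stdout.buffer.write(pickle.dumps('ok'))"]))

# A child that dies before writing anything triggers the failure path:
try:
    read_child_handshake([sys.executable, "-c", "import sys; sys.exit(1)"])
except Exception as exc:
    print(exc)
```

So the Python-side traceback only says that the handshake never arrived; the actual cause of the crash has to be read from the RASR side, e.g. `nn-trainer.loss.log` (the `--*.LOGFILE` above) or a core dump of the child.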
TensorFlow exception: Graph execution error:
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "./returnn/rnn.py", line 11, in <module>
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
[the same "Detected at node ... defined at" stack trace is printed a second time, identical to the above]
2 root error(s) found.
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 166, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 235, in _read
    raise EOFError
EOFError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 533, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 439, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 427, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 82, in __init__
    self.init()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 324, in init
    self._start_child()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 171, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
[[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
(1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
[traceback identical to (0)]
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations. 0 derived errors ignored.
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
[same RETURNN frames as in the "defined at" trace above, followed by the TF-internal frames:]
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
    return dispatch_target(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func
    return py_func_common(func, inp, Tout, stateful, name=name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common
    return _internal_py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func
    result = gen_script_ops.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
    ret = Operation.from_node_def(
Exception UnknownError() in step 0. (pid 352402)
Failing op: We tried to fetch the op inputs ([]) but got another exception: target_op , ops []
EXCEPTION
Traceback (most recent call last):
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
line: return fn(*args)
locals:
  fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f3307fe4f70>
  args = ({<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f35983ad7b0>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.00...
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.<locals>._run_fn
line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
locals:
  self = <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>
  self._call_tf_sessionrun = <bound method BaseSession._call_tf_sessionrun of <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>>
  options = None
  feed_dict = {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f35983ad7b0>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
  fetch_list = [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f35893975b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f35893a4ef0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f3589379470>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
  target_list = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f35917f95b0>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f35917f9770>]
  run_metadata = None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
locals:
  tf_session = <module 'tensorflow.python.client.pywrap_tf_session' from '/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/pywrap_tf_session.py'>
  tf_session.TF_SessionRun_wrapper = <built-in method TF_SessionRun_wrapper of PyCapsule object at 0x7f36aecb2300>
  self = <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>
  self._session = <tensorflow.python.client._pywrap_tf_session.TF_Session object at 0x7f35986e9470>
  options = None
  feed_dict, fetch_list, target_list = (same values as in the frame above)
  run_metadata = None
UnknownError: 2 root error(s) found.
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
[same EOFError / "Sprint init failed" traceback as above]
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
[[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
(1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
[traceback identical to (0)]
[[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations. 0 derived errors ignored.
During handling of the above exception, another exception occurred:
EXCEPTION
Traceback (most recent call last):
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 744, in Runner.run
line: fetches_results = sess.run(fetches_dict, feed_dict=feed_dict, options=run_options)  # type: typing.Dict[str,typing.Union[numpy.ndarray,str]]
locals:
  fetches_results =
  sess = <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>
  sess.run = <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>>
  fetches_dict = {'size:data:0': <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, 'loss': <tf.Tensor 'objective/add:0' shape=() dtype=float32>, 'cost:output': <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, 'loss_norm_..., len = 8
  feed_dict = {<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
  options =
  run_options = None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 969, in BaseSession.run
line: result = self._run(None, fetches, feed_dict, options_ptr, run_metadata_ptr)
locals:
  result =
  self = <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>
  self._run = <bound method BaseSession._run of <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>>
  fetches = (same dict as fetches_dict above, len = 8)
  feed_dict = (same as above)
  options_ptr = None
  run_metadata_ptr = None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1192, in BaseSession._run
line: results = self._do_run(handle, final_targets, final_fetches, feed_dict_tensor, options, run_metadata)
locals:
  results =
  self = <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>
  self._do_run = <bound method BaseSession._do_run of <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>>
  handle = None
  final_targets = [<tf.Operation 'conformer_1_conv_mod_bn/batch_norm/cond/Merge_1' type=Merge>, <tf.Operation 'optim_and_step_incr' type=NoOp>]
  final_fetches = [<tf.Tensor 'objective/add:0' shape=() dtype=float32>, <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, <tf.Tensor 'objective/loss/loss_init/truediv:0' shape=() dtype=float32>, <tf.Tensor 'globals/mem_usage_deviceGPU0:0' shape=() dtype=in...
  feed_dict_tensor = {<Reference wrapping <tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049...
  options = None
  run_metadata = None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1372, in BaseSession._do_run
line: return self._do_call(_run_fn, feeds, fetches, targets, options, run_metadata)
locals:
  self = <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>
  self._do_call = <bound method BaseSession._do_call of <tensorflow.python.client.session.Session object at 0x7f36e7563ac0>>
  _run_fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f3307fe4f70>
  feeds = (same feed dict as above)
  fetches = (same fetch list as above)
  targets = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f35917f95b0>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f35917f9770>]
  options = None
  run_metadata = None
File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1398, in BaseSession._do_call
line: raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
locals:
  type = <class 'type'>
  e =
  node_def = name: "objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch" op: "PyFunc" input: "extern_data/placeholders/seq_tag/seq_tag" attr { key: "token" value { s: "pyfunc_0" } } attr { key: "Tout" value { list { type: DT_INT32 type: DT_FLOAT type: DT_INT...
  op = <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc>
  message = 'Graph execution error:\n\nDetected at node \'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch\' defined at (most recent call last):\n File "./returnn/rnn.py", line 11, in \n main()\n File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__mai..., len = 11284
UnknownError: Graph execution error:
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "./returnn/rnn.py", line 11, in <module>
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", 
line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = 
self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. 
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 166, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 235, in _read raise EOFError EOFError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 533, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 439, in _get_instance self._maybe_create_new_instance() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 427, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 82, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 324, in init self._start_child() File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 171, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]] (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 166, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 235, in _read raise EOFError EOFError During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 533, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 439, in _get_instance self._maybe_create_new_instance() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 427, in 
_maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 82, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 324, in init self._start_child() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 171, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ 
self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in 
op_dispatch_handler return dispatch_target(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func return py_func_common(func, inp, Tout, stateful, name=name) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common return _internal_py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func result = gen_script_ops.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func _, _, _op, _outputs = _op_def_library._apply_op_helper( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal ret = Operation.from_node_def( During handling of the above exception, another exception occurred: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4341, in help_on_tf_exception line: debug_fetch, fetch_helpers, op_copied = FetchHelper.copy_graph( debug_fetch, target_op=op, fetch_helper_tensors=list(op.inputs), stop_at_ts=stop_at_ts, verbose_stream=file, ) locals: debug_fetch =  <tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder> fetch_helpers =  op_copied =  FetchHelper =  <class 'returnn.tf.util.basic.FetchHelper'> FetchHelper.copy_graph =  <bound method FetchHelper.copy_graph of <class 'returnn.tf.util.basic.FetchHelper'>> target_op =  op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> fetch_helper_tensors =  list =  <class 'list'> op.inputs =  (<tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>,) stop_at_ts =  
[<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>, <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, <tf.Tensor 'extern_data/placeholders/batch_dim:... verbose_stream =  file =  <returnn.log.Stream object at 0x7f36e7695df0> File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py", line 7700, in FetchHelper.copy_graph line: assert target_op in ops, "target_op %r,\nops\n%s" % (target_op, pformat(ops)) locals: target_op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> ops =  [<tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder>] pformat =  <function pformat at 0x7f36eb9e5c10> AssertionError: target_op , ops [] Step meta information: {'seq_idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], 'seq_tag': ['switchboard-1/sw02721B/sw2721B-ms98-a-0031', 'switchboard-1/sw02427A/sw2427A-ms98-a-0021', 'switchboard-1/sw02848B/sw2848B-ms98-a-0086', 'switchboard-1/sw04037A/sw4037A-ms98-a-0027', 'switchboard-1/sw02370B/sw2370B-ms98-a-0117', 'switchboard-1/sw02145A/sw2145A-ms98-a-0107', 'switchboard-1/sw02484A/sw2484A-ms98-a-0077', 'switchboard-1/sw02768A/sw2768A-ms98-a-0064', 'switchboard-1/sw03312B/sw3312B-ms98-a-0041', 'switchboard-1/sw02344B/sw2344B-ms98-a-0023', 'switchboard-1/sw04248B/sw4248B-ms98-a-0017', 'switchboard-1/sw02762A/sw2762A-ms98-a-0059', 'switchboard-1/sw03146A/sw3146A-ms98-a-0047', 'switchboard-1/sw03032A/sw3032A-ms98-a-0065', 'switchboard-1/sw02288A/sw2288A-ms98-a-0080', 'switchboard-1/sw02751A/sw2751A-ms98-a-0066', 'switchboard-1/sw02369A/sw2369A-ms98-a-0118', 'switchboard-1/sw04169A/sw4169A-ms98-a-0059', 'switchboard-1/sw02227A/sw2227A-ms98-a-0016', 'switchboard-1/sw02061B/sw2061B-ms98-a-0170', 
'switchboard-1/sw02862B/sw2862B-ms98-a-0033', 'switchboard-1/sw03116B/sw3116B-ms98-a-0065', 'switchboard-1/sw03517B/sw3517B-ms98-a-0038', 'switchboard-1/sw02360B/sw2360B-ms98-a-0086', 'switchboard-1/sw02510B/sw2510B-ms98-a-0061', 'switchboard-1/sw03919A/sw3919A-ms98-a-0017', 'switchboard-1/sw02965A/sw2965A-ms98-a-0045', 'switchboard-1/sw03154A/sw3154A-ms98-a-0073', 'switchboard-1/sw02299A/sw2299A-ms98-a-0005', 'switchboard-1/sw04572A/sw4572A-ms98-a-0026', 'switchboard-1/sw02682A/sw2682A-ms98-a-0022', 'switchboard-1/sw02808A/sw2808A-ms98-a-0014', 'switchboard-1/sw04526A/sw4526A-ms98-a-0026', 'switchboard-1/sw03180B/sw3180B-ms98-a-0010', 'switchboard-1/sw03227A/sw3227A-ms98-a-0029', 'switchboard-1/sw03891B/sw3891B-ms98-a-0008', 'switchboard-1/sw03882B/sw3882B-ms98-a-0041', 'switchboard-1/sw03102B/sw3102B-ms98-a-0027', 'switchboard-1/sw02454A/sw2454A-ms98-a-0029']} Feed dict: : int(39) : shape (39, 10208, 1), dtype float32, min/max -1.0/1.0, mean/stddev 0.0014351769/0.11459725, Tensor{'data', [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]} : shape (39,), dtype int32, min/max 4760/10208, ([ 4760 6246 6372 6861 7296 7499 7534 7622 7824 8031 8295 8431 8690 8675 8667 8886 9084 9199 9163 9156 9274 9262 9540 9668 9678 9719 9711 9902 9989 10010 10020 10073 10006 10102 10131 10112 10130 10178 10208]) : type , Tensor{'seq_tag', [B?], dtype='string'} : bool(True) Save model under output/models/epoch.001.crash_0 Trainer not finalized, quitting. (pid 352402) ```
vieting commented 12 months ago

@albertz check /work/asr4/vieting/tmp/20231108_tf213_sprint_op/run_example.sh if you want to test it yourself.

Marvin84 commented 12 months ago

@christophmluscher @NeoLegends does this relate to RASR compiled against TF 2.13? Do you recognize this error?

vieting commented 12 months ago

Is it maybe a problem that RASR was compiled with my old TF 2.8 image? I still use the same RASR binary with the new image. Loading the automata does not require TF, so I thought I could use the same RASR.
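One quick way to check whether the `nn-trainer` binary actually links against a particular TF version is to inspect its dynamic library dependencies. A minimal sketch (the default `RASR_BINARY` path below is taken from the log further down and may differ in your setup):

```shell
# Hypothetical helper: point RASR_BINARY at your actual nn-trainer binary.
RASR_BINARY=${RASR_BINARY:-/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard}

if [ -x "$RASR_BINARY" ]; then
  # List any TensorFlow shared libraries the binary is dynamically linked against.
  # If libtensorflow* shows up here, the binary does depend on a specific TF ABI.
  ldd "$RASR_BINARY" | grep -i tensorflow || echo "no dynamically linked TF libs found"
else
  echo "binary not found: $RASR_BINARY"
fi
```

If `ldd` reports TF shared libraries from the old image, running the binary inside the TF 2.13 image can fail at load time even for code paths that do not use TF themselves.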

albertz commented 12 months ago

@vieting I pushed another small change. Can you try again?

vieting commented 12 months ago

> I pushed another small change. Can you try again?

Unfortunately, this still does not fix my example.

Traceback (most recent call last):

  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child
    ret = self._read()                                                                               

  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read                                                                                       
    return util.read_pickled_object(p)                                                                                                                                                                     

  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()                  

  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))

EOFError: expected to read 4 bytes but got EOF after 0 bytes
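For context on what this EOFError means: RETURNN exchanges pickled objects with the Sprint/RASR subprocess over a pipe, each message framed by a fixed-size length header. Getting EOF after 0 bytes therefore means the child process closed the pipe before writing anything, i.e. it died during init. A minimal sketch of such length-prefixed framing (illustrative names, not RETURNN's exact code in `returnn/util/basic.py`):

```python
import io
import pickle
import struct

def write_pickled_object(stream, obj):
    """Write a 4-byte big-endian size header, then the pickled payload."""
    payload = pickle.dumps(obj)
    stream.write(struct.pack(">i", len(payload)))
    stream.write(payload)

def read_pickled_object(stream):
    """Read the 4-byte size header, then exactly that many payload bytes."""
    header = stream.read(4)
    if len(header) != 4:
        # The writer closed the stream before sending a full header --
        # this is the "expected to read 4 bytes but got EOF" case from the log.
        raise EOFError("expected to read 4 bytes but got EOF after %i bytes" % len(header))
    (size,) = struct.unpack(">i", header)
    payload = stream.read(size)
    if len(payload) != size:
        raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, len(payload)))
    return pickle.loads(payload)

# Round trip through an in-memory stand-in for the pipe:
buf = io.BytesIO()
write_pickled_object(buf, {"edges": [0, 1], "weights": [0.5]})
buf.seek(0)
print(read_pickled_object(buf))

# An empty stream reproduces the error shape from the log:
try:
    read_pickled_object(io.BytesIO())
except EOFError as exc:
    print(exc)
```

So the traceback here is only the symptom on the RETURNN side; the actual cause is whatever made the RASR child exit before completing its init handshake.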
```
RETURNN starting up, version 1.20231108.140626+git.9fe93590, date/time 2023-11-08-15-13-28 (UTC+0100), pid 356353, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3
RETURNN command line options: ['returnn.config']
Hostname: cn-283
TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow)
Use num_threads=1 (but min 2) via OMP_NUM_THREADS.
Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}.
CUDA_VISIBLE_DEVICES is set to '4'.
Collecting TensorFlow device list...
Local devices available to TensorFlow:
  1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 13595377529408947728 xla_global_id: -1
  2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10089005056 locality { bus_id: 2 numa_node: 1 links { } } incarnation: 17849739553926303687 physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:81:00.0, compute capability: 7.5" xla_global_id: 416903419
Using gpu device 4: NVIDIA GeForce RTX 2080 Ti
Hostname 'cn-283', GPU 4, GPU-dev-name 'NVIDIA GeForce RTX 2080 Ti', GPU-memory 9.4GB
Train data:
  input: 1 x 1
  output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]}
  OggZipDataset, sequences: 249229, frames: unknown
Dev data:
  OggZipDataset, sequences: 300, frames: unknown
Learning-rate-control: file learning_rates.swb.ctc does not exist yet
Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ...
layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32
layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32
layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32
DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input.
This will be disallowed with behavior_version 8.
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32
layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed
This will be disallowed with behavior_version 6.
layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32
layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_1:channel'(32)] float32
layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32
layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32
layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32
layer /'conv_merged': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channel*conv_l:channel//2)*conv_3:channel'(24000)] float32
layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32
layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_conv_mod_pointwise_conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32
layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32
layer /'conformer_1_conv_mod_depthwise_conv': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
DEPRECATION WARNING: batch_norm masked_time should be specified explicitly
This will be disallowed with behavior_version 12.
layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32
layer /'conformer_1_mhsa_mod_self_attention': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_mhsa_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_ffmod_2_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32
layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32
layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32
layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32
Network layer topology:
  extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'}
  used data keys: ['data', 'seq_tag']
  layers:
    layer batch_norm 'conformer_1_conv_mod_bn' #: 512
    layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512
    layer copy 'conformer_1_conv_mod_dropout' #: 512
    layer gating 'conformer_1_conv_mod_glu' #: 512
    layer layer_norm 'conformer_1_conv_mod_ln' #: 512
    layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024
    layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512
    layer combine 'conformer_1_conv_mod_res_add' #: 512
    layer activation 'conformer_1_conv_mod_swish' #: 512
    layer copy 'conformer_1_ffmod_1_dropout' #: 512
    layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512
    layer eval 'conformer_1_ffmod_1_half_res_add' #: 512
    layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048
    layer layer_norm 'conformer_1_ffmod_1_ln' #: 512
    layer copy 'conformer_1_ffmod_2_dropout' #: 512
    layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512
    layer eval 'conformer_1_ffmod_2_half_res_add' #: 512
    layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048
    layer layer_norm 'conformer_1_ffmod_2_ln' #: 512
    layer linear 'conformer_1_mhsa_mod_att_linear' #: 512
    layer copy 'conformer_1_mhsa_mod_dropout' #: 512
    layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512
    layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64
    layer combine 'conformer_1_mhsa_mod_res_add' #: 512
    layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512
    layer layer_norm 'conformer_1_output' #: 512
    layer conv 'conv_1' #: 32
    layer pool 'conv_1_pool' #: 32
    layer conv 'conv_2' #: 64
    layer conv 'conv_3' #: 64
    layer merge_dims 'conv_merged' #: 24000
    layer split_dims 'conv_source' #: 1
    layer source 'data' #: 1
    layer copy 'encoder' #: 512
    layer subnetwork 'features' #: 750
    layer conv 'features/conv_h' #: 150
    layer eval 'features/conv_h_act' #: 150
    layer variable 'features/conv_h_filter' #: 150
    layer split_dims 'features/conv_h_split' #: 1
    layer conv 'features/conv_l' #: 5
    layer layer_norm 'features/conv_l_act' #: 750
    layer eval 'features/conv_l_act_no_norm' #: 750
    layer merge_dims 'features/conv_l_merge' #: 750
    layer copy 'features/output' #: 750
    layer copy 'input_dropout' #: 512
    layer linear 'input_linear' #: 512
    layer softmax 'output' #: 88
    layer eval 'specaug' #: 750
net params #: 18473980
net trainable params: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]
start training at epoch 1
using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128
learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 1.325e-05 ...
TF: log_dir: output/models/train-2023-11-08-14-13-28
Create optimizer with options {'epsilon': 1e-08, 'learning_rate': }.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ].
SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1',
'--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', 
'--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1'] SprintSubprocessInstance: starting, pid 356974 SprintSubprocessInstance: Sprint child process (['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:37,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', 
'--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']) caused an exception. 
TensorFlow exception: Graph execution error:
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "./returnn/rnn.py", line 11, in <module>
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "./returnn/rnn.py", line 11, in <module>
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__
    self.init()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init
    self._start_child()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__
    self.init()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init
    self._start_child()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
  File "./returnn/rnn.py", line 11, in <module>
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
    return dispatch_target(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func
    return py_func_common(func, inp, Tout, stateful, name=name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common
    return _internal_py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func
    result = gen_script_ops.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
    ret = Operation.from_node_def(

Exception UnknownError() in step 0. (pid 356353)
Failing op: We tried to fetch the op inputs ([]) but got another exception: target_op , ops []
EXCEPTION
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
    line: return fn(*args)
    locals:
      fn = <function BaseSession._do_run.<locals>._run_fn at 0x7f4267b80c10>
      args = ({<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f80b9630>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.00...
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.<locals>._run_fn
    line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      self = <tensorflow.python.client.session.Session object at 0x7f46458c3d60>
      self._call_tf_sessionrun = <bound method BaseSession._call_tf_sessionrun of <tensorflow.python.client.session.Session object at 0x7f46458c3d60>>
      options = None
      feed_dict = {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f80b9630>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetch_list = [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f2b68ef0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f2b688b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44ef901eb0>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      target_list = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f44eaac5d70>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f44eaac5db0>]
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
    line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      tf_session = <module 'tensorflow.python.client.pywrap_tf_session' from '/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/pywrap_tf_session.py'>
      tf_session.TF_SessionRun_wrapper = <built-in method TF_SessionRun_wrapper of PyCapsule object at 0x7f46444243f0>
      self = <tensorflow.python.client.session.Session object at 0x7f46458c3d60>
      self._session = <tensorflow.python.client._pywrap_tf_session.TF_Session object at 0x7f44f83404f0>
      options = None
      feed_dict = {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f80b9630>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetch_list = [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f2b68ef0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f2b688b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44ef901eb0>, <tensorflow.python.client._pywrap_tf_session.TF_Ou...
      target_list = [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f44eaac5d70>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f44eaac5db0>]
      run_metadata = None
UnknownError: 2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__
    self.init()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init
    self._start_child()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child
    ret = self._read()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch
    instance = self._get_instance(i)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance
    self._maybe_create_new_instance()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance
    self.instances.append(SprintSubprocessInstance(**self.sprint_opts))
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__
    self.init()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init
    self._start_child()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child
    raise Exception("SprintSubprocessInstance Sprint init failed")
Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

EXCEPTION
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 744, in Runner.run
    line: fetches_results = sess.run( fetches_dict, feed_dict=feed_dict, options=run_options )  # type: typing.Dict[str,typing.Union[numpy.ndarray,str]]
    locals:
      fetches_results = 
      sess = <tensorflow.python.client.session.Session object at 0x7f46458c3d60>
      sess.run = <bound method BaseSession.run of <tensorflow.python.client.session.Session object at 0x7f46458c3d60>>
      fetches_dict = {'size:data:0': <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, 'loss': <tf.Tensor 'objective/add:0' shape=() dtype=float32>, 'cost:output': <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, 'loss_norm_..., len = 8
      feed_dict = {<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
options =  run_options =  None File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 969, in BaseSession.run line: result = self._run(None, fetches, feed_dict, options_ptr, run_metadata_ptr) locals: result =  self =  <tensorflow.python.client.session.Session object at 0x7f46458c3d60> self._run =  <bound method BaseSession._run of <tensorflow.python.client.session.Session object at 0x7f46458c3d60>> fetches =  {'size:data:0': <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, 'loss': <tf.Tensor 'objective/add:0' shape=() dtype=float32>, 'cost:output': <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, 'loss_norm_..., len = 8 feed_dict =  {<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001... options_ptr =  None run_metadata_ptr =  None File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1192, in BaseSession._run line: results = self._do_run(handle, final_targets, final_fetches, feed_dict_tensor, options, run_metadata) locals: results =  self =  <tensorflow.python.client.session.Session object at 0x7f46458c3d60> self._do_run =  <bound method BaseSession._do_run of <tensorflow.python.client.session.Session object at 0x7f46458c3d60>> handle =  None final_targets =  [<tf.Operation 'conformer_1_conv_mod_bn/batch_norm/cond/Merge_1' type=Merge>, <tf.Operation 'optim_and_step_incr' type=NoOp>] final_fetches =  [<tf.Tensor 'objective/add:0' shape=() dtype=float32>, <tf.Tensor 'objective/loss/loss/FastBaumWelchLoss/generic_loss_and_error_signal:0' shape=() dtype=float32>, <tf.Tensor 'objective/loss/loss_init/truediv:0' shape=() dtype=float32>, <tf.Tensor 'globals/mem_usage_deviceGPU0:0' shape=() dtype=in... 
feed_dict_tensor =  {<Reference wrapping <tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049... options =  None run_metadata =  None File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1372, in BaseSession._do_run line: return self._do_call(_run_fn, feeds, fetches, targets, options, run_metadata) locals: self =  <tensorflow.python.client.session.Session object at 0x7f46458c3d60> self._do_call =  <bound method BaseSession._do_call of <tensorflow.python.client.session.Session object at 0x7f46458c3d60>> _run_fn =  <function BaseSession._do_run.<locals>._run_fn at 0x7f4267b80c10> feeds =  {<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f80b9630>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001... fetches =  [<tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f2b68ef0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44f2b688b0>, <tensorflow.python.client._pywrap_tf_session.TF_Output object at 0x7f44ef901eb0>, <tensorflow.python.client._pywrap_tf_session.TF_Ou... 
targets =  [<tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f44eaac5d70>, <tensorflow.python.client._pywrap_tf_session.TF_Operation object at 0x7f44eaac5db0>] options =  None run_metadata =  None File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1398, in BaseSession._do_call line: raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter locals: type =  <class 'type'> e =  node_def =  name: "objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch" op: "PyFunc" input: "extern_data/placeholders/seq_tag/seq_tag" attr { key: "token" value { s: "pyfunc_0" } } attr { key: "Tout" value { list { type: DT_INT32 type: DT_FLOAT type: DT_INT... op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> message =  'Graph execution error:\n\nDetected at node \'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch\' defined at (most recent call last):\n File "./returnn/rnn.py", line 11, in \n main()\n File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__mai..., len = 12234 UnknownError: Graph execution error: Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", 
line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = 
self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. 
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance self._maybe_create_new_instance() File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init self._start_child() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]] (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File 
"/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance self._maybe_create_new_instance() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init self._start_child() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. 
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler return dispatch_target(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func return py_func_common(func, inp, Tout, stateful, name=name) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common return _internal_py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func result = gen_script_ops.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func _, _, _op, _outputs = _op_def_library._apply_op_helper( File 
"/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal ret = Operation.from_node_def( During handling of the above exception, another exception occurred: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4341, in help_on_tf_exception line: debug_fetch, fetch_helpers, op_copied = FetchHelper.copy_graph( debug_fetch, target_op=op, fetch_helper_tensors=list(op.inputs), stop_at_ts=stop_at_ts, verbose_stream=file, ) locals: debug_fetch =  <tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder> fetch_helpers =  op_copied =  FetchHelper =  <class 'returnn.tf.util.basic.FetchHelper'> FetchHelper.copy_graph =  <bound method FetchHelper.copy_graph of <class 'returnn.tf.util.basic.FetchHelper'>> target_op =  op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> fetch_helper_tensors =  list =  <class 'list'> op.inputs =  (<tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>,) stop_at_ts =  [<tf.Tensor 'extern_data/placeholders/data/data:0' shape=(?, ?, 1) dtype=float32>, <tf.Tensor 'extern_data/placeholders/seq_tag/seq_tag:0' shape=(?,) dtype=string>, <tf.Tensor 'extern_data/placeholders/data/data_dim0_size:0' shape=(?,) dtype=int32>, <tf.Tensor 'extern_data/placeholders/batch_dim:... 
verbose_stream =  file =  <returnn.log.Stream object at 0x7f4646730e50> File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py", line 7700, in FetchHelper.copy_graph line: assert target_op in ops, "target_op %r,\nops\n%s" % (target_op, pformat(ops)) locals: target_op =  <tf.Operation 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' type=PyFunc> ops =  [<tf.Operation 'extern_data/placeholders/seq_tag/seq_tag' type=Placeholder>] pformat =  <function pformat at 0x7f464aa7ec10> AssertionError: target_op , ops [] Step meta information: {'seq_idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38], 'seq_tag': ['switchboard-1/sw02721B/sw2721B-ms98-a-0031', 'switchboard-1/sw02427A/sw2427A-ms98-a-0021', 'switchboard-1/sw02848B/sw2848B-ms98-a-0086', 'switchboard-1/sw04037A/sw4037A-ms98-a-0027', 'switchboard-1/sw02370B/sw2370B-ms98-a-0117', 'switchboard-1/sw02145A/sw2145A-ms98-a-0107', 'switchboard-1/sw02484A/sw2484A-ms98-a-0077', 'switchboard-1/sw02768A/sw2768A-ms98-a-0064', 'switchboard-1/sw03312B/sw3312B-ms98-a-0041', 'switchboard-1/sw02344B/sw2344B-ms98-a-0023', 'switchboard-1/sw04248B/sw4248B-ms98-a-0017', 'switchboard-1/sw02762A/sw2762A-ms98-a-0059', 'switchboard-1/sw03146A/sw3146A-ms98-a-0047', 'switchboard-1/sw03032A/sw3032A-ms98-a-0065', 'switchboard-1/sw02288A/sw2288A-ms98-a-0080', 'switchboard-1/sw02751A/sw2751A-ms98-a-0066', 'switchboard-1/sw02369A/sw2369A-ms98-a-0118', 'switchboard-1/sw04169A/sw4169A-ms98-a-0059', 'switchboard-1/sw02227A/sw2227A-ms98-a-0016', 'switchboard-1/sw02061B/sw2061B-ms98-a-0170', 'switchboard-1/sw02862B/sw2862B-ms98-a-0033', 'switchboard-1/sw03116B/sw3116B-ms98-a-0065', 'switchboard-1/sw03517B/sw3517B-ms98-a-0038', 'switchboard-1/sw02360B/sw2360B-ms98-a-0086', 'switchboard-1/sw02510B/sw2510B-ms98-a-0061', 'switchboard-1/sw03919A/sw3919A-ms98-a-0017', 
'switchboard-1/sw02965A/sw2965A-ms98-a-0045', 'switchboard-1/sw03154A/sw3154A-ms98-a-0073', 'switchboard-1/sw02299A/sw2299A-ms98-a-0005', 'switchboard-1/sw04572A/sw4572A-ms98-a-0026', 'switchboard-1/sw02682A/sw2682A-ms98-a-0022', 'switchboard-1/sw02808A/sw2808A-ms98-a-0014', 'switchboard-1/sw04526A/sw4526A-ms98-a-0026', 'switchboard-1/sw03180B/sw3180B-ms98-a-0010', 'switchboard-1/sw03227A/sw3227A-ms98-a-0029', 'switchboard-1/sw03891B/sw3891B-ms98-a-0008', 'switchboard-1/sw03882B/sw3882B-ms98-a-0041', 'switchboard-1/sw03102B/sw3102B-ms98-a-0027', 'switchboard-1/sw02454A/sw2454A-ms98-a-0029']} Feed dict: : int(39) : shape (39, 10208, 1), dtype float32, min/max -1.0/1.0, mean/stddev 0.0014351769/0.11459725, Tensor{'data', [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]} : shape (39,), dtype int32, min/max 4760/10208, ([ 4760 6246 6372 6861 7296 7499 7534 7622 7824 8031 8295 8431 8690 8675 8667 8886 9084 9199 9163 9156 9274 9262 9540 9668 9678 9719 9711 9902 9989 10010 10020 10073 10006 10102 10131 10112 10130 10178 10208]) : type , Tensor{'seq_tag', [B?], dtype='string'} : bool(True) Save model under output/models/epoch.001.crash_0 Trainer not finalized, quitting. (pid 356353) ```
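For context on the `EOFError` in this log: the Sprint subprocess protocol reads a 4-byte size prefix and then a pickled payload, so if RASR dies before writing anything, the very first 4-byte read hits EOF after 0 bytes, which is exactly the error shown above. A minimal sketch of that framing, reconstructed from the traceback (the helper names and byte order here are assumptions, not the actual `returnn.util.basic` code):

```python
import io
import pickle
import struct


def write_pickled_object(f, obj):
    # Write a 4-byte length prefix followed by the pickled payload.
    data = pickle.dumps(obj)
    f.write(struct.pack("<i", len(data)))
    f.write(data)


def read_pickled_object(f):
    # Read back one length-prefixed pickled object. If the peer died
    # before writing anything, the 4-byte size read hits EOF immediately.
    raw = f.read(4)
    if len(raw) < 4:
        raise EOFError("expected to read 4 bytes but got EOF after %i bytes" % len(raw))
    (size,) = struct.unpack("<i", raw)
    return pickle.loads(f.read(size))


buf = io.BytesIO()
write_pickled_object(buf, {"op": "init", "ok": True})
buf.seek(0)
print(read_pickled_object(buf))  # -> {'op': 'init', 'ok': True}
```

So the `EOFError` is only a symptom: the child process produced no output at all, meaning the real failure is on the RASR side before it ever answers.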
vieting commented 12 months ago

I get the same error when using a tf 2.14 image and RASR compiled using that image.

albertz commented 12 months ago

Is that the original stdout + stderr, or just the log?

It looks a bit like RASR maybe does not start correctly at all. If it started, you should e.g. see this on stdout:

print("RETURNN SprintControl[pid %i] Python module load" % os.getpid())

And then:

    print(
        (
            "RETURNN SprintControl[pid %i] init: "
            "name=%r, sprint_unit=%r, version_number=%r, callback=%r, ref=%r, config=%r, kwargs=%r"
        )
        % (os.getpid(), name, sprint_unit, version_number, callback, reference, config, kwargs)
    )

If you don't see that, then my recent fixes, and also Tina's patch, are not really related to your issue at all.

You should check the RASR log then. There should be some error from RASR, probably Python-related, e.g. that it could not load the Python module, or that some import is missing.
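To make that check concrete, one could scan the captured stdout for the SprintControl init line; a small sketch (the log path and demo content here are placeholders, adjust to your setup):

```python
import os
import tempfile

# Placeholder path; point this at your actual RETURNN stdout capture.
log_path = os.path.join(tempfile.gettempdir(), "returnn_stdout_demo.log")

# Demo content so the sketch is self-contained; drop this for real use.
with open(log_path, "w") as f:
    f.write("RETURNN SprintControl[pid 123] Python module load\n")

# If this line is absent, RASR most likely failed before its Python init,
# and the RASR log itself should be inspected next.
with open(log_path) as f:
    loaded = any("RETURNN SprintControl" in line for line in f)
print("SprintControl module loaded" if loaded
      else "no SprintControl init line: RASR likely failed before Python init")
```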

vieting commented 12 months ago

What I posted before was from the log file. The following is copied from stdout and stderr (with the tf 2.14 image, which was also used for RASR compilation):

``` vieting@cn-251:/work/asr4/vieting/tmp/20231108_tf213_sprint_op$ ./run_example_rasr_tf214.sh RETURNN starting up, version 1.20231108.140626+git.9fe93590.dirty, date/time 2023-11-08-16-43-54 (UTC+0100), pid 2130233, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3 RETURNN command line options: ['returnn.tf214.config'] Hostname: cn-251 2023-11-08 16:44:01.024863: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered 2023-11-08 16:44:01.024944: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered 2023-11-08 16:44:01.034051: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered 2023-11-08 16:44:02.271356: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags. TensorFlow: 2.14.0 (v2.14.0-rc1-21-g4dacf3f368e) ( in /usr/local/lib/python3.11/dist-packages/tensorflow) Use num_threads=1 (but min 2) via OMP_NUM_THREADS. Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}. CUDA_VISIBLE_DEVICES is set to '2'. Collecting TensorFlow device list... 
2023-11-08 16:44:23.424846: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /device:GPU:0 with 10396 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1
Local devices available to TensorFlow:
  1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 11581945563073303627 xla_global_id: -1
  2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10901061632 locality { bus_id: 2 numa_node: 1 links { } } incarnation: 1815047742352363074 physical_device_desc: "device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1" xla_global_id: 416903419
Using gpu device 2: NVIDIA GeForce GTX 1080 Ti
Hostname 'cn-251', GPU 2, GPU-dev-name 'NVIDIA GeForce GTX 1080 Ti', GPU-memory 10.2GB
Train data:
  input: 1 x 1
  output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]}
  OggZipDataset, sequences: 249229, frames: unknown
Dev data:
  OggZipDataset, sequences: 300, frames: unknown
Learning-rate-control: file learning_rates.swb.ctc does not exist yet
Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ...
2023-11-08 16:44:31.951062: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10396 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1
layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32
layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32
layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32
DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input.
This will be disallowed with behavior_version 8.
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32
layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed
This will be disallowed with behavior_version 6.
layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32
layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_1:channel'(32)] float32
WARNING:tensorflow:From /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py:1723: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32
layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32
layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32
layer /'conv_merged': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channel*conv_l:channel//2)*conv_3:channel'(24000)] float32
layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
2023-11-08 16:44:32.241797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10396 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1
layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32
WARNING:tensorflow:From /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py:54: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version.
Instructions for updating:
tf.py_func is deprecated in TF V2. Instead, there are two options available in V2.
- tf.py_function takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means `tf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape.
- tf.numpy_function maintains the semantics of the deprecated tf.py_func (it is not differentiable, and manipulates numpy arrays). It drops the stateful argument making all functions stateful.
Network layer topology:
  extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'}
  used data keys: ['data', 'seq_tag']
  layers:
    layer conv 'conv_1' #: 32
    layer pool 'conv_1_pool' #: 32
    layer conv 'conv_2' #: 64
    layer conv 'conv_3' #: 64
    layer merge_dims 'conv_merged' #: 24000
    layer split_dims 'conv_source' #: 1
    layer source 'data' #: 1
    layer copy 'encoder' #: 512
    layer subnetwork 'features' #: 750
    layer conv 'features/conv_h' #: 150
    layer eval 'features/conv_h_act' #: 150
    layer variable 'features/conv_h_filter' #: 150
    layer split_dims 'features/conv_h_split' #: 1
    layer conv 'features/conv_l' #: 5
    layer layer_norm 'features/conv_l_act' #: 750
    layer eval 'features/conv_l_act_no_norm' #: 750
    layer merge_dims 'features/conv_l_merge' #: 750
    layer copy 'features/output' #: 750
    layer linear 'input_linear' #: 512
    layer softmax 'output' #: 88
    layer copy 'specaug' #: 750
net params #: 12409788
net trainable params: [, , , , , , , , , , , , ]
2023-11-08 16:44:34.658621: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:382] MLIR V1 optimization pass is not enabled
start training at epoch 1
using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128
learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None
pretrain: None
start epoch 1 with learning rate 1.325e-05 ...
TF: log_dir: output/models/train-2023-11-08-15-43-53
Create optimizer with options {'epsilon': 1e-08, 'learning_rate': }.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [, , , , , , , , , , , , , , ].
2023-11-08 16:44:39.517531: W tensorflow/c/c_api.cc:305] Operation '{name:'global_step' id:357 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
SprintSubprocessInstance: exec ['/work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:35,p2c_fd:36,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=yes', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']
SprintSubprocessInstance: starting, pid 2130824
/work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard: Relink `/usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2' with `/lib/x86_64-linux-gnu/libz.so.1' for IFUNC symbol `crc32_z'
2023-11-08 16:44:43.478818: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-08 16:44:43.478967: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-08 16:44:43.479063: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
configuration error: failed to open file "neural-network-trainer.config" for reading. (No such file or directory)
RETURNN SprintControl[pid 2130824] Python module load
RETURNN SprintControl[pid 2130824] init: name='Sprint.PythonControl', sprint_unit='NnTrainer.pythonControl', version_number=5, callback=, ref=, config={'c2p_fd': '35', 'p2c_fd': '36', 'minPythonControlVersion': '4'}, kwargs={}
RETURNN SprintControl[pid 2130824] PythonControl create {'c2p_fd': 35, 'p2c_fd': 36, 'name': 'Sprint.PythonControl', 'reference': , 'config': {'c2p_fd': '35', 'p2c_fd': '36', 'minPythonControlVersion': '4'}, 'sprint_unit': 'NnTrainer.pythonControl', 'version_number': 5, 'min_version_number': 4, 'callback': }
RETURNN SprintControl[pid 2130824] PythonControl init {'name': 'Sprint.PythonControl', 'reference': , 'config': {'c2p_fd': '35', 'p2c_fd': '36', 'minPythonControlVersion': '4'}, 'sprint_unit': 'NnTrainer.pythonControl', 'version_number': 5, 'min_version_number': 4, 'callback': }
RETURNN SprintControl[pid 2130824] init for Sprint.PythonControl {'reference': , 'config': {'c2p_fd': '35', 'p2c_fd': '36', 'minPythonControlVersion': '4'}}
RETURNN SprintControl[pid 2130824] PythonControl run_control_loop: , {}
RETURNN SprintControl[pid 2130824] PythonControl run_control_loop control: 'RWTH ASR 0.9beta (431c74d54b895a2a4c3689bcd5bf641a878bb925)\n'
SprintSubprocessInstance: exec ['/work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:36,p2c_fd:38,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', '--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', '--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=yes', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']
SprintSubprocessInstance: starting, pid 2130845
/work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard: Relink `/usr/local/lib/python3.11/dist-packages/tensorflow/libtensorflow_framework.so.2' with `/lib/x86_64-linux-gnu/libz.so.1' for IFUNC symbol `crc32_z'
2023-11-08 16:44:44.788087: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-08 16:44:44.788217: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-08 16:44:44.788276: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
configuration error: failed to open file "neural-network-trainer.config" for reading.
(No such file or directory)
RETURNN SprintControl[pid 2130845] Python module load
RETURNN SprintControl[pid 2130845] init: name='Sprint.PythonControl', sprint_unit='NnTrainer.pythonControl', version_number=5, callback=, ref=, config={'c2p_fd': '36', 'p2c_fd': '38', 'minPythonControlVersion': '4'}, kwargs={}
RETURNN SprintControl[pid 2130845] PythonControl create {'c2p_fd': 36, 'p2c_fd': 38, 'name': 'Sprint.PythonControl', 'reference': , 'config': {'c2p_fd': '36', 'p2c_fd': '38', 'minPythonControlVersion': '4'}, 'sprint_unit': 'NnTrainer.pythonControl', 'version_number': 5, 'min_version_number': 4, 'callback': }
RETURNN SprintControl[pid 2130845] PythonControl init {'name': 'Sprint.PythonControl', 'reference': , 'config': {'c2p_fd': '36', 'p2c_fd': '38', 'minPythonControlVersion': '4'}, 'sprint_unit': 'NnTrainer.pythonControl', 'version_number': 5, 'min_version_number': 4, 'callback': }
RETURNN SprintControl[pid 2130845] init for Sprint.PythonControl {'reference': , 'config': {'c2p_fd': '36', 'p2c_fd': '38', 'minPythonControlVersion': '4'}}
RETURNN SprintControl[pid 2130845] PythonControl run_control_loop: , {}
RETURNN SprintControl[pid 2130845] PythonControl run_control_loop control: 'RWTH ASR 0.9beta (431c74d54b895a2a4c3689bcd5bf641a878bb925)\n'
2023-11-08 16:45:03.663421: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:442] Loaded cuDNN version 8600
Fatal Python error: Segmentation fault

Current thread 0x00007f69453ea380 (most recent call first):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 499 in _handle_cmd_export_allophone_state_fsa_by_segment_name
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 509 in _handle_cmd
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 524 in handle_next
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 550 in run_control_loop
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.h5r, h5py.utils, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5t, h5py._conv, h5py.h5z, h5py._proxy, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector (total: 37)
PROGRAM DEFECTIVE (TERMINATED BY SIGNAL): Segmentation fault
Creating stack trace (innermost first):
#2 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f6947720520]
#3 /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f69477749fc]
#4 /lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f6947720476]
#5 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f6947720520]
#6 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK3Ftl13TrimAutomatonIN3Fsa9AutomatonEE8getStateEj+0x3a) [0x55d2626e440a]
#7 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK3Ftl14CacheAutomatonIN3Fsa9AutomatonEE8getStateEj+0x3a2) [0x55d2626f3c72]
#8 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(+0x9fb257) [0x55d262675257]
#9 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(+0x9fe9ac) [0x55d2626789ac]
#10 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK2Am15TransitionModel5applyEN4Core3RefIKN3Fsa9AutomatonEEEib+0x274) [0x55d262671194]
#11 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Am24ClassicTransducerBuilder20applyTransitionModelEN4Core3RefIKN3Fsa9AutomatonEEE+0x387) [0x55d262660df7]
#12 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech26AllophoneStateGraphBuilder17addLoopTransitionEN4Core3RefIKN3Fsa9AutomatonEEE+0x123) [0x55d262482e43]
#13 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech23CTCTopologyGraphBuilder17addLoopTransitionEN4Core3RefIKN3Fsa9AutomatonEEE+0x53) [0x55d262483183]
#14 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech23CTCTopologyGraphBuilder15buildTransducerEN4Core3RefIKN3Fsa9AutomatonEEE+0x8f) [0x55d262485cbf]
#15 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech26AllophoneStateGraphBuilder15buildTransducerERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x66) [0x55d262480516]
#16 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech26AllophoneStateGraphBuilder5buildERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x2e) [0x55d262480d5e]
#17 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK2Nn25AllophoneStateFsaExporter23exportFsaForOrthographyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x54) [0x55d262359054]
#18 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Nn13PythonControl8Internal32exportAllophoneStateFsaBySegNameEP7_objectS3_+0x133) [0x55d26233e833]
#19 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Nn13PythonControl8Internal8callbackEP7_objectS3_+0x25d) [0x55d26233ee6d]
#20 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x1cd073) [0x7f697baa0073]
#21 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyObject_MakeTpCall+0x87) [0x7f697ba50ff7]
#22 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x477a) [0x7f697b9de96a]
#23 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x26bf9a) [0x7f697bb3ef9a]
#24 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x181058) [0x7f697ba54058]
#25 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x50ae) [0x7f697b9df29e]
#26 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x26bf9a) [0x7f697bb3ef9a]
#27 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x181058) [0x7f697ba54058]
#28 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x50ae) [0x7f697b9df29e]
#29 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x26bf9a) [0x7f697bb3ef9a]
#30 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x1810d8) [0x7f697ba540d8]
#31 /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyObject_Call+0x128) [0x7f697ba53b88]
#32 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Python8PyCallKwEP7_objectPKcS3_z+0xe6) [0x55d26258c876]
#33 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Nn13PythonControl16run_control_loopEv+0x5f) [0x55d262332fbf]
#34 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN9NnTrainer13pythonControlEv+0x167) [0x55d2620df317]
#35 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN9NnTrainer4mainERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EE+0x303) [0x55d2620b8e13]
#36 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN4Core11Application3runERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x23) [0x55d26211e413]
#37 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN4Core11Application4mainEiPPc+0x577) [0x55d2620ba577]
#38 /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(main+0x3d) [0x55d2620b852d]
#39 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6947707d90]
#40 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6947707e40]
#41
/work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_start+0x25) [0x55d2620dd7a5]
Exception in py_wrap_get_sprint_automata_for_batch:
EXCEPTION
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in get_sprint_automata_for_batch_op..py_wrap_get_sprint_automata_for_batch
    line: return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
    locals:
      py_get_sprint_automata_for_batch = 
      sprint_opts = {'sprintExecPath': '/work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', 'sprintConfigStr': '--*.configuration.channel=output-channel --*.real-time-factor.channel=output-channel --*.system-info.channel=output-channel --*.time.channel=output-channel --*.version....
      tags = 
      py_tags = array([b'switchboard-1/sw02721B/sw2721B-ms98-a-0031', b'switchboard-1/sw02427A/sw2427A-ms98-a-0021', b'switchboard-1/sw02848B/sw2848B-ms98-a-0086', b'switchboard-1/sw04037A/sw4037A-ms98-a-0027', b'switchboard-1/sw02370B/sw2370B-ms98-a-0117', b'switchboard-1/sw02...
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    line: edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
    locals:
      edges = 
      weights = 
      start_end_states = 
      sprint_instance_pool = 
      sprint_instance_pool.get_automata_for_batch = >
      tags = array([b'switchboard-1/sw02721B/sw2721B-ms98-a-0031', b'switchboard-1/sw02427A/sw2427A-ms98-a-0021', b'switchboard-1/sw02848B/sw2848B-ms98-a-0086', b'switchboard-1/sw04037A/sw4037A-ms98-a-0027', b'switchboard-1/sw02370B/sw2370B-ms98-a-0117', b'switchboard-1/sw02...
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in SprintInstancePool.get_automata_for_batch
    line: r = instance._read()
    locals:
      r = ('ok', 9, 22, array([ 1, 2, 3, 4, 5, 6, 7, 0, 1, 2, 3, 4, 5, 6, 0, 2, 4, 6, 7, 5, 6, 4, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 2, 4, 6, 8, 8, 8, 8, 8, 0, 6, 0, 22, 0, 48, 0, 0, 6, 0, 22, 0, 48, 0, 6, 22, 48, 48, 0, 48,...
      instance = 
      instance._read = >
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in SprintSubprocessInstance._read
    line: return util.read_pickled_object(p)
    locals:
      util = 
      util.read_pickled_object = 
      p = <_io.FileIO name=35 mode='rb' closefd=True>
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    line: size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
    locals:
      size_raw = 
      read_bytes_to_new_buffer = 
      p = <_io.FileIO name=35 mode='rb' closefd=True>
      getvalue = 
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    line: raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
    locals:
      EOFError = 
      size = 4
      read_size = 0
EOFError: expected to read 4 bytes but got EOF after 0 bytes
2023-11-08 16:45:06.805151: W tensorflow/core/framework/op_kernel.cc:1827] UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch
    r = instance._read()
        ^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes
2023-11-08 16:45:06.805314: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4669204044388377120
2023-11-08 16:45:06.805394: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 14394728958513161507
2023-11-08 16:45:06.805423: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4611900397994247129
2023-11-08 16:45:06.805450: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11246935140361182411
2023-11-08 16:45:06.805476: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 3527483492372743068
2023-11-08 16:45:06.805500: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 455321662105441778
2023-11-08 16:45:06.805527: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 4997316685218163964
2023-11-08 16:45:06.805550: I tensorflow/core/framework/local_rendezvous.cc:421] Local rendezvous recv item cancelled. Key hash: 11970666840078253952
TensorFlow exception: Graph execution error:
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch
    r = instance._read()
        ^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes
  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
  [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_127]]
  (1) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
          ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch r = instance._read() ^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/deprecation.py", line 383, in new_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 798, in py_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 773, in py_func_common File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 380, in _internal_py_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/op_def_library.py", line 796, in 
_apply_op_helper File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def Exception UnknownError() in step 0. (pid 2130233) Failing op: We tried to fetch the op inputs ([]) but got another exception: target_op , ops [] EXCEPTION Traceback (most recent call last): File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1402, in BaseSession._do_call line: return fn(*args) locals: fn = ._run_fn at 0x7ff04bb38860> args = ({: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.00... File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1385, in BaseSession._do_run.._run_fn line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata) locals: self = self._call_tf_sessionrun = > options = None feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001... fetch_list = [, , , [] run_metadata = None File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1478, in BaseSession._call_tf_sessionrun line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata) locals: tf_session = tf_session.TF_SessionRun_wrapper = self = self._session = options = None feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001... fetch_list = [, , , [] run_metadata = None UnknownError: 2 root error(s) found. 
(0) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes Traceback (most recent call last): File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__ ret = func(*args) ^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch r = instance._read() ^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_127]] (1) UNKNOWN: EOFError: expected to read 4 bytes but got 
EOF after 0 bytes Traceback (most recent call last): File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__ ret = func(*args) ^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch r = instance._read() ^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. 
During handling of the above exception, another exception occurred: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 744, in Runner.run line: fetches_results = sess.run( fetches_dict, feed_dict=feed_dict, options=run_options ) # type: typing.Dict[str,typing.Union[numpy.ndarray,str]] locals: fetches_results = sess = sess.run = > fetches_dict = {'size:data:0': , 'loss': , 'cost:output': , 'loss_norm_..., len = 7 feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001... options = run_options = None File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 972, in BaseSession.run line: result = self._run(None, fetches, feed_dict, options_ptr, run_metadata_ptr) locals: result = self = self._run = > fetches = {'size:data:0': , 'loss': , 'cost:output': , 'loss_norm_..., len = 7 feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001... options_ptr = None run_metadata_ptr = None File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1215, in BaseSession._run line: results = self._do_run(handle, final_targets, final_fetches, feed_dict_tensor, options, run_metadata) locals: results = self = self._do_run = > handle = None final_targets = [] final_fetches = [, , , {>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049... options = None run_metadata = None File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1395, in BaseSession._do_run line: return self._do_call(_run_fn, feeds, fetches, targets, options, run_metadata) locals: self = self._do_call = > _run_fn = ._run_fn at 0x7ff04bb38860> feeds = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. 
]], [[-0.00226238], [-0.01049833], [-0.001... fetches = [, , , [] options = None run_metadata = None File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1421, in BaseSession._do_call line: raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter locals: type = e = node_def = name: "objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch" op: "PyFunc" input: "extern_data/placeholders/seq_tag/seq_tag" attr { key: "token" value { s: "pyfunc_0" } } attr { key: "Tout" value { list { type: DT_INT32 type: DT_FLOAT type: DT_INT... op = message = 'Graph execution error:\n\nDetected at node \'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch\' defined at (most recent call last):\n File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in \n File "/work/asr4/vieting/tmp/20231108_tf2..., len = 8772 UnknownError: Graph execution error: Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. 
(0) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes Traceback (most recent call last): File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__ ret = func(*args) ^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch r = instance._read() ^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_127]] (1) UNKNOWN: EOFError: expected to read 4 bytes but got 
EOF after 0 bytes Traceback (most recent call last): File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__ ret = func(*args) ^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch r = instance._read() ^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. 
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/deprecation.py", line 383, in new_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 798, in py_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 773, in py_func_common File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 380, in _internal_py_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def During handling of the above exception, another exception occurred: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4341, in help_on_tf_exception line: debug_fetch, fetch_helpers, op_copied = FetchHelper.copy_graph( debug_fetch, target_op=op, fetch_helper_tensors=list(op.inputs), stop_at_ts=stop_at_ts, verbose_stream=file, ) locals: debug_fetch = fetch_helpers = op_copied = FetchHelper = FetchHelper.copy_graph = > target_op = op = fetch_helper_tensors = list = op.inputs = (,) stop_at_ts = [, , , file = File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py", line 7700, in FetchHelper.copy_graph line: assert 
target_op in ops, "target_op %r,\nops\n%s" % (target_op, pformat(ops))
locals: target_op = ops = [] pformat =
AssertionError: target_op , ops []
Step meta information: {
  'seq_idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
              20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38],
  'seq_tag': ['switchboard-1/sw02721B/sw2721B-ms98-a-0031', 'switchboard-1/sw02427A/sw2427A-ms98-a-0021',
              'switchboard-1/sw02848B/sw2848B-ms98-a-0086', 'switchboard-1/sw04037A/sw4037A-ms98-a-0027',
              'switchboard-1/sw02370B/sw2370B-ms98-a-0117', 'switchboard-1/sw02145A/sw2145A-ms98-a-0107',
              'switchboard-1/sw02484A/sw2484A-ms98-a-0077', 'switchboard-1/sw02768A/sw2768A-ms98-a-0064',
              'switchboard-1/sw03312B/sw3312B-ms98-a-0041', 'switchboard-1/sw02344B/sw2344B-ms98-a-0023',
              'switchboard-1/sw04248B/sw4248B-ms98-a-0017', 'switchboard-1/sw02762A/sw2762A-ms98-a-0059',
              'switchboard-1/sw03146A/sw3146A-ms98-a-0047', 'switchboard-1/sw03032A/sw3032A-ms98-a-0065',
              'switchboard-1/sw02288A/sw2288A-ms98-a-0080', 'switchboard-1/sw02751A/sw2751A-ms98-a-0066',
              'switchboard-1/sw02369A/sw2369A-ms98-a-0118', 'switchboard-1/sw04169A/sw4169A-ms98-a-0059',
              'switchboard-1/sw02227A/sw2227A-ms98-a-0016', 'switchboard-1/sw02061B/sw2061B-ms98-a-0170',
              'switchboard-1/sw02862B/sw2862B-ms98-a-0033', 'switchboard-1/sw03116B/sw3116B-ms98-a-0065',
              'switchboard-1/sw03517B/sw3517B-ms98-a-0038', 'switchboard-1/sw02360B/sw2360B-ms98-a-0086',
              'switchboard-1/sw02510B/sw2510B-ms98-a-0061', 'switchboard-1/sw03919A/sw3919A-ms98-a-0017',
              'switchboard-1/sw02965A/sw2965A-ms98-a-0045', 'switchboard-1/sw03154A/sw3154A-ms98-a-0073',
              'switchboard-1/sw02299A/sw2299A-ms98-a-0005', 'switchboard-1/sw04572A/sw4572A-ms98-a-0026',
              'switchboard-1/sw02682A/sw2682A-ms98-a-0022', 'switchboard-1/sw02808A/sw2808A-ms98-a-0014',
              'switchboard-1/sw04526A/sw4526A-ms98-a-0026', 'switchboard-1/sw03180B/sw3180B-ms98-a-0010',
              'switchboard-1/sw03227A/sw3227A-ms98-a-0029', 'switchboard-1/sw03891B/sw3891B-ms98-a-0008',
              'switchboard-1/sw03882B/sw3882B-ms98-a-0041', 'switchboard-1/sw03102B/sw3102B-ms98-a-0027',
              'switchboard-1/sw02454A/sw2454A-ms98-a-0029']}
Feed dict:
  : int(39)
  : shape (39, 10208, 1), dtype float32, min/max -1.0/1.0, mean/stddev 0.0014351769/0.11459725,
    Tensor{'data', [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}
  : shape (39,), dtype int32, min/max 4760/10208,
    ([ 4760  6246  6372  6861  7296  7499  7534  7622  7824  8031  8295  8431  8690  8675
       8667  8886  9084  9199  9163  9156  9274  9262  9540  9668  9678  9719  9711  9902
       9989 10010 10020 10073 10006 10102 10131 10112 10130 10178 10208])
  : type , Tensor{'seq_tag', [B?], dtype='string'}
  : bool(True)
EXCEPTION
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1402, in BaseSession._do_call
    line: return fn(*args)
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1385, in BaseSession._do_run.._run_fn
    line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1478, in BaseSession._call_tf_sessionrun
    line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
UnknownError: 2 root error(s) found.
  (0) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 270, in __call__
    ret = func(*args)
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper
    return func(*args, **kwargs)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch
    return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch
    edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 528, in get_automata_for_batch
    r = instance._read()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read
    return util.read_pickled_object(p)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object
    size_raw = read_bytes_to_new_buffer(p, 4).getvalue()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer
    raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size))
EOFError: expected to read 4 bytes but got EOF after 0 bytes
  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
  [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_127]]
  (1) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

EXCEPTION
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 744, in Runner.run
    line: fetches_results = sess.run(fetches_dict, feed_dict=feed_dict, options=run_options)  # type: typing.Dict[str,typing.Union[numpy.ndarray,str]]
    locals: fetches_dict = {'size:data:0': , 'loss': , 'cost:output': , 'loss_norm_..., len = 7
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 972, in BaseSession.run
    line: result = self._run(None, fetches, feed_dict, options_ptr, run_metadata_ptr)
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1215, in BaseSession._run
    line: results = self._do_run(handle, final_targets, final_fetches, feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1395, in BaseSession._do_run
    line: return self._do_call(_run_fn, feeds, fetches, targets, options, run_metadata)
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/client/session.py", line 1421, in BaseSession._do_call
    line: raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
    locals: node_def = name: "objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch" op: "PyFunc"
              input: "extern_data/placeholders/seq_tag/seq_tag"
              attr { key: "token" value { s: "pyfunc_0" } }
              attr { key: "Tout" value { list { type: DT_INT32 type: DT_FLOAT type: DT_INT...
            message = 'Graph execution error: ...', len = 8772
UnknownError: Graph execution error:

Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in <module>
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
  [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_127]]
  (1) UNKNOWN: EOFError: expected to read 4 bytes but got EOF after 0 bytes
  [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/./returnn/rnn.py", line 11, in <module>
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/deprecation.py", line 383, in new_func
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/util/dispatch.py", line 1260, in op_dispatch_handler
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 798, in py_func
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 773, in py_func_common
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/script_ops.py", line 380, in _internal_py_func
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/op_def_library.py", line 796, in _apply_op_helper
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py", line 2657, in _create_op_internal
  File "/usr/local/lib/python3.11/dist-packages/tensorflow/python/framework/ops.py", line 1161, in from_node_def

Save model under output/models/epoch.001.crash_0
Trainer not finalized, quitting. (pid 2130233)
SprintSubprocessInstance: interrupt child proc 2130824
```
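The `EOFError` in `read_pickled_object` just means the pipe to the RASR subprocess was closed before RETURNN could even read the 4-byte length prefix of the next reply, i.e. the child process died; the real cause is on the RASR side. For illustration, this kind of length-prefixed pickle framing can be sketched roughly as follows (function names and the byte order are assumptions for the sketch, not RETURNN's exact code):

```python
import io
import pickle
import struct

def read_exact(stream, size):
    """Read exactly `size` bytes; raise EOFError on a short read.
    This mirrors the 'expected to read N bytes but got EOF' error
    seen when the writer process dies and the pipe closes."""
    buf = io.BytesIO()
    while buf.tell() < size:
        chunk = stream.read(size - buf.tell())
        if not chunk:
            raise EOFError(
                "expected to read %i bytes but got EOF after %i bytes"
                % (size, buf.tell()))
        buf.write(chunk)
    return buf.getvalue()

def read_pickled(stream):
    # 4-byte length prefix (byte order assumed little-endian here),
    # then the pickled payload itself.
    (size,) = struct.unpack("<i", read_exact(stream, 4))
    return pickle.loads(read_exact(stream, size))

def write_pickled(stream, obj):
    payload = pickle.dumps(obj)
    stream.write(struct.pack("<i", len(payload)))
    stream.write(payload)
```

With such framing, a crashed peer always surfaces as an `EOFError` on the reader side, which is why the Python-level traceback alone cannot tell you why the subprocess died.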
vieting commented 12 months ago

The RASR log of the nn trainer does not contain anything that looks particularly suspicious to me.

albertz commented 12 months ago

What about this?

```
configuration error: failed to open file "neural-network-trainer.config" for reading. (No such file or directory)
```
albertz commented 12 months ago

And in your stdout, you see the actual error:

```
Fatal Python error: Segmentation fault

Current thread 0x00007f69453ea380 (most recent call first):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 499 in _handle_cmd_export_allophone_state_fsa_by_segment_name
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 509 in _handle_cmd
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 524 in handle_next
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/control.py", line 550 in run_control_loop

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, h5py._errors, h5py.defs, h5py._objects, h5py.h5, h5py.h5r, h5py.utils, h5py.h5s, h5py.h5ac, h5py.h5p, h5py.h5t, h5py._conv, h5py.h5z, h5py._proxy, h5py.h5a, h5py.h5d, h5py.h5ds, h5py.h5g, h5py.h5i, h5py.h5f, h5py.h5fd, h5py.h5pl, h5py.h5o, h5py.h5l, h5py._selector (total: 37)
<?xml version="1.0" encoding="UTF-8"?>
<sprint>

  PROGRAM DEFECTIVE (TERMINATED BY SIGNAL):
  Segmentation fault

  Creating stack trace (innermost first):
  #2  /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f6947720520]
  #3  /lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c) [0x7f69477749fc]
  #4  /lib/x86_64-linux-gnu/libc.so.6(raise+0x16) [0x7f6947720476]
  #5  /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x7f6947720520]
  #6  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK3Ftl13TrimAutomatonIN3Fsa9AutomatonEE8getStateEj+0x3a) [0x55d2626e440a]
  #7  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK3Ftl14CacheAutomatonIN3Fsa9AutomatonEE8getStateEj+0x3a2) [0x55d2626f3c72]
  #8  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(+0x9fb257) [0x55d262675257]
  #9  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(+0x9fe9ac) [0x55d2626789ac]
  #10  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK2Am15TransitionModel5applyEN4Core3RefIKN3Fsa9AutomatonEEEib+0x274) [0x55d262671194]
  #11  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Am24ClassicTransducerBuilder20applyTransitionModelEN4Core3RefIKN3Fsa9AutomatonEEE+0x387) [0x55d262660df7]
  #12  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech26AllophoneStateGraphBuilder17addLoopTransitionEN4Core3RefIKN3Fsa9AutomatonEEE+0x123) [0x55d262482e43]
  #13  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech23CTCTopologyGraphBuilder17addLoopTransitionEN4Core3RefIKN3Fsa9AutomatonEEE+0x53) [0x55d262483183]
  #14  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech23CTCTopologyGraphBuilder15buildTransducerEN4Core3RefIKN3Fsa9AutomatonEEE+0x8f) [0x55d262485cbf]
  #15  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech26AllophoneStateGraphBuilder15buildTransducerERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x66) [0x55d262480516]
  #16  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Speech26AllophoneStateGraphBuilder5buildERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x2e) [0x55d262480d5e]
  #17  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZNK2Nn25AllophoneStateFsaExporter23exportFsaForOrthographyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x54) [0x55d262359054]
  #18  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Nn13PythonControl8Internal32exportAllophoneStateFsaBySegNameEP7_objectS3_+0x133) [0x55d26233e833]
  #19  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Nn13PythonControl8Internal8callbackEP7_objectS3_+0x25d) [0x55d26233ee6d]
  #20  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x1cd073) [0x7f697baa0073]
  #21  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyObject_MakeTpCall+0x87) [0x7f697ba50ff7]
  #22  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x477a) [0x7f697b9de96a]
  #23  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x26bf9a) [0x7f697bb3ef9a]
  #24  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x181058) [0x7f697ba54058]
  #25  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x50ae) [0x7f697b9df29e]
  #26  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x26bf9a) [0x7f697bb3ef9a]
  #27  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x181058) [0x7f697ba54058]
  #28  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyEval_EvalFrameDefault+0x50ae) [0x7f697b9df29e]
  #29  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x26bf9a) [0x7f697bb3ef9a]
  #30  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(+0x1810d8) [0x7f697ba540d8]
  #31  /lib/x86_64-linux-gnu/libpython3.11.so.1.0(_PyObject_Call+0x128) [0x7f697ba53b88]
  #32  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN6Python8PyCallKwEP7_objectPKcS3_z+0xe6) [0x55d26258c876]
  #33  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN2Nn13PythonControl16run_control_loopEv+0x5f) [0x55d262332fbf]
  #34  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN9NnTrainer13pythonControlEv+0x167) [0x55d2620df317]
  #35  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN9NnTrainer4mainERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS6_EE+0x303) [0x55d2620b8e13]
  #36  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN4Core11Application3runERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaIS7_EE+0x23) [0x55d26211e413]
  #37  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_ZN4Core11Application4mainEiPPc+0x577) [0x55d2620ba577]
  #38  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(main+0x3d) [0x55d2620b852d]
  #39  /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6947707d90]
  #40  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6947707e40]
  #41  /work/asr4/hilmes/dev/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_start+0x25) [0x55d2620dd7a5]
```
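The "Fatal Python error: Segmentation fault" block with the "(most recent call first)" stacks is the format produced by Python's `faulthandler` module when the process receives SIGSEGV; the native frames below it come from RASR's own signal handler (the mangled `_ZN...` symbols can be demangled with `c++filt`). As a generic sketch, not tied to this setup, such Python-side dumps can be enabled explicitly:

```python
import faulthandler
import sys

# Print the Python stack of all threads if the process receives a
# fatal signal (SIGSEGV, SIGFPE, SIGABRT, SIGBUS, SIGILL).
faulthandler.enable(file=sys.stderr, all_threads=True)

# A dump can also be triggered manually, e.g. from a watchdog thread:
faulthandler.dump_traceback(file=sys.stderr, all_threads=True)
```

Note that `faulthandler` only shows the Python frames; the crash itself is inside the native RASR code (`Speech::CTCTopologyGraphBuilder::addLoopTransition`), so the C++ stack trace above is the more informative part.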
vieting commented 12 months ago

> What about this?
> `configuration error: failed to open file "neural-network-trainer.config" for reading. (No such file or directory)`

I just use `sprint_opts` with `sprintConfigStr` for the fast_bw loss. I am not sure why this "neural-network-trainer.config" is also checked; I do not define it anywhere in my config.
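(The "neural-network-trainer.config" message presumably just comes from RASR probing for a default config file named after the binary, so it is likely unrelated to the crash.) For reference, the fast_bw loss gets its RASR configuration only through `sprint_opts`; a config fragment typically looks roughly like the following. All paths, flags, and option values here are placeholders for illustration, not taken from this setup:

```python
# Hypothetical RETURNN config fragment; paths and option values are placeholders.
sprint_opts = {
    "sprintExecPath": "/path/to/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard",
    # All RASR options are passed on the command line via this string.
    "sprintConfigStr": "--config=fast_bw.config --*.LOGFILE=nn-trainer.fast_bw.log",
    "numInstances": 2,  # number of RASR subprocesses
}

network = {
    "output": {
        "class": "softmax",
        "from": "encoder",  # placeholder input layer name
        "loss": "fast_bw",
        "loss_opts": {"sprint_opts": sprint_opts, "tdp_scale": 0.0},
    },
}
```

RETURNN then spawns the RASR binary given by `sprintExecPath` as a subprocess and exchanges the automata over pipes, which is why a RASR-side crash shows up in RETURNN only as the `EOFError` above.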

vieting commented 12 months ago

Note that the segmentation fault only occurs with the tf 2.14 image and the RASR built against it. There might be something wrong on that side as well.

With my previous settings (tf 2.13, RASR compiled against tf 2.8), this is the stdout + stderr:

```
vieting@cn-251:/work/asr4/vieting/tmp/20231108_tf213_sprint_op$ ./run_example_patch.sh
RETURNN starting up, version 1.20231108.140626+git.9fe93590, date/time 2023-11-08-17-07-35 (UTC+0100), pid 2131331, cwd /work/asr4/vieting/tmp/20231108_tf213_sprint_op, Python /usr/bin/python3
RETURNN command line options: ['returnn.config']
Hostname: cn-251
MEMORY: main proc python3(2131331) initial: rss=40.9MB pss=40.9MB uss=40.9MB shared=4.0KB
MEMORY: total (1 procs): pss=40.9MB uss=40.9MB
2023-11-08 17:07:41.035240: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
MEMORY: main proc python3(2131331) increased RSS: rss=212.4MB pss=212.4MB uss=212.4MB shared=4.0KB
MEMORY: total (1 procs): pss=212.4MB uss=212.4MB
MEMORY: main proc python3(2131331) increased RSS: rss=283.6MB pss=283.6MB uss=283.6MB shared=4.0KB
MEMORY: total (1 procs): pss=283.6MB uss=283.6MB
MEMORY: main proc python3(2131331) increased RSS: rss=420.4MB pss=419.8MB uss=419.4MB shared=0.9MB
MEMORY: total (1 procs): pss=419.8MB uss=419.4MB
/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py:2258: SyntaxWarning: "is not" with a literal. Did you mean "!="?
  if dim is not 1:
/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py:6254: SyntaxWarning: "is" with a literal. Did you mean "=="?
  if start is 0 and stop is None:
TensorFlow: 2.13.0 (v2.13.0-rc2-7-g1cb1a030a62) ( in /usr/local/lib/python3.8/dist-packages/tensorflow)
Use num_threads=1 (but min 2) via OMP_NUM_THREADS.
Setup TF inter and intra global thread pools, num_threads 2, session opts {'log_device_placement': False, 'device_count': {'GPU': 0}, 'intra_op_parallelism_threads': 2, 'inter_op_parallelism_threads': 2}.
CUDA_VISIBLE_DEVICES is set to '2'.
Collecting TensorFlow device list...
2023-11-08 17:08:04.048461: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /device:GPU:0 with 10396 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1
Local devices available to TensorFlow:
  1/2: name: "/device:CPU:0" device_type: "CPU" memory_limit: 268435456 locality { } incarnation: 12364557139125826212 xla_global_id: -1
  2/2: name: "/device:GPU:0" device_type: "GPU" memory_limit: 10901061632 locality { bus_id: 2 numa_node: 1 links { } } incarnation: 14856658680689284311 physical_device_desc: "device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1" xla_global_id: 416903419
Using gpu device 2: NVIDIA GeForce GTX 1080 Ti
Hostname 'cn-251', GPU 2, GPU-dev-name 'NVIDIA GeForce GTX 1080 Ti', GPU-memory 10.2GB
MEMORY: main proc python3(2131331) increased RSS: rss=1.1GB pss=1.0GB uss=1.0GB shared=5.5MB
MEMORY: total (1 procs): pss=1.0GB uss=1.0GB
Train data:
  input: 1 x 1
  output: {'raw': {'dtype': 'string', 'shape': ()}, 'orth': [256, 1], 'data': [1, 2]}
  OggZipDataset, sequences: 249229, frames: unknown
Dev data:
MEMORY: main proc python3(2131331) increased RSS: rss=1.7GB pss=1.7GB uss=1.7GB shared=5.5MB
MEMORY: total (1 procs): pss=1.7GB uss=1.7GB
  OggZipDataset, sequences: 300, frames: unknown
Learning-rate-control: file learning_rates.swb.ctc does not exist yet
Setup TF session with options {'log_device_placement': False, 'device_count': {'GPU': 1}} ...
2023-11-08 17:08:13.177173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 10396 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1
layer /'data': [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)] float32
layer /features/'conv_h_filter': ['conv_h_filter:static:0'(128),'conv_h_filter:static:1'(1),F|F'conv_h_filter:static:2'(150)] float32
layer /features/'conv_h': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_act': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F|F'conv_h:channel'(150)] float32
layer /features/'conv_h_split': [B,T|'⌈((-63+time:var:extern_data:data)+-64)/5⌉'[B],F'conv_h:channel'(150),F|F'conv_h_split_split_dims1'(1)] float32
DEPRECATION WARNING: Explicitly specify in_spatial_dims when there is more than one spatial dim in the input.
This will be disallowed with behavior_version 8.
layer /features/'conv_l': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel'(150),F|F'conv_l:channel'(5)] float32
layer /features/'conv_l_merge': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
DEPRECATION WARNING: MergeDimsLayer, only keep_order=True is allowed
This will be disallowed with behavior_version 6.
layer /features/'conv_l_act_no_norm': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'conv_l_act': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /features/'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'features': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
layer /'specaug': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F|F'conv_h:channel*conv_l:channel'(750)] float32
WARNING:tensorflow:From /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py:2462: calling Zeros.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
layer /'conv_source': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_source_split_dims1'(1)] float32
layer /'conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],F'conv_h:channel*conv_l:channel'(750),F|F'conv_1:channel'(32)] float32
layer /'conv_1_pool': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/16⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_1:channel'(32)] float32
layer /'conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/32⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_2:channel'(64)] float32
layer /'conv_3': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'conv_h:channel*conv_l:channel//2'(375),F|F'conv_3:channel'(64)] float32
layer /'conv_merged': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conv_h:channel*conv_l:channel//2)*conv_3:channel'(24000)] float32
layer /'input_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'input_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'input_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_linear_swish:feature-dense'(2048)] float32
layer /'conformer_1_ffmod_1_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_ffmod_1_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_conv_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_1_dropout_linear:feature-dense'(512)] float32
layer /'conformer_1_conv_mod_pointwise_conv_1': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_pointwise_conv_1:feature-dense'(1024)] float32
layer /'conformer_1_conv_mod_glu': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'(conformer_1_conv_mod_pointwise_conv_1:feature-dense)//2'(512)] float32
layer /'conformer_1_conv_mod_depthwise_conv': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_bn': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
DEPRECATION WARNING: batch_norm masked_time should be specified explicitly
This will be disallowed with behavior_version 12.
WARNING:tensorflow:From /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py:1725: calling Ones.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
layer /'conformer_1_conv_mod_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_pointwise_conv_2': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_conv_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_mhsa_mod_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_conv_mod_depthwise_conv:channel'(512)] float32
layer /'conformer_1_mhsa_mod_relpos_encoding': [T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_relpos_encoding_rel_pos_enc_feat'(64)] float32
layer /'conformer_1_mhsa_mod_self_attention':
[B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_att_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_mhsa_mod_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_ln': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_mhsa_mod_self_attention_self_att_feat'(512)] float32 layer /'conformer_1_ffmod_2_linear_swish': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_linear_swish:feature-dense'(2048)] float32 layer /'conformer_1_ffmod_2_dropout_linear': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_dropout': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_ffmod_2_half_res_add': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'conformer_1_output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 layer /'encoder': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'conformer_1_ffmod_2_dropout_linear:feature-dense'(512)] float32 2023-11-08 17:08:14.118488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Created 
device /job:localhost/replica:0/task:0/device:GPU:0 with 10396 MB memory: -> device: 0, name: NVIDIA GeForce GTX 1080 Ti, pci bus id: 0000:81:00.0, compute capability: 6.1 layer /'output': [B,T|'⌈((-19+(⌈((-63+time:var:extern_data:data)+-64)/5⌉))+-20)/64⌉'[B],F|F'output:feature-dense'(88)] float32 WARNING:tensorflow:From /work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py:54: py_func (from tensorflow.python.ops.script_ops) is deprecated and will be removed in a future version. Instructions for updating: tf.py_func is deprecated in TF V2. Instead, there are two options available in V2. - tf.py_function takes a python function which manipulates tf eager tensors instead of numpy arrays. It's easy to convert a tf eager tensor to an ndarray (just call tensor.numpy()) but having access to eager tensors means `tf.py_function`s can use accelerators such as GPUs as well as being differentiable using a gradient tape. - tf.numpy_function maintains the semantics of the deprecated tf.py_func (it is not differentiable, and manipulates numpy arrays). It drops the stateful argument making all functions stateful. 
MEMORY: main proc python3(2131331) increased RSS: rss=1.9GB pss=1.9GB uss=1.8GB shared=31.8MB
MEMORY: total (1 procs): pss=1.9GB uss=1.8GB
OpCodeCompiler call: /usr/local/cuda-11.8/bin/nvcc -shared -O2 -std=c++17 -I /usr/local/lib/python3.8/dist-packages/tensorflow/include -I /usr/local/lib/python3.8/dist-packages/tensorflow/include/external/nsync/public -ccbin /usr/bin/gcc -I /usr/local/cuda-11.8/targets/x86_64-linux/include -I /usr/local/cuda-11.8/include -L /usr/local/cuda-11.8/lib64 -x cu -v -DGOOGLE_CUDA=1 -Xcompiler -fPIC -Xcompiler -v -arch compute_61 -I /usr/local/lib/python3.8/dist-packages/tensorflow/include/third_party/gpus/cuda/include -D_GLIBCXX_USE_CXX11_ABI=1 -DNDEBUG=1 -g /var/tmp/vieting/returnn_tf_cache/ops/FastBaumWelchOp/b50a371e1a/FastBaumWelchOp.cc -o /var/tmp/vieting/returnn_tf_cache/ops/FastBaumWelchOp/b50a371e1a/FastBaumWelchOp.so -L/usr/local/lib/python3.8/dist-packages/scipy.libs -l:libopenblasp-r0-41284840.3.18.so -L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2
MEMORY: sub proc nvcc(2131947) initial: rss=3.4MB pss=2.0MB uss=0.9MB shared=2.5MB
MEMORY: total (2 procs): pss=1.9GB uss=1.8GB
MEMORY: sub proc nvcc(2131947) increased RSS: rss=3.5MB pss=2.1MB uss=1.5MB shared=2.0MB
MEMORY: sub proc sh(2131954) initial: rss=1.6MB pss=603.0KB uss=236.0KB shared=1.3MB
MEMORY: sub proc cicc(2131955) initial: rss=257.3MB pss=255.3MB uss=254.2MB shared=3.0MB
MEMORY: total (4 procs): pss=2.1GB uss=2.1GB
MEMORY: sub proc cicc(2131955) increased RSS: rss=1.0GB pss=1.0GB uss=1.0GB shared=3.0MB
MEMORY: total (4 procs): pss=2.9GB uss=2.9GB
MEMORY: proc (2131954) exited, old: rss=1.6MB pss=603.0KB uss=236.0KB shared=1.3MB
MEMORY: proc cicc(2131955) exited, old: rss=1.0GB pss=1.0GB uss=1.0GB shared=3.0MB
MEMORY: sub proc sh(2131963) initial: rss=1.6MB pss=605.0KB uss=228.0KB shared=1.4MB
MEMORY: sub proc cudafe++(2131964) initial: rss=229.5MB pss=228.3MB uss=227.8MB shared=1.7MB
MEMORY: total (4 procs): pss=2.1GB uss=2.1GB
MEMORY: sub proc cudafe++(2131964) increased RSS: rss=1.1GB pss=1.1GB uss=1.1GB shared=1.7MB
MEMORY: total (4 procs): pss=3.0GB uss=2.9GB
MEMORY: proc (2131963) exited, old: rss=1.6MB pss=605.0KB uss=228.0KB shared=1.4MB
MEMORY: proc cudafe++(2131964) exited, old: rss=1.1GB pss=1.1GB uss=1.1GB shared=1.7MB
MEMORY: sub proc nvcc(2131947) increased RSS: rss=3.6MB pss=2.1MB uss=1.5MB shared=2.0MB
MEMORY: sub proc sh(2131969) initial: rss=1.7MB pss=552.0KB uss=224.0KB shared=1.5MB
MEMORY: sub proc gcc(2131970) initial: rss=2.6MB pss=1.4MB uss=1.0MB shared=1.6MB
MEMORY: sub proc cc1plus(2131971) initial: rss=397.0MB pss=395.4MB uss=394.8MB shared=2.2MB
MEMORY: total (5 procs): pss=2.2GB uss=2.2GB
MEMORY: sub proc cc1plus(2131971) increased RSS: rss=0.8GB pss=0.8GB uss=0.8GB shared=2.2MB
MEMORY: total (5 procs): pss=2.7GB uss=2.7GB
Network layer topology:
extern data: data: Tensor{[B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}, seq_tag: Tensor{[B?], dtype='string'}
used data keys: ['data', 'seq_tag']
layers:
layer batch_norm 'conformer_1_conv_mod_bn' #: 512
layer conv 'conformer_1_conv_mod_depthwise_conv' #: 512
layer copy 'conformer_1_conv_mod_dropout' #: 512
layer gating 'conformer_1_conv_mod_glu' #: 512
layer layer_norm 'conformer_1_conv_mod_ln' #: 512
layer linear 'conformer_1_conv_mod_pointwise_conv_1' #: 1024
layer linear 'conformer_1_conv_mod_pointwise_conv_2' #: 512
layer combine 'conformer_1_conv_mod_res_add' #: 512
layer activation 'conformer_1_conv_mod_swish' #: 512
layer copy 'conformer_1_ffmod_1_dropout' #: 512
layer linear 'conformer_1_ffmod_1_dropout_linear' #: 512
layer eval 'conformer_1_ffmod_1_half_res_add' #: 512
layer linear 'conformer_1_ffmod_1_linear_swish' #: 2048
layer layer_norm 'conformer_1_ffmod_1_ln' #: 512
layer copy 'conformer_1_ffmod_2_dropout' #: 512
layer linear 'conformer_1_ffmod_2_dropout_linear' #: 512
layer eval 'conformer_1_ffmod_2_half_res_add' #: 512
layer linear 'conformer_1_ffmod_2_linear_swish' #: 2048
layer layer_norm 'conformer_1_ffmod_2_ln' #: 512
layer linear 'conformer_1_mhsa_mod_att_linear' #: 512
layer copy 'conformer_1_mhsa_mod_dropout' #: 512
layer layer_norm 'conformer_1_mhsa_mod_ln' #: 512
layer relative_positional_encoding 'conformer_1_mhsa_mod_relpos_encoding' #: 64
layer combine 'conformer_1_mhsa_mod_res_add' #: 512
layer self_attention 'conformer_1_mhsa_mod_self_attention' #: 512
layer layer_norm 'conformer_1_output' #: 512
layer conv 'conv_1' #: 32
layer pool 'conv_1_pool' #: 32
layer conv 'conv_2' #: 64
layer conv 'conv_3' #: 64
layer merge_dims 'conv_merged' #: 24000
layer split_dims 'conv_source' #: 1
layer source 'data' #: 1
layer copy 'encoder' #: 512
layer subnetwork 'features' #: 750
layer conv 'features/conv_h' #: 150
layer eval 'features/conv_h_act' #: 150
layer variable 'features/conv_h_filter' #: 150
layer split_dims 'features/conv_h_split' #: 1
layer conv 'features/conv_l' #: 5
layer layer_norm 'features/conv_l_act' #: 750
layer eval 'features/conv_l_act_no_norm' #: 750
layer merge_dims 'features/conv_l_merge' #: 750
layer copy 'features/output' #: 750
layer copy 'input_dropout' #: 512
layer linear 'input_linear' #: 512
layer softmax 'output' #: 88
layer eval 'specaug' #: 750
net params #: 18473980
net trainable params: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ]
2023-11-08 17:09:01.409733: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:375] MLIR V1 optimization pass is not enabled
start training at epoch 1
using batch size: {'classes': 5000, 'data': 400000}, max seqs: 128
learning rate control: NewbobMultiEpoch(num_epochs=6, update_interval=1, relative_error_threshold=-0.01, relative_error_grow_threshold=-0.01), epoch data: 1: EpochData(learningRate=1.325e-05, error={}), 2: EpochData(learningRate=1.539861111111111e-05, error={}), 3: EpochData(learningRate=1.754722222222222e-05, error={}), ..., 360: EpochData(learningRate=1.4333333333333375e-05, error={}), 361: EpochData(learningRate=1.2166666666666727e-05, error={}), 362: EpochData(learningRate=1e-05, error={}), error key: None
pretrain: None
MEMORY: proc (2131947) exited, old: rss=3.6MB pss=2.1MB uss=1.5MB shared=2.0MB
MEMORY: proc (2131969) exited, old: rss=1.7MB pss=552.0KB uss=224.0KB shared=1.5MB
MEMORY: proc (2131970) exited, old: rss=2.6MB pss=1.4MB uss=1.0MB shared=1.6MB
MEMORY: proc cc1plus(2131971) exited, old: rss=0.8GB pss=0.8GB uss=0.8GB shared=2.2MB
MEMORY: main proc python3(2131331) increased RSS: rss=2.3GB pss=2.3GB uss=2.3GB shared=6.4MB
MEMORY: total (1 procs): pss=2.3GB uss=2.3GB
start epoch 1 with learning rate 1.325e-05 ...
TF: log_dir: output/models/train-2023-11-08-16-07-34
Create optimizer with options {'epsilon': 1e-08, 'learning_rate': }.
Initialize optimizer (default) with slots ['m', 'v'].
These additional variable were created by the optimizer: [, , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , ].
2023-11-08 17:09:08.816918: W tensorflow/c/c_api.cc:304] Operation '{name:'global_step' id:161 op device:{requested: '/device:CPU:0', assigned: ''} def:{{{node global_step}} = VarHandleOp[_class=["loc:@global_step"], _has_manual_control_dependencies=true, allowed_devices=[], container="", dtype=DT_INT64, shape=[], shared_name="global_step", _device="/device:CPU:0"]()}}' was changed by setting attribute after it was run by a session. This mutation will have no effect, and will trigger an error in the future. Either don't modify nodes after running them or create a new session.
OpCodeCompiler call: /usr/local/cuda-11.8/bin/nvcc -shared -O2 -std=c++17 -I /usr/local/lib/python3.8/dist-packages/tensorflow/include -I /usr/local/lib/python3.8/dist-packages/tensorflow/include/external/nsync/public -ccbin /usr/bin/gcc -I /usr/local/cuda-11.8/targets/x86_64-linux/include -I /usr/local/cuda-11.8/include -L /usr/local/cuda-11.8/lib64 -x cu -v -DGOOGLE_CUDA=1 -Xcompiler -fPIC -Xcompiler -v -I /usr/local/lib/python3.8/dist-packages/tensorflow/include/third_party/gpus/cuda/include -D_GLIBCXX_USE_CXX11_ABI=1 -DNDEBUG=1 -g /var/tmp/vieting/returnn_tf_cache/ops/DevMaxBytesInUse/5fd1f0202b/DevMaxBytesInUse.cc -o /var/tmp/vieting/returnn_tf_cache/ops/DevMaxBytesInUse/5fd1f0202b/DevMaxBytesInUse.so -L/usr/local/lib/python3.8/dist-packages/tensorflow -l:libtensorflow_framework.so.2 MEMORY: main proc python3(2131331) increased RSS: rss=2.6GB pss=2.6GB uss=2.5GB shared=8.8MB MEMORY: sub proc nvcc(2131988) initial: rss=3.5MB pss=2.0MB uss=1.5MB shared=2.1MB MEMORY: sub proc sh(2131991) initial: rss=1.6MB pss=565.0KB uss=256.0KB shared=1.4MB MEMORY: sub proc gcc(2131992) initial: rss=2.5MB pss=1.3MB uss=1.0MB shared=1.6MB MEMORY: sub proc cc1plus(2131993) initial: rss=43.0MB pss=41.4MB uss=40.9MB shared=2.2MB MEMORY: total (5 procs): pss=2.6GB uss=2.6GB MEMORY: proc sh(2131991) exited, old: rss=1.6MB pss=565.0KB uss=256.0KB shared=1.4MB MEMORY: proc gcc(2131992) exited, old: rss=2.5MB pss=1.3MB uss=1.0MB shared=1.6MB MEMORY: proc cc1plus(2131993) exited, old: rss=43.0MB pss=41.4MB uss=40.9MB shared=2.2MB MEMORY: sub proc sh(2131994) initial: rss=1.7MB pss=633.0KB uss=232.0KB shared=1.5MB MEMORY: sub proc cicc(2131995) initial: rss=736.6MB pss=734.8MB uss=733.8MB shared=2.9MB MEMORY: total (4 procs): pss=3.3GB uss=3.3GB MEMORY: proc sh(2131994) exited, old: rss=1.7MB pss=633.0KB uss=232.0KB shared=1.5MB MEMORY: proc cicc(2131995) exited, old: rss=736.6MB pss=734.8MB uss=733.8MB shared=2.9MB MEMORY: sub proc nvcc(2131988) increased RSS: rss=3.6MB pss=2.2MB 
uss=1.5MB shared=2.0MB MEMORY: sub proc sh(2132005) initial: rss=1.6MB pss=613.0KB uss=232.0KB shared=1.4MB MEMORY: sub proc cudafe++(2132006) initial: rss=242.1MB pss=241.0MB uss=240.5MB shared=1.6MB MEMORY: total (4 procs): pss=2.8GB uss=2.8GB MEMORY: proc sh(2132005) exited, old: rss=1.6MB pss=613.0KB uss=232.0KB shared=1.4MB MEMORY: proc cudafe++(2132006) exited, old: rss=242.1MB pss=241.0MB uss=240.5MB shared=1.6MB MEMORY: sub proc sh(2132007) initial: rss=1.6MB pss=531.0KB uss=224.0KB shared=1.4MB MEMORY: sub proc gcc(2132008) initial: rss=2.6MB pss=1.4MB uss=1.0MB shared=1.6MB MEMORY: sub proc cc1plus(2132009) initial: rss=121.0MB pss=119.5MB uss=119.0MB shared=2.1MB MEMORY: total (5 procs): pss=2.7GB uss=2.7GB MEMORY: sub proc cc1plus(2132009) increased RSS: rss=515.9MB pss=514.4MB uss=513.9MB shared=2.1MB MEMORY: total (5 procs): pss=3.1GB uss=3.1GB SprintSubprocessInstance: exec ['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:35,p2c_fd:36,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', 
'--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', 
'--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1'] SprintSubprocessInstance: starting, pid 2132023 MEMORY: proc (2131988) exited, old: rss=3.6MB pss=2.2MB uss=1.5MB shared=2.0MB MEMORY: proc (2132007) exited, old: rss=1.6MB pss=531.0KB uss=224.0KB shared=1.4MB MEMORY: proc (2132008) exited, old: rss=2.6MB pss=1.4MB uss=1.0MB shared=1.6MB MEMORY: proc cc1plus(2132009) exited, old: rss=515.9MB pss=514.4MB uss=513.9MB shared=2.1MB /work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard: error while loading shared libraries: libtensorflow_cc.so.2: cannot open shared object file: No such file or directory SprintSubprocessInstance: Sprint child process (['/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', '--*.python-control-enabled=true', '--*.pymod-path=/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn', '--*.pymod-name=returnn.sprint.control', '--*.pymod-config=c2p_fd:35,p2c_fd:36,minPythonControlVersion:4', '--*.configuration.channel=output-channel', '--*.real-time-factor.channel=output-channel', '--*.system-info.channel=output-channel', '--*.time.channel=output-channel', '--*.version.channel=output-channel', '--*.log.channel=output-channel', '--*.warning.channel=output-channel,', 'stderr', '--*.error.channel=output-channel,', 'stderr', '--*.statistics.channel=output-channel', '--*.progress.channel=output-channel', '--*.dot.channel=nil', '--*.corpus.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/datasets/switchboard/CreateSwitchboardBlissCorpusJob.Z1EMi4TdrUS6/output/swb.corpus.xml.gz', '--*.corpus.segments.file=/u/vieting/setups/swb/20230406_feat/work/i6_core/corpus/filter/FilterSegmentsByListJob.nrKcBIdsMBZm/output/segments.1', 
'--*.model-combination.lexicon.file=/u/vieting/setups/swb/20230406_feat/work/i6_experiments/users/berger/recipe/lexicon/modification/MakeBlankLexiconJob.N8RlHYKzilei/output/lexicon.xml', '--*.model-combination.acoustic-model.state-tying.type=lookup', '--*.model-combination.acoustic-model.state-tying.file=/u/vieting/setups/swb/20230406_feat/dependencies/state-tying_blank', '--*.model-combination.acoustic-model.allophones.add-from-lexicon=no', '--*.model-combination.acoustic-model.allophones.add-all=yes', '--*.model-combination.acoustic-model.allophones.add-from-file=/u/vieting/setups/swb/20230406_feat/dependencies/allophones_blank', '--*.model-combination.acoustic-model.hmm.states-per-phone=1', '--*.model-combination.acoustic-model.hmm.state-repetitions=1', '--*.model-combination.acoustic-model.hmm.across-word-model=yes', '--*.model-combination.acoustic-model.hmm.early-recombination=no', '--*.model-combination.acoustic-model.tdp.scale=1.0', '--*.model-combination.acoustic-model.tdp.*.loop=0.0', '--*.model-combination.acoustic-model.tdp.*.forward=0.0', '--*.model-combination.acoustic-model.tdp.*.skip=infinity', '--*.model-combination.acoustic-model.tdp.*.exit=0.0', '--*.model-combination.acoustic-model.tdp.silence.loop=0.0', '--*.model-combination.acoustic-model.tdp.silence.forward=0.0', '--*.model-combination.acoustic-model.tdp.silence.skip=infinity', '--*.model-combination.acoustic-model.tdp.silence.exit=0.0', '--*.model-combination.acoustic-model.tdp.entry-m1.loop=infinity', '--*.model-combination.acoustic-model.tdp.entry-m2.loop=infinity', '--*.model-combination.acoustic-model.phonology.history-length=0', '--*.model-combination.acoustic-model.phonology.future-length=0', '--*.transducer-builder-filter-out-invalid-allophones=yes', '--*.fix-allophone-context-at-word-boundaries=yes', '--*.allophone-state-graph-builder.topology=ctc', '--*.allow-for-silence-repetitions=no', '--action=python-control', '--python-control-loop-type=python-control-loop', 
'--extract-features=no', '--*.encoding=UTF-8', '--*.output-channel.file=$(LOGFILE)', '--*.output-channel.compressed=no', '--*.output-channel.append=no', '--*.output-channel.unbuffered=no', '--*.LOGFILE=nn-trainer.loss.log', '--*.TASK=1']) caused an exception. MEMORY: total (1 procs): pss=2.6GB uss=2.5GB EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in SprintSubprocessInstance._start_child line: ret = self._read() locals: ret = self = self._read = > File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in SprintSubprocessInstance._read line: return util.read_pickled_object(p) locals: util = util.read_pickled_object = p = <_io.FileIO name=34 mode='rb' closefd=True> File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object line: size_raw = read_bytes_to_new_buffer(p, 4).getvalue() locals: size_raw = read_bytes_to_new_buffer = p = <_io.FileIO name=34 mode='rb' closefd=True> getvalue = File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer line: raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) locals: EOFError = size = 4 read_size = 0 EOFError: expected to read 4 bytes but got EOF after 0 bytes Exception in py_wrap_get_sprint_automata_for_batch: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in SprintSubprocessInstance._start_child line: ret = self._read() locals: ret = self = self._read = > File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in SprintSubprocessInstance._read line: return util.read_pickled_object(p) locals: util = util.read_pickled_object = p = <_io.FileIO name=34 mode='rb' 
closefd=True> File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object line: size_raw = read_bytes_to_new_buffer(p, 4).getvalue() locals: size_raw = read_bytes_to_new_buffer = p = <_io.FileIO name=34 mode='rb' closefd=True> getvalue = File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer line: raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) locals: EOFError = size = 4 read_size = 0 EOFError: expected to read 4 bytes but got EOF after 0 bytes During handling of the above exception, another exception occurred: EXCEPTION Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in get_sprint_automata_for_batch_op..py_wrap_get_sprint_automata_for_batch line: return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) locals: py_get_sprint_automata_for_batch = sprint_opts = {'sprintExecPath': '/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', 'sprintConfigStr': '--*.configuration.channel=output-channel --*.real-time-factor.channel=output-channel --*.system-info.channel=output-channel --*.time.channel=output-... tags = py_tags = array([b'switchboard-1/sw02721B/sw2721B-ms98-a-0031', b'switchboard-1/sw02427A/sw2427A-ms98-a-0021', b'switchboard-1/sw02848B/sw2848B-ms98-a-0086', b'switchboard-1/sw04037A/sw4037A-ms98-a-0027', b'switchboard-1/sw02370B/sw2370B-ms98-a-0117', b'switchboard-1/sw02... 
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch line: edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) locals: edges = weights = start_end_states = sprint_instance_pool = sprint_instance_pool.get_automata_for_batch = > tags = array([b'switchboard-1/sw02721B/sw2721B-ms98-a-0031', b'switchboard-1/sw02427A/sw2427A-ms98-a-0021', b'switchboard-1/sw02848B/sw2848B-ms98-a-0086', b'switchboard-1/sw04037A/sw4037A-ms98-a-0027', b'switchboard-1/sw02370B/sw2370B-ms98-a-0117', b'switchboard-1/sw02... File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in SprintInstancePool.get_automata_for_batch line: instance = self._get_instance(i) locals: instance = self = self._get_instance = > i = 0 File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in SprintInstancePool._get_instance line: self._maybe_create_new_instance() locals: self = self._maybe_create_new_instance = > File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in SprintInstancePool._maybe_create_new_instance line: self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) locals: self = self.instances = [] self.instances.append = SprintSubprocessInstance = self.sprint_opts = {'sprintExecPath': '/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard', 'sprintConfigStr': '--*.configuration.channel=output-channel --*.real-time-factor.channel=output-channel --*.system-info.channel=output-channel --*.time.channel=output-... 
File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in SprintSubprocessInstance.__init__ line: self.init() locals: self = self.init = > File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in SprintSubprocessInstance.init line: self._start_child() locals: self = self._start_child = > File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in SprintSubprocessInstance._start_child line: raise Exception("SprintSubprocessInstance Sprint init failed") locals: Exception = Exception: SprintSubprocessInstance Sprint init failed 2023-11-08 17:09:37.114349: W tensorflow/core/framework/op_kernel.cc:1816] UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance self._maybe_create_new_instance() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init self._start_child() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed 2023-11-08 17:09:37.114515: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous recv item cancelled. Key hash: 14907759204653744683 2023-11-08 17:09:37.114540: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous recv item cancelled. Key hash: 11924807411687211681 2023-11-08 17:09:37.114558: I tensorflow/core/framework/local_rendezvous.cc:405] Local rendezvous recv item cancelled. 
Key hash: 4347196143763668518
MEMORY: main proc python3(2131331) increased RSS: rss=2.7GB pss=2.7GB uss=2.7GB shared=6.4MB
MEMORY: total (1 procs): pss=2.7GB uss=2.7GB
MEMORY: main proc python3(2131331) increased RSS: rss=2.8GB pss=2.8GB uss=2.8GB shared=6.4MB
MEMORY: total (1 procs): pss=2.8GB uss=2.8GB
MEMORY: main proc python3(2131331) increased RSS: rss=3.0GB pss=3.0GB uss=3.0GB shared=6.4MB
MEMORY: total (1 procs): pss=3.0GB uss=3.0GB
2023-11-08 17:09:51.148252: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:432] Loaded cuDNN version 8600
MEMORY: main proc python3(2131331) increased RSS: rss=3.2GB pss=3.2GB uss=3.2GB shared=6.4MB
MEMORY: total (1 procs): pss=3.2GB uss=3.2GB
MEMORY: main proc python3(2131331) increased RSS: rss=3.4GB pss=3.4GB uss=3.4GB shared=6.4MB
MEMORY: total (1 procs): pss=3.4GB uss=3.4GB
TensorFlow exception: Graph execution error:

Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "./returnn/rnn.py", line 11, in 
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater =
    Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch'
2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
  File "./returnn/rnn.py", line 11, in 
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op
    edges, weights, start_end_states = tf_compat.v1.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
    return dispatch_target(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func
    return py_func_common(func, inp, Tout, stateful, name=name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common
    return _internal_py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func
    result = gen_script_ops.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
    ret = Operation.from_node_def(
Exception UnknownError() in step 0. (pid 2131331)
Failing op: 
We tried to fetch the op inputs ([]) but got another exception: target_op , ops []
EXCEPTION
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
    line: return fn(*args)
    locals:
      fn = ._run_fn at 0x7fdc85e97b80>
      args = ({: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.00...
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.._run_fn
    line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      self = 
      self._call_tf_sessionrun = >
      options = None
      feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetch_list = [, , , [, ]
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
    line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
    locals:
      tf_session = 
      tf_session.TF_SessionRun_wrapper = 
      self = 
      self._session = 
      options = None
      feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetch_list = [, , , [, ]
      run_metadata = None
UnknownError: 2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

EXCEPTION
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 744, in Runner.run
    line: fetches_results = sess.run(
            fetches_dict, feed_dict=feed_dict, options=run_options
          )  # type: typing.Dict[str,typing.Union[numpy.ndarray,str]]
    locals:
      fetches_results = 
      sess = 
      sess.run = >
      fetches_dict = {'size:data:0': , 'loss': , 'cost:output': , 'loss_norm_..., len = 8
      feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      options = 
      run_options = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 969, in BaseSession.run
    line: result = self._run(None, fetches, feed_dict, options_ptr, run_metadata_ptr)
    locals:
      result = 
      self = 
      self._run = >
      fetches = {'size:data:0': , 'loss': , 'cost:output': , 'loss_norm_..., len = 8
      feed_dict = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      options_ptr = None
      run_metadata_ptr = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1192, in BaseSession._run
    line: results = self._do_run(handle, final_targets, final_fetches, feed_dict_tensor, options, run_metadata)
    locals:
      results = 
      self = 
      self._do_run = >
      handle = None
      final_targets = [, ]
      final_fetches = [, , , {>: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049...
      options = None
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1372, in BaseSession._do_run
    line: return self._do_call(_run_fn, feeds, fetches, targets, options, run_metadata)
    locals:
      self = 
      self._do_call = >
      _run_fn = ._run_fn at 0x7fdc85e97b80>
      feeds = {: array([[[-0.05505638], [-0.09610788], [-0.05115783], ..., [ 0. ], [ 0. ], [ 0. ]], [[-0.00226238], [-0.01049833], [-0.001...
      fetches = [, , , [, ]
      options = None
      run_metadata = None
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1398, in BaseSession._do_call
    line: raise type(e)(node_def, op, message)  # pylint: disable=no-value-for-parameter
    locals:
      type = 
      e = 
      node_def = name: "objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch" op: "PyFunc" input: "extern_data/placeholders/seq_tag/seq_tag" attr { key: "token" value { s: "pyfunc_0" } } attr { key: "Tout" value { list { type: DT_INT32 type: DT_FLOAT type: DT_INT...
      op = 
      message = 'Graph execution error:\n\nDetected at node \'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch\' defined at (most recent call last):\n  File "./returnn/rnn.py", line 11, in \n    main()\n  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__mai..., len = 12234
UnknownError: Graph execution error:

Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last):
  File "./returnn/rnn.py", line 11, in 
    main()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main
    execute_main_task()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task
    engine.init_train_from_config(config, train_data, dev_data, eval_data)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config
    self.init_network_from_config(config)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config
    self._init_network(net_desc=net_dict, epoch=self.epoch)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network
    self.network, self.updater = self.create_network(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network
    updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__
    self.loss = network.get_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective
    self.maybe_construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective
    self._construct_objective()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective
    losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized
    if loss_obj.get_loss_value_for_objective() is not None:
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective
    self._prepare()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare
    self._loss_value = self.loss.get_value()
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value
    fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata(
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata
    edges,
weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' Detected at node 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' defined at (most recent call last): File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. 
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
      ... (EOFError / "SprintSubprocessInstance Sprint init failed" traceback identical to the one above) ...
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
      ... (traceback identical to the one above) ...
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations. 0 derived errors ignored.
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch':
  ... (same frames as in the "defined at" trace above, from "./returnn/rnn.py", line 11 down to returnn/tf/sprint.py, line 54, then:) ...
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler
    return dispatch_target(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func
    return py_func_common(func, inp, Tout, stateful, name=name)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common
    return _internal_py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func
    result = gen_script_ops.py_func(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func
    _, _, _op, _outputs = _op_def_library._apply_op_helper(
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper
    op = g._create_op_internal(op_type_name, inputs, dtypes=None,
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal
    ret = Operation.from_node_def(

During handling of the above exception, another exception occurred:

EXCEPTION
Traceback (most recent call last):
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4341, in help_on_tf_exception
    line: debug_fetch, fetch_helpers, op_copied = FetchHelper.copy_graph(
              debug_fetch, target_op=op, fetch_helper_tensors=list(op.inputs),
              stop_at_ts=stop_at_ts, verbose_stream=file)
  File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/util/basic.py", line 7700, in FetchHelper.copy_graph
    line: assert target_op in ops, "target_op %r,\nops\n%s" % (target_op, pformat(ops))
AssertionError: target_op , ops
[]

Step meta information: {
  'seq_idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38],
  'seq_tag': [
    'switchboard-1/sw02721B/sw2721B-ms98-a-0031', 'switchboard-1/sw02427A/sw2427A-ms98-a-0021', 'switchboard-1/sw02848B/sw2848B-ms98-a-0086',
    'switchboard-1/sw04037A/sw4037A-ms98-a-0027', 'switchboard-1/sw02370B/sw2370B-ms98-a-0117', 'switchboard-1/sw02145A/sw2145A-ms98-a-0107',
    'switchboard-1/sw02484A/sw2484A-ms98-a-0077', 'switchboard-1/sw02768A/sw2768A-ms98-a-0064', 'switchboard-1/sw03312B/sw3312B-ms98-a-0041',
    'switchboard-1/sw02344B/sw2344B-ms98-a-0023', 'switchboard-1/sw04248B/sw4248B-ms98-a-0017', 'switchboard-1/sw02762A/sw2762A-ms98-a-0059',
    'switchboard-1/sw03146A/sw3146A-ms98-a-0047', 'switchboard-1/sw03032A/sw3032A-ms98-a-0065', 'switchboard-1/sw02288A/sw2288A-ms98-a-0080',
    'switchboard-1/sw02751A/sw2751A-ms98-a-0066', 'switchboard-1/sw02369A/sw2369A-ms98-a-0118', 'switchboard-1/sw04169A/sw4169A-ms98-a-0059',
    'switchboard-1/sw02227A/sw2227A-ms98-a-0016', 'switchboard-1/sw02061B/sw2061B-ms98-a-0170', 'switchboard-1/sw02862B/sw2862B-ms98-a-0033',
    'switchboard-1/sw03116B/sw3116B-ms98-a-0065', 'switchboard-1/sw03517B/sw3517B-ms98-a-0038', 'switchboard-1/sw02360B/sw2360B-ms98-a-0086',
    'switchboard-1/sw02510B/sw2510B-ms98-a-0061', 'switchboard-1/sw03919A/sw3919A-ms98-a-0017', 'switchboard-1/sw02965A/sw2965A-ms98-a-0045',
    'switchboard-1/sw03154A/sw3154A-ms98-a-0073', 'switchboard-1/sw02299A/sw2299A-ms98-a-0005', 'switchboard-1/sw04572A/sw4572A-ms98-a-0026',
    'switchboard-1/sw02682A/sw2682A-ms98-a-0022', 'switchboard-1/sw02808A/sw2808A-ms98-a-0014', 'switchboard-1/sw04526A/sw4526A-ms98-a-0026',
    'switchboard-1/sw03180B/sw3180B-ms98-a-0010', 'switchboard-1/sw03227A/sw3227A-ms98-a-0029', 'switchboard-1/sw03891B/sw3891B-ms98-a-0008',
    'switchboard-1/sw03882B/sw3882B-ms98-a-0041', 'switchboard-1/sw03102B/sw3102B-ms98-a-0027', 'switchboard-1/sw02454A/sw2454A-ms98-a-0029']}

Feed dict:
  : int(39)
  : shape (39, 10208, 1), dtype float32, min/max -1.0/1.0, mean/stddev 0.0014351769/0.11459725, Tensor{'data', [B,T|'time:var:extern_data:data'[B],F|F'feature:data'(1)]}
  : shape (39,), dtype int32, min/max 4760/10208, ([ 4760  6246  6372  6861  7296  7499  7534  7622  7824  8031  8295  8431  8690  8675  8667  8886  9084  9199  9163  9156  9274  9262  9540  9668  9678  9719  9711  9902  9989 10010 10020 10073 10006 10102 10131 10112 10130 10178 10208])
  : type , Tensor{'seq_tag', [B?], dtype='string'}
  : bool(True)

EXCEPTION
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1379, in BaseSession._do_call
    line: return fn(*args)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1362, in BaseSession._do_run.<locals>._run_fn
    line: return self._call_tf_sessionrun(options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/client/session.py", line 1455, in BaseSession._call_tf_sessionrun
    line: return tf_session.TF_SessionRun_wrapper(self._session, options, feed_dict, fetch_list, target_list, run_metadata)
UnknownError: 2 root error(s) found.
  (0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
      ... (traceback identical to the one above) ...
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
	 [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]]
  (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed
      ... (traceback identical to the one above) ...
	 [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]]
0 successful operations. 0 derived errors ignored.
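For reference, the `EOFError: expected to read 4 bytes but got EOF after 0 bytes` at the top of the log only says that the Sprint subprocess exited before writing anything back over its pipe, so the real failure is on the RASR side. A minimal sketch of the length-prefixed read that the traceback goes through (a hypothetical simplification of `read_pickled_object` in `returnn/util/basic.py`; the 4-byte little-endian size header is an assumption for illustration):

```python
import os
import pickle
import struct

def read_pickled_object(p):
    # Length-prefixed unpickling as the traceback implies:
    # first a 4-byte size header, then the pickled payload.
    size_raw = p.read(4)
    if len(size_raw) != 4:
        raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (4, len(size_raw)))
    (size,) = struct.unpack("<i", size_raw)
    return pickle.loads(p.read(size))

# A child that dies (e.g. segfaults) before writing anything
# produces exactly the EOF-after-0-bytes error from the log:
r, w = os.pipe()
os.close(w)  # writer side closed without ever writing, like a crashed subprocess
try:
    read_pickled_object(os.fdopen(r, "rb"))
except EOFError as exc:
    print(exc)  # expected to read 4 bytes but got EOF after 0 bytes
```

So the Python-side traceback is only the symptom; the segfault in `Speech::CTCTopologyGraphBuilder::addLoopTransition` from the issue title is what kills the child before it answers.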
line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( Node: 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch' 2 root error(s) found. 
(0) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance self._maybe_create_new_instance() File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init self._start_child() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] [[objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch/_661]] (1) UNKNOWN: Exception: SprintSubprocessInstance Sprint init failed Traceback (most recent call last): File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 165, in _start_child ret = self._read() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 226, in _read return util.read_pickled_object(p) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2629, in read_pickled_object size_raw = read_bytes_to_new_buffer(p, 4).getvalue() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/util/basic.py", line 2612, in read_bytes_to_new_buffer raise EOFError("expected to read %i bytes but got EOF after %i bytes" % (size, read_size)) EOFError: expected to read 4 bytes but got EOF after 0 bytes During handling of the above exception, another exception occurred: Traceback (most recent call last): File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 268, in __call__ ret = func(*args) File 
"/usr/local/lib/python3.8/dist-packages/tensorflow/python/autograph/impl/api.py", line 643, in wrapper return func(*args, **kwargs) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 45, in py_wrap_get_sprint_automata_for_batch return py_get_sprint_automata_for_batch(sprint_opts=sprint_opts, tags=py_tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 20, in py_get_sprint_automata_for_batch edges, weights, start_end_states = sprint_instance_pool.get_automata_for_batch(tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 512, in get_automata_for_batch instance = self._get_instance(i) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 418, in _get_instance self._maybe_create_new_instance() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 406, in _maybe_create_new_instance self.instances.append(SprintSubprocessInstance(**self.sprint_opts)) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 81, in __init__ self.init() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 303, in init self._start_child() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/sprint/error_signals.py", line 170, in _start_child raise Exception("SprintSubprocessInstance Sprint init failed") Exception: SprintSubprocessInstance Sprint init failed [[{{node objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch}}]] 0 successful operations. 0 derived errors ignored. 
Original stack trace for 'objective/loss/loss/FastBaumWelchLoss/get_sprint_automata_for_batch': File "./returnn/rnn.py", line 11, in main() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 634, in main execute_main_task() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/__main__.py", line 439, in execute_main_task engine.init_train_from_config(config, train_data, dev_data, eval_data) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1149, in init_train_from_config self.init_network_from_config(config) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1234, in init_network_from_config self._init_network(net_desc=net_dict, epoch=self.epoch) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1429, in _init_network self.network, self.updater = self.create_network( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/engine.py", line 1491, in create_network updater = Updater(config=config, network=network, initial_learning_rate=initial_learning_rate) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/updater.py", line 172, in __init__ self.loss = network.get_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1552, in get_objective self.maybe_construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1545, in maybe_construct_objective self._construct_objective() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1529, in _construct_objective losses_dict, total_loss, total_constraints = self.get_losses_initialized(with_total=True) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 1499, in get_losses_initialized if loss_obj.get_loss_value_for_objective() is not None: File 
"/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 3957, in get_loss_value_for_objective self._prepare() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/network.py", line 4080, in _prepare self._loss_value = self.loss.get_value() File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/layers/basic.py", line 13165, in get_value fwdbwd, obs_scores = fast_baum_welch_by_sprint_automata( File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/native_op.py", line 1420, in fast_baum_welch_by_sprint_automata edges, weights, start_end_states = get_sprint_automata_for_batch_op(sprint_opts=sprint_opts, tags=tags) File "/work/asr4/vieting/tmp/20231108_tf213_sprint_op/returnn/returnn/tf/sprint.py", line 54, in get_sprint_automata_for_batch_op edges, weights, start_end_states = tf_compat.v1.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/deprecation.py", line 371, in new_func return func(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/traceback_utils.py", line 150, in error_handler return fn(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/util/dispatch.py", line 1176, in op_dispatch_handler return dispatch_target(*args, **kwargs) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 678, in py_func return py_func_common(func, inp, Tout, stateful, name=name) File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 653, in py_func_common return _internal_py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/script_ops.py", line 378, in _internal_py_func result = gen_script_ops.py_func( File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/gen_script_ops.py", line 149, in py_func _, _, _op, _outputs = _op_def_library._apply_op_helper( File 
"/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/op_def_library.py", line 795, in _apply_op_helper op = g._create_op_internal(op_type_name, inputs, dtypes=None, File "/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/ops.py", line 3381, in _create_op_internal ret = Operation.from_node_def( Save model under output/models/epoch.001.crash_0 Trainer not finalized, quitting. (pid 2131331) ```
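For context on the `EOFError` above: RETURNN talks to the RASR subprocess over a pipe using length-prefixed pickled messages, with a 4-byte size prefix before each payload. The sketch below illustrates that framing (it is not the exact RETURNN code; the byte order and helper names are assumptions) and shows why "EOF after 0 bytes" means the child wrote nothing at all, i.e. RASR died during startup:

```python
import io
import pickle
import struct


def write_pickled(f, obj):
    """Write one message: a 4-byte little-endian size prefix, then the pickled payload."""
    data = pickle.dumps(obj)
    f.write(struct.pack("<i", len(data)))
    f.write(data)


def read_pickled(f):
    """Read one message; raise EOFError if the peer closed the pipe before sending anything."""
    size_raw = f.read(4)
    if len(size_raw) != 4:
        raise EOFError("expected to read 4 bytes but got EOF after %i bytes" % len(size_raw))
    (size,) = struct.unpack("<i", size_raw)
    return pickle.loads(f.read(size))


# Normal round trip through an in-memory buffer:
buf = io.BytesIO()
write_pickled(buf, {"cmd": "init"})
buf.seek(0)
print(read_pickled(buf))  # {'cmd': 'init'}

# A child process that exits before writing anything leaves an empty pipe:
try:
    read_pickled(io.BytesIO())
except EOFError as e:
    print(e)  # expected to read 4 bytes but got EOF after 0 bytes
```

So the `EOFError` here is only a symptom; the real question is why the RASR child exits before completing init.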
albertz commented 12 months ago

There it seems that RASR does not start at all. I see:

/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard: error while loading shared libraries: libtensorflow_cc.so.2: cannot open shared object file: No such file or directory
albertz commented 12 months ago

Btw, the RASR segmentation fault looks actually like a bug in RASR. RASR should never segfault.

Marvin84 commented 12 months ago

Most RASR problems result in a segmentation fault. Sometimes you get more info, sometimes it is just an inconsistent compilation.


albertz commented 12 months ago

Whenever RASR gives a segfault, that's a bug in RASR. It should never segfault. Can you link corresponding RASR issues here? Or if this is not reported yet, can you open a corresponding RASR issue?

vieting commented 11 months ago

I created a RASR issue about the segfault in RASR with the tf2.14 image and RASR: https://github.com/rwth-i6/rasr/issues/68

albertz commented 11 months ago

With my previous settings (tf2.13, RASR compiled with tf2.8)

There it seems that RASR does not start at all. I see:

/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard: error while loading shared libraries: libtensorflow_cc.so.2: cannot open shared object file: No such file or directory

@vieting Did you look at that? Did you fix it? Maybe it just needs the right LD_LIBRARY_PATH. Or use some other RASR, maybe one without TF.
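To narrow down the "error while loading shared libraries" failure, one can check what the dynamic loader can resolve before involving RETURNN at all. This is a generic sketch using `ctypes`, not a RETURNN/RASR utility; `dlopen` honors `LD_LIBRARY_PATH`, so exporting the directory containing `libtensorflow_cc.so.2` (from the TF build RASR was compiled against) changes the result:

```python
import ctypes


def can_dlopen(libname):
    """Return True if the dynamic loader can find and load the given shared library."""
    try:
        ctypes.CDLL(libname)
        return True
    except OSError:
        return False


# This is the library the RASR binary fails to load above;
# whether it loads depends on LD_LIBRARY_PATH in the current environment:
print(can_dlopen("libtensorflow_cc.so.2"))
```

Alternatively, `ldd` on the `nn-trainer` binary lists all unresolved dependencies at once.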

vieting commented 11 months ago

With my previous settings (tf2.13, RASR compiled with tf2.8)

There it seems that RASR does not start at all. I see:

/work/asr4/vieting/programs/rasr/20230707/rasr/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard: error while loading shared libraries: libtensorflow_cc.so.2: cannot open shared object file: No such file or directory

@vieting Did you look at that? Did you fix it? Maybe it just needs the right LD_LIBRARY_PATH. Or use some other RASR, maybe one without TF.

I just tried with the tf2.13 image and a RASR that was compiled without TF. There, I also get a segmentation fault. It looks identical to the one in https://github.com/rwth-i6/rasr/issues/68.

Segmentation fault

Creating stack trace (innermost first):
#2  /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7fe88e9ec420]
#3  /lib/x86_64-linux-gnu/libpthread.so.0(raise+0xcb) [0x7fe88e9ec2ab]
#4  /lib/x86_64-linux-gnu/libpthread.so.0(+0x14420) [0x7fe88e9ec420]
#5  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Ftl::TrimAutomaton<Fsa::Automaton>::getState(unsigned int) const+0x3a) [0x55e5abccd12a]
#6  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Ftl::CacheAutomaton<Fsa::Automaton>::getState(unsigned int) const+0x373) [0x55e5abcdd653]
#7  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(+0xa0cdd3) [0x55e5abc54dd3]
#8  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(+0xa0f376) [0x55e5abc57376]
#9  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Am::TransitionModel::apply(Core::Ref<Fsa::Automaton const>, int, bool) const+0x25b) [0x55e5abc4fc2b]
#10  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Am::ClassicTransducerBuilder::applyTransitionModel(Core::Ref<Fsa::Automaton const>)+0x34d) [0x55e5abc436dd]
#11  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Speech::AllophoneStateGraphBuilder::addLoopTransition(Core::Ref<Fsa::Automaton const>)+0x11e) [0x55e5abb05cde]
#12  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Speech::CTCTopologyGraphBuilder::addLoopTransition(Core::Ref<Fsa::Automaton const>)+0x45) [0x55e5abb08e05]
#13  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Speech::CTCTopologyGraphBuilder::buildTransducer(Core::Ref<Fsa::Automaton const>)+0x80) [0x55e5abb0a4a0]
#14  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Speech::AllophoneStateGraphBuilder::buildTransducer(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x60) [0x55e5abb06d90]
#15  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Speech::AllophoneStateGraphBuilder::build(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0x2e) [0x55e5abb087de]
#16  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Nn::AllophoneStateFsaExporter::exportFsaForOrthography(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) const+0x4c) [0x55e5ab9a9d1c]
#17  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Nn::PythonControl::Internal::exportAllophoneStateFsaBySegName(_object*, _object*)+0x108) [0x55e5ab990918]
#18  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Nn::PythonControl::Internal::callback(_object*, _object*)+0x297) [0x55e5ab990fb7]
#19  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8748) [0x7fe88a2f4748]
#20  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyObject_MakeTpCall+0xab) [0x7fe88a2f4b2b]
#21  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74df3) [0x7fe88a0c0df3]
#22  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x7d86) [0x7fe88a0c8ef6]
#23  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7fe88a0cc06b]
#24  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8d37) [0x7fe88a2f4d37]
#25  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyVectorcall_Call+0x60) [0x7fe88a2f4840]
#26  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x590a) [0x7fe88a0c6a7a]
#27  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb) [0x7fe88a216e4b]
#28  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94) [0x7fe88a2f4124]
#29  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8d37) [0x7fe88a2f4d37]
#30  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyVectorcall_Call+0x60) [0x7fe88a2f4840]
#31  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0x590a) [0x7fe88a0c6a7a]
#32  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x8006b) [0x7fe88a0cc06b]
#33  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x74d6d) [0x7fe88a0c0d6d]
#34  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalFrameDefault+0xea8) [0x7fe88a0c2018]
#35  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyEval_EvalCodeWithName+0x8fb) [0x7fe88a216e4b]
#36  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(_PyFunction_Vectorcall+0x94) [0x7fe88a2f4124]
#37  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(+0x2a8d37) [0x7fe88a2f4d37]
#38  /lib/x86_64-linux-gnu/libpython3.8.so.1.0(PyVectorcall_Call+0x60) [0x7fe88a2f4840]
#39  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Python::PyCallKw(_object*, char const*, char const*, ...)+0xe6) [0x55e5abc2ab56]
#40  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Nn::PythonControl::run_control_loop()+0x66) [0x55e5ab984246]
#41  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(NnTrainer::pythonControl()+0x117) [0x55e5ab701ed7]
#42  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(NnTrainer::main(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+0x304) [0x55e5ab6dcdf4]
#43  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Core::Application::run(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)+0x23) [0x55e5ab748913]
#44  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(Core::Application::main(int, char**)+0x5fb) [0x55e5ab6de69b]
#45  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(main+0x3d) [0x55e5ab6dc4ed]
#46  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7fe889b32083]
#47  /u/hilmes/dev/rasr_onnx_115/arch/linux-x86_64-standard/nn-trainer.linux-x86_64-standard(_start+0x2e) [0x55e5ab7016be]
albertz commented 11 months ago

So, on RETURNN/Python side, the last call before the crash is basically:

    def _handle_cmd_export_allophone_state_fsa_by_segment_name(self, segment_name):
        return self.callback("export_allophone_state_fsa_by_segment_name", segment_name)

Everything then happens inside RASR (`callback` is RASR's Python API). Do we know which segment that is? Do you see it in the RASR log? Otherwise, maybe add a print here to show it.
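The suggested debug print could look like the sketch below. `SprintControlStub` is a hypothetical stand-in for illustration only; in the real code the method lives in RETURNN's Sprint control object and `callback` crosses into RASR, which is where the segfault happens:

```python
class SprintControlStub:
    """Hypothetical stand-in for RETURNN's Sprint control object, for illustration only."""

    def callback(self, cmd, *args):
        # In the real code this call crosses into RASR's Python API.
        return ("ok", cmd) + args

    def _handle_cmd_export_allophone_state_fsa_by_segment_name(self, segment_name):
        # Debug print: the last segment name flushed to the log before the crash
        # identifies which segment makes the RASR graph builder segfault.
        print("export_allophone_state_fsa_by_segment_name: %r" % (segment_name,), flush=True)
        return self.callback("export_allophone_state_fsa_by_segment_name", segment_name)


ctrl = SprintControlStub()
print(ctrl._handle_cmd_export_allophone_state_fsa_by_segment_name("corpus/rec1/seg-1"))
```

Since the print is flushed, it survives even when RASR takes the whole process down right afterwards.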

vieting commented 11 months ago

I just saw that I also get the same segmentation fault with the old tf 2.8 image and the RASR compiled without TF. So maybe this is some kind of version mismatch. With the tf 2.8 image and a RASR compiled with that image, the example I created runs properly.

Marvin84 commented 11 months ago

@vieting Which RASR binary are you using? Is it up to date? There was a memory leak bug once we integrated the FSA bug fix and moved the CTC topology under the same subroutine. @SimBe195 fixed this a few months ago.

vieting commented 11 months ago

@vieting Which RASR binary are you using? Is it up to date? There was a memory leak bug once we integrated the FSA bug fix and moved the CTC topology under the same subroutine. @SimBe195 fixed this a few months ago.

You mean this here, right? https://github.com/rwth-i6/rasr/pull/47

The RASR without TF is from Bene on branch add_onnx_support with last commits from August 2023. The RASR for tf2.14 is the current GitHub main branch. Both have the commit from https://github.com/rwth-i6/rasr/pull/47. Also my RASR with tf2.8 has it.

albertz commented 11 months ago

What RASR/TF version combination did actually work before, and which do not?

vieting commented 11 months ago

My tf 2.8 image together with RASR compiled against that image works. All other combinations fail, including the tf 2.14 image from https://github.com/rwth-i6/rasr/pull/64 with RASR compiled against that image.

albertz commented 11 months ago

tf 2.8 image and RASR compiled with that image

And this is the same RASR version as in the other cases?

albertz commented 11 months ago

It seems the RASR bug causing the segfault is fixed by https://github.com/rwth-i6/rasr/pull/50.