rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc.

Loss definition inside module causes `output` layer to be defined twice #187

Closed · JackTemaki closed this issue 2 years ago

JackTemaki commented 2 years ago

A loss.mark_as_loss() call inside a module causes the calling of

  • self.name_ctx.make_all_sub_networks_and_optimize()
  • ctx._make_sub_network_layer(ctx_.layer_ref)
  • nn.copy(sub_output, name=self.get_child("output"))

This means that the "output" layer is already created there, so the assert fails when nn.scoped later tries to create the same layer again.

Maybe this error already vanishes with #160?

And yes, I could define all losses outside the modules, but @Atticus1806 and I discussed this and we prefer to have losses inside as well. Especially for TTS this makes more sense, because the number and type of losses depend on the model hierarchy (see the sketch below).
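
To illustrate the point (a sketch only; the module, dim, and tensor names are hypothetical, and the nn.Linear signature may differ between returnn_common versions):

class FastSpeechLikeDecoder(nn.Module):
    """Hypothetical TTS decoder whose losses follow its submodule structure."""

    def __init__(self, in_dim: nn.Dim):
        super().__init__()
        self.duration_predictor = nn.Linear(in_dim, nn.FeatureDim("dur", 1))

    @nn.scoped
    def __call__(self, encoded: nn.Tensor, duration_targets: nn.Tensor,
                 time_dim: nn.Dim) -> nn.Tensor:
        durations = self.duration_predictor(encoded)
        # The loss is marked where the prediction is made (shapes simplified):
        # a model variant without a duration predictor would simply not have
        # this loss, so the set of losses follows the model hierarchy.
        diff = durations - duration_targets
        dur_loss = nn.reduce(diff * diff, mode="mean", axis=time_dim)
        dur_loss.mark_as_loss()
        return encoded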

Relevant stack trace of the error:

  File "/u/rossenbach/experiments/tts_asr_2021/work/i6_core/returnn/training/ReturnnTrainingJob.alof7VBNpC1E/output/returnn.config", line 204, in get_network
    line: net = construct_network(epoch, **network_kwargs)
    locals:
      net = <not found>
      construct_network = <global> <function construct_network at 0x1491bbf47af0>
      epoch = <local> 1
      network_kwargs = <global> {'weight_decay': 0.1, 'net_module': <class 'i6_experiments.users.rossenbach.returnn.common_modules.asr_transformer.BLSTMDownsamplingTransformerASR'>, 'audio_data': Data{'audio_features', [B(-1),T|'audio_features_time'[B(-1)],F|F'audio_features_feature'(40)]}, 'label_data': Data{'bpe_labels', [B(-..., len = 8
  File "/u/rossenbach/experiments/tts_asr_2021/work/i6_core/returnn/training/ReturnnTrainingJob.alof7VBNpC1E/output/i6_experiments/users/rossenbach/returnn/common_modules/simple_asr_constructor.py", line 16, in 
construct_network
    line: out = net(
              audio_features=nn.get_extern_data(audio_data),
              labels=nn.get_extern_data(label_data),
              audio_time_dim=audio_time_dim,
              label_time_dim=label_time_dim,
              label_dim=label_dim,
          )
    locals:
      out = <not found>
      net = <local> <BLSTMDownsamplingTransformerASR>
      audio_features = <not found>
      nn = <global> <module 'returnn_common.nn' from '/u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.BG3wpTzkBUD0/output/returnn_common/../returnn_common/nn/__init__.py'>
      nn.get_extern_data = <global> <function get_extern_data at 0x14915888b4c0>
      audio_data = <local> Data{'audio_features', [B(-1),T|'audio_features_time'[B(-1)],F|F'audio_features_feature'(40)]}
      labels = <not found>
      label_data = <local> Data{'bpe_labels', [B(-1),T|'bpe_labels_time'[B(-1)]], dtype='int32', sparse_dim=Dim{F'bpe_labels_indices'(2051)}, available_for_inference=False}
      audio_time_dim = <local> Dim{'audio_features_time'[?]}
      label_time_dim = <local> Dim{'bpe_labels_time'[?]}
      label_dim = <local> Dim{F'bpe_labels_indices'(2051)}
  File "/u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.BG3wpTzkBUD0/output/returnn_common/../returnn_common/nn/naming.py", line 113, in Module.__call__
    line: nn.copy(out, name=name_ctx.get_child("output"))
    locals:
      nn = <global> <module 'returnn_common.nn' from '/u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.BG3wpTzkBUD0/output/returnn_common/../returnn_common/nn/__init__.py'>
      nn.copy = <global> <function copy at 0x149158475a60>
      out = <local> <Tensor /'blstm_downsampling_transformer_asr'/'transformer'/'loop'/'output' [T|'bpe_labels_time'[B(-1)],B(-1),F|F'bpe_labels_indices'(2051)] via 'copy'>
      name = <local> None
      name_ctx = <local> <NameCtx /'blstm_downsampling_transformer_asr' [T|'bpe_labels_time'[B(-1)],B(-1)] module:<BLSTMDownsamplingTransformerASR>>
      name_ctx.get_child = <local> <bound method NameCtx.get_child of <NameCtx /'blstm_downsampling_transformer_asr' [T|'bpe_labels_time'[B(-1)],B(-1)] module:<BLSTMDownsamplingTransformerASR>>>
  File "/u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.BG3wpTzkBUD0/output/returnn_common/../returnn_common/nn/_generated_layers.py", line 31, in copy
    line: return nn.make_layer({
            'class': 'copy',
            'from': source,
            }, name=name or 'copy')
    locals:
      nn = <global> <module 'returnn_common.nn' from '/u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.BG3wpTzkBUD0/output/returnn_common/../returnn_common/nn/__init__.py'>
      nn.make_layer = <global> <function make_layer at 0x1491bbf7f790>
      source = <local> <Tensor /'blstm_downsampling_transformer_asr'/'transformer'/'loop'/'output' [T|'bpe_labels_time'[B(-1)],B(-1),F|F'bpe_labels_indices'(2051)] via 'copy'>
      name = <local> <NameCtx /'blstm_downsampling_transformer_asr'/'output' [T|'bpe_labels_time'[B(-1)],B(-1)]>
  File "/u/rossenbach/experiments/tts_asr_2021/work/i6_core/tools/git/CloneGitRepositoryJob.BG3wpTzkBUD0/output/returnn_common/../returnn_common/nn/base.py", line 730, in make_layer
    line: assert not name_ctx.layer_ref and not name_ctx.layer  # not yet assigned
    locals:
      name_ctx = <local> <NameCtx /'blstm_downsampling_transformer_asr'/'output' [T|'bpe_labels_time'[B(-1)],B(-1)]>
      name_ctx.layer_ref = <local> <Tensor /'blstm_downsampling_transformer_asr'/'output' [T|'bpe_labels_time'[B(-1)],B(-1)] via 'copy'>
      name_ctx.layer = <local> <Tensor /'blstm_downsampling_transformer_asr'/'output' [T|'bpe_labels_time'[B(-1)],B(-1)] via 'copy'>
AssertionError

Relevant code:


class BLSTMDownsamplingTransformerASR(nn.Module):
    [...]    

    @nn.scoped
    def __call__([...]):
        [...]

        encoder_out, out_logits, out_labels, _ = self.transformer([...])

        loss = nn.sparse_softmax_cross_entropy_with_logits(
            logits=out_logits,
            targets=labels,
            axis=label_dim,
        )
        loss.mark_as_loss()

        return out_logits

And the calling code:

def construct_network([...]):
    net = net_module([...])
    out = net([...])
    out.mark_as_default_output()
    [...]
    return net
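
For reference, a minimal self-contained repro of the pattern (which could also serve as the test case discussed below) might look roughly like this. MiniNet and construct_mini_network are hypothetical stand-ins, and the features data is assumed to already have label_dim as an axis so it can play the role of the logits:

class MiniNet(nn.Module):
    @nn.scoped
    def __call__(self, features: nn.Tensor, labels: nn.Tensor,
                 label_dim: nn.Dim) -> nn.Tensor:
        logits = features  # stand-in for the real encoder/decoder stack
        loss = nn.sparse_softmax_cross_entropy_with_logits(
            logits=logits,
            targets=labels,
            axis=label_dim,
        )
        loss.mark_as_loss()  # loss marked inside the module, as above
        return logits


def construct_mini_network(features_data, labels_data, label_dim):
    net = MiniNet()
    # The AssertionError above was raised during this call, when
    # Module.__call__ tried to create the "output" layer a second time.
    out = net(
        features=nn.get_extern_data(features_data),
        labels=nn.get_extern_data(labels_data),
        label_dim=label_dim,
    )
    out.mark_as_default_output()
    return net
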
albertz commented 2 years ago

> A loss.mark_as_loss() call inside a module causes the calling of
>
>   • self.name_ctx.make_all_sub_networks_and_optimize()
>   • ctx._make_sub_network_layer(ctx_.layer_ref)
>   • nn.copy(sub_output, name=self.get_child("output"))
>
> This means that the "output" layer is already created there, so the assert fails when nn.scoped later tries to create the same layer again.

It looks like this is a bug in mark_as_loss, or maybe in make_all_sub_networks_and_optimize.

Can you make a simple test case? In any case, this is something that should always work.

> Maybe this error already vanishes with #160?

Maybe, but please add such a test case anyway, so that we always test this.

albertz commented 2 years ago

Do we have a test case now?

albertz commented 2 years ago

Note that #160 is merged now. While this might have resolved the particular error you were getting, I'm still not sure that the behavior of mark_as_loss is correct. Specifically, if you have a loss only in a sublayer, I'm not sure that RETURNN will always find it. It might find it when it creates the subnetwork layer for other, unrelated reasons, but of course you should not rely on this. We should maybe copy the logic of mark_as_output, so that there is always a reference (nn.copy) in the root. For code in a loop, we need some extra care, like accumulating the loss automatically. I'm not sure what to do inside a cond. A rough sketch of the idea follows.
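
In pseudocode (hypothetical internals: the root accessor, get_new_child, and the layer_dict loss flag are assumptions about the naming API, not the actual implementation):

def mark_as_loss(self: nn.Tensor, loss_scale: float = 1.0):
    root = self.name_ctx.root  # assumption: accessor for the root NameCtx
    # Always create a reference (nn.copy) in the root network, so that
    # RETURNN finds the loss even when it lives inside a subnetwork.
    ref = nn.copy(self, name=root.get_new_child(suggested_name="loss"))
    ref.layer_dict["loss"] = "as_is"  # assumption: how the loss is flagged
    if loss_scale != 1.0:
        ref.layer_dict["loss_scale"] = loss_scale
    # For tensors created inside a loop, the value would additionally need
    # to be accumulated over the loop iterations (not sketched here).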

albertz commented 2 years ago

Ok, I now pushed something which should fix the loss-in-sublayer issue.

If you now run into other errors, please open new issues for them (maybe referencing this one here).