rwth-i6 / returnn

The RWTH extensible training framework for universal recurrent neural networks
http://returnn.readthedocs.io/

Error when setting attention weights layer as output layer #1027

Closed robin-p-schmitt closed 2 years ago

robin-p-schmitt commented 2 years ago

When using the following network dict: https://gist.github.com/robin-p-schmitt/a63bbfd3870935b78c86328a38fae783, I am getting the following error:

  File "/u/schmitt/experiments/transducer/recipe/i6_experiments/users/schmitt/experiments/swb/transducer/tools/dump_attention_weights.py", line 109, in init
    line: rnn.engine.init_network_from_config(net_dict_post_proc=net_dict_add_losses)
    locals:
      rnn = <global> <module 'returnn.__main__' from '/u/schmitt/src/returnn/returnn/__main__.py'>
      rnn.engine = <global> <returnn.tf.engine.Engine object at 0x7ff5f5a47a00>
      rnn.engine.init_network_from_config = <global> <bound method Engine.init_network_from_config of <returnn.tf.engine.Engine object at 0x7ff5f5a47a00>>
      net_dict_post_proc = <not found>
      net_dict_add_losses = <global> <function net_dict_add_losses at 0x7ff5f9e28c10>
...
  File "/u/schmitt/src/returnn/returnn/tf/layers/base.py", line 1342, in LayerBase.get_losses_initialized
    line: return self.__class__.get_losses(reduce_func=reduce_func, layer=self, **self.kwargs)
    locals:
      self = <local> <RecLayer 'label_model' out_type=Data{[T|'label_ground_truth_masked0:masked:time'[B],B], dtype='int32', sparse_dim=Dim{F'label_ground_truth_masked:set-sparse-dim'(1031)}}>
      self.__class__ = <local> <class 'returnn.tf.layers.rec.RecLayer'>
      self.__class__.get_losses = <local> <bound method RecLayer.get_losses of <class 'returnn.tf.layers.rec.RecLayer'>>
      reduce_func = <local> None
      layer = <not found>
      self.kwargs = <local> {'back_prop': True, 'include_eos': True, 'is_output_layer': True, 'name_scope': 'output/rec', '_network': <TFNetwork '' train=<tf.Tensor 'globals/train_flag:0' shape=() dtype=bool>>, '_name': 'label_model', 'n_out': <class 'returnn.util.basic.NotSpecified'>, 'sources': [<SourceLayer 'data:label_g..., len = 13
  File "/u/schmitt/src/returnn/returnn/tf/layers/rec.py", line 725, in RecLayer.get_losses
    line: for loss in template_layer.layer_class_type.get_losses(reduce_func=reduce_func, **template_layer.kwargs):
    locals:
      loss = <local> None
      template_layer = <local> <_TemplateLayer(LengthLayer)(:template:length) label_model/':dyn-tag-accum:1:att_weights' out_type=Data{[B], dtype='int32', ctx=loop('label_ground_truth_masked0:masked:time'[B])} (construction stack None)>
      template_layer.layer_class_type = <local> <class 'returnn.tf.layers.basic.LengthLayer'>
      template_layer.layer_class_type.get_losses = <local> <bound method LayerBase.get_losses of <class 'returnn.tf.layers.basic.LengthLayer'>>
      reduce_func = <local> None
      template_layer.kwargs = <local> {'axis': Dim{'att_t'[B]{ctx=loop('label_ground_truth_masked0:masked:time'[B])}}, '_network': <TFNetwork '/label_model(rec-subnet)' parent_layer=<RecLayer 'label_model' out_type=Data{[T|'label_ground_truth_masked0:masked:time'[B],B], dtype='int32', sparse_dim=Dim{F'label_ground_truth_masked:set-sp..., len = 6
TypeError: get_losses() missing 1 required positional argument: 'network'

The full error log can also be seen in the Gist above.

The error seems to be caused by setting is_output_layer=True in the attention weights layer in combination with having prev:att as input to the lm layer. The relevant layers therefore are:

"lm": {
                "class": "rec",
                "from": ["input_embed", "prev:att"],
                "n_out": 1024,
                "name_scope": "lm/rec",
                "unit": "nativelstm2",
            },
"att_query": {
                "activation": None,
                "class": "linear",
                "from": "lm",
                "is_output_layer": False,
                "n_out": 1024,
                "with_bias": False,
            },
"att_energy_in": {
                "class": "combine",
                "from": ["att_ctx", "att_query"],
                "kind": "add",
                "n_out": 1024,
            },
"energy_tanh": {
                "activation": "tanh",
                "class": "activation",
                "from": ["att_energy_in"],
            },
"att_energy0": {
                "activation": None,
                "class": "linear",
                "from": ["energy_tanh"],
                "n_out": 1,
                "name_scope": "energy",
                "with_bias": False,
            },
"att_energy": {
                "class": "reinterpret_data",
                "from": "att_energy0",
                "is_output_layer": False,
                "set_dim_tags": {
                    "f": Dim(
                        kind=Dim.Types.Spatial, description="att_heads", dimension=1
                    )
                },
            },
"att_weights0": {
                "axis": "stag:att_t",
                "class": "softmax_over_spatial",
                "energy_factor": 0.03125,
                "from": "att_energy",
            },
"att_weights": {
                "class": "dropout",
                "dropout": 0.0,
                "dropout_noise_shape": {"*": None},
                "from": "att_weights0",
                "is_output_layer": False,
            },
"att0": {
                "add_var2_if_empty": False,
                "class": "dot",
                "from": ["att_val_split", "att_weights"],
                "reduce": "stag:att_t",
                "var1": "f",
                "var2": None,
            },
"att": {"axes": "except_time", "class": "merge_dims", "from": "att0"},

In my script, which I use to initialize my network, I call rnn.engine.init_network_from_config(net_dict_post_proc=net_dict_add_losses). The function net_dict_add_losses, despite its name, only sets net_dict["label_model"]["unit"]["att_weights"]["is_output_layer"] = True. Looking at the log, one can see that the network construction works before this flag is set and only fails in the second pass. Therefore it must have something to do with marking att_weights as an output layer.
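In other words, the post-processing hook is roughly equivalent to the following sketch (illustrative only; the function body just restates what is described above, the actual function lives in my setup, not in RETURNN):

def net_dict_add_losses(net_dict):
    # Mark the attention weights layer inside the rec unit as an output layer.
    net_dict["label_model"]["unit"]["att_weights"]["is_output_layer"] = True
    return net_dict

# Passed to the engine as described:
# rnn.engine.init_network_from_config(net_dict_post_proc=net_dict_add_losses)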

I have not yet been able to reproduce the error with a smaller network, but maybe the cause can still be found.

albertz commented 2 years ago

I have not yet been able to reproduce the error with a smaller network, but maybe the cause can still be found.

I don't exactly understand. There are many things which could trivially be removed to make the network smaller, e.g. SpecAugment, using only a single layer for the encoder (or zero layers, i.e. removing all the LSTMs), smaller dimensions, etc. All of these steps would already have been helpful.

Edit: See how I trivially reduced the test case in #1028. This was done just by following what I described, and also removing unused layers, so all without even thinking much about it. It can probably be reduced much more, but at least so far this was really trivial.

albertz commented 2 years ago

Are you working on this now? Edit: I think I fixed it. Wait for the PR. Edit: See #1028.

How urgent is it for your work that this gets fixed? Did you find a workaround? Edit: Maybe just wait until #1028 is merged.

albertz commented 2 years ago

I realized that the way you use dim tags here is wrong. You are creating multiple dim tags with the same description (e.g. att_heads), and I think you want them to be equal, but when you create multiple separate instances, they are in fact not equal. In #1222, this will be fixed, and thus the test case here will also be fixed.
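As a rough sketch of the distinction (illustrative only; the variable names are made up, and it assumes Dim is imported from returnn.tf.util.data as usual for such configs):

from returnn.tf.util.data import Dim

# Two separately created dim tags, even with the same description,
# are separate instances and are not treated as the same dim tag:
att_heads_a = Dim(kind=Dim.Types.Spatial, description="att_heads", dimension=1)
att_heads_b = Dim(kind=Dim.Types.Spatial, description="att_heads", dimension=1)

# If they are meant to refer to the same dim, create the tag once and reuse
# that single instance everywhere in the net dict, e.g. in
# "set_dim_tags": {"f": att_heads_dim} of the att_energy layer above:
att_heads_dim = Dim(kind=Dim.Types.Spatial, description="att_heads", dimension=1)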