rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc

Conformer/Transformer has same initial param value in each layer #216

Closed albertz closed 1 year ago

albertz commented 1 year ago

Now that lazy init has been removed (see #212, #215), all params are always created directly. E.g. in nn.Linear, we have this logic:

self.weight = nn.Parameter((nn.dim_match_priority_when_needed(self.in_dim, self.out_dim), self.out_dim))
self.weight.initial = nn.init.Glorot()

Setting Parameter.initial to a ParamInit instance (such as Glorot) directly calls the ParamInit and assigns the resulting tensor:

  @initial.setter
  def initial(self, value: Optional[Union[nn.Tensor, RawTensorTypes, nn.init.ParamInit]]):
    if isinstance(value, nn.init.ParamInit):
      value = value(shape=self.shape_ordered, dtype=self.dtype)
    ...

Now in Conformer and Transformer (TransformerEncoder, TransformerDecoder), we use copy.deepcopy to stack the layers/blocks. This effectively copies the same already-evaluated Parameter.initial value into each layer, so all layers start from identical parameter values. PyTorch actually has the same problem, as I described here: https://github.com/rwth-i6/returnn_common/issues/109#issuecomment-1268566479, https://github.com/pytorch/pytorch/issues/86274
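
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual returnn_common code; ToyParamInit, ToyParameter and ToyBlock are made-up stand-ins) of what happens when the setter evaluates the init eagerly and the block is then stacked via copy.deepcopy:

import copy
import random


class ToyParamInit:
    """Stand-in for nn.init.ParamInit: draws a fresh random value when called."""
    def __call__(self):
        return random.random()


class ToyParameter:
    """Stand-in for nn.Parameter with an eager initial setter, like the current code."""
    def __init__(self):
        self._initial = None

    @property
    def initial(self):
        return self._initial

    @initial.setter
    def initial(self, value):
        if isinstance(value, ToyParamInit):
            value = value()  # evaluated immediately, so a concrete value is stored
        self._initial = value


class ToyBlock:
    """Stand-in for one Conformer/Transformer block with a single parameter."""
    def __init__(self):
        self.weight = ToyParameter()
        self.weight.initial = ToyParamInit()


block = ToyBlock()
layers = [copy.deepcopy(block) for _ in range(3)]  # same pattern as stacking blocks via deepcopy
print([layer.weight.initial for layer in layers])  # all three values are identical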

A potential solution is to not call the ParamInit directly in the initial setter but to delay the call to some later point. Then a deepcopy would only copy the ParamInit object, not the tensor, and the ParamInit would get called independently for each Parameter copy, which would solve this. It's only a bit unclear when exactly it should be called. It could be in prepare_for_config_serialization, but I'm not sure; that would be a somewhat unexpected side effect of serializing the model.
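
As a rough sketch of the delayed-evaluation idea (again with made-up stand-in classes; prepare_for_config_serialization here is just a hypothetical late evaluation point, not a proposal for the real API):

import copy
import random


class ToyParamInit:
    """Stand-in for nn.init.ParamInit: draws a fresh random value when called."""
    def __call__(self):
        return random.random()


class LazyToyParameter:
    """Stand-in parameter: the setter only stores the ParamInit, evaluation is delayed."""
    def __init__(self):
        self._initial = None

    @property
    def initial(self):
        return self._initial

    @initial.setter
    def initial(self, value):
        self._initial = value  # keep the ParamInit object, do not evaluate it here

    def prepare_for_config_serialization(self):
        # Hypothetical late evaluation point, as suggested above.
        if isinstance(self._initial, ToyParamInit):
            self._initial = self._initial()


param = LazyToyParameter()
param.initial = ToyParamInit()
copies = [copy.deepcopy(param) for _ in range(3)]  # deepcopy only copies the ParamInit object
for p in copies:
    p.prepare_for_config_serialization()  # each copy evaluates its own ParamInit
print([p.initial for p in copies])  # three different values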

albertz commented 1 year ago

Actually, I think we can do both: evaluate the ParamInit directly, but still keep the ParamInit object around, and have the initial getter return it. That way, a copy of a Parameter would get different initial parameter values.
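
A rough sketch of what this could look like (made-up stand-in classes again, not a concrete implementation proposal):

import copy
import random


class ToyParamInit:
    """Stand-in for nn.init.ParamInit: draws a fresh random value when called."""
    def __call__(self):
        return random.random()


class HybridToyParameter:
    """Stand-in parameter: evaluate the ParamInit eagerly, but also keep the ParamInit object."""
    def __init__(self):
        self._initial_value = None  # eagerly evaluated value, usable right away
        self._initial_init = None   # the ParamInit object itself

    @property
    def initial(self):
        # Return the ParamInit if there is one, so later consumers (and deepcopies)
        # can re-evaluate it independently; otherwise return the plain value.
        return self._initial_init if self._initial_init is not None else self._initial_value

    @initial.setter
    def initial(self, value):
        if isinstance(value, ToyParamInit):
            self._initial_init = value
            self._initial_value = value()  # still evaluated directly
        else:
            self._initial_init = None
            self._initial_value = value


param = HybridToyParameter()
param.initial = ToyParamInit()
copies = [copy.deepcopy(param) for _ in range(3)]
# Each copy still carries the ParamInit; evaluating it per copy yields different values
# (the toy init simply re-draws randomness on every call).
print([p.initial() for p in copies])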