rwth-i6 / returnn_common

Common building blocks for RETURNN configs, such as models, training concepts, etc

Conformer/Transformer has same initial param value in each layer #216

Closed albertz closed 1 year ago

albertz commented 1 year ago

Now that lazy init has been removed (see #212, #215), all params are always created directly. E.g. in nn.Linear, we have this logic:

self.weight = nn.Parameter((nn.dim_match_priority_when_needed(self.in_dim, self.out_dim), self.out_dim))
self.weight.initial = nn.init.Glorot()

Setting Parameter.initial to a ParamInit instance (such as Glorot) directly calls the ParamInit and assigns the resulting tensor:

  @initial.setter
  def initial(self, value: Optional[Union[nn.Tensor, RawTensorTypes, nn.init.ParamInit]]):
    if isinstance(value, nn.init.ParamInit):
      value = value(shape=self.shape_ordered, dtype=self.dtype)
    ...

Now in Conformer and Transformer (TransformerEncoder, TransformerDecoder), we use copy.deepcopy to stack the layers/blocks. This effectively copies the same already-evaluated Parameter.initial value into each layer, so all layers start from identical parameter values. PyTorch actually has the same problem, as I described here: https://github.com/rwth-i6/returnn_common/issues/109#issuecomment-1268566479, https://github.com/pytorch/pytorch/issues/86274
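
To make the failure mode concrete, here is a minimal, self-contained sketch (not the actual returnn_common code; ToyParamInit, ToyParameter and ToyBlock are made-up stand-ins) of what happens when the setter evaluates the init eagerly and the block is then stacked via copy.deepcopy:

import copy
import random


class ToyParamInit:
    """Stand-in for nn.init.ParamInit: draws a fresh random value when called."""
    def __call__(self):
        return random.random()


class ToyParameter:
    """Stand-in for nn.Parameter with an eager initial setter, like the current code."""
    def __init__(self):
        self._initial = None

    @property
    def initial(self):
        return self._initial

    @initial.setter
    def initial(self, value):
        if isinstance(value, ToyParamInit):
            value = value()  # evaluated immediately, so a concrete value is stored
        self._initial = value


class ToyBlock:
    """Stand-in for one Conformer/Transformer block with a single parameter."""
    def __init__(self):
        self.weight = ToyParameter()
        self.weight.initial = ToyParamInit()


block = ToyBlock()
layers = [copy.deepcopy(block) for _ in range(3)]  # same pattern as stacking blocks via deepcopy
print([layer.weight.initial for layer in layers])  # all three values are identical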

A potential solution is to not call the ParamInit directly in the initial setter but to delay the call to some later point. Then a deepcopy would only copy the ParamInit object, not the tensor, and the ParamInit would get called independently for each Parameter copy, which would solve this. It's only a bit unclear when exactly it should be called. It could be in prepare_for_config_serialization, but I'm not sure; that would be a somewhat unexpected side effect of serializing the model.
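
As a rough sketch of the delayed-evaluation idea (again with made-up stand-in classes; prepare_for_config_serialization here is just a hypothetical late evaluation point, not a proposal for the real API):

import copy
import random


class ToyParamInit:
    """Stand-in for nn.init.ParamInit: draws a fresh random value when called."""
    def __call__(self):
        return random.random()


class LazyToyParameter:
    """Stand-in parameter: the setter only stores the ParamInit, evaluation is delayed."""
    def __init__(self):
        self._initial = None

    @property
    def initial(self):
        return self._initial

    @initial.setter
    def initial(self, value):
        self._initial = value  # keep the ParamInit object, do not evaluate it here

    def prepare_for_config_serialization(self):
        # Hypothetical late evaluation point, as suggested above.
        if isinstance(self._initial, ToyParamInit):
            self._initial = self._initial()


param = LazyToyParameter()
param.initial = ToyParamInit()
copies = [copy.deepcopy(param) for _ in range(3)]  # deepcopy only copies the ParamInit object
for p in copies:
    p.prepare_for_config_serialization()  # each copy evaluates its own ParamInit
print([p.initial for p in copies])  # three different values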

albertz commented 1 year ago

Actually, I think we can do both: evaluate the ParamInit directly, but still keep the ParamInit object around, and have the initial getter return it. That way, a copy of a Parameter would get different initial parameter values.
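
A rough sketch of what this could look like (made-up stand-in classes again, not a concrete implementation proposal):

import copy
import random


class ToyParamInit:
    """Stand-in for nn.init.ParamInit: draws a fresh random value when called."""
    def __call__(self):
        return random.random()


class HybridToyParameter:
    """Stand-in parameter: evaluate the ParamInit eagerly, but also keep the ParamInit object."""
    def __init__(self):
        self._initial_value = None  # eagerly evaluated value, usable right away
        self._initial_init = None   # the ParamInit object itself

    @property
    def initial(self):
        # Return the ParamInit if there is one, so later consumers (and deepcopies)
        # can re-evaluate it independently; otherwise return the plain value.
        return self._initial_init if self._initial_init is not None else self._initial_value

    @initial.setter
    def initial(self, value):
        if isinstance(value, ToyParamInit):
            self._initial_init = value
            self._initial_value = value()  # still evaluated directly
        else:
            self._initial_init = None
            self._initial_value = value


param = HybridToyParameter()
param.initial = ToyParamInit()
copies = [copy.deepcopy(param) for _ in range(3)]
# Each copy still carries the ParamInit; evaluating it per copy yields different values
# (the toy init simply re-draws randomness on every call).
print([p.initial() for p in copies])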