Actually I think we can do both. Evaluate the `ParamInit` directly, but still keep the `ParamInit` object around, and the `initial` getter would return it. That way, a copy of `Parameter` would have different initial parameters.
Now that the lazy init was removed (see #212, #215), all params are always created directly. E.g. in `nn.Linear`, this logic:
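(The snippet itself did not survive here; the following is a rough reconstruction from memory, so the exact names and signatures may differ from the actual code.)

```python
from returnn_common import nn


class Linear(nn.Module):
    """Simplified: only the parts relevant for param init."""

    def __init__(self, in_dim: nn.Dim, out_dim: nn.Dim):
        super().__init__()
        self.in_dim = in_dim
        self.out_dim = out_dim
        self.weight = nn.Parameter((self.in_dim, self.out_dim))
        # Assigning a ParamInit object goes through the Parameter.initial setter.
        self.weight.initial = nn.init.Glorot()
```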
Setting `Parameter.initial` with a `ParamInit` type (`Glorot`) will directly call the `ParamInit` and then assign the corresponding tensor:
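(Again not the original snippet; the following standalone toy model, not the actual returnn_common code, illustrates the current eager behavior.)

```python
import random


class ParamInit:
    """Toy stand-in for nn.init.ParamInit."""

    def __call__(self, shape):
        raise NotImplementedError


class Glorot(ParamInit):
    """Toy stand-in for nn.init.Glorot: uniform in [-limit, +limit]."""

    def __call__(self, shape):
        limit = (6.0 / sum(shape)) ** 0.5
        num = 1
        for dim in shape:
            num *= dim
        return [random.uniform(-limit, limit) for _ in range(num)]


class Parameter:
    """Toy stand-in for nn.Parameter, modeling the current (eager) behavior."""

    def __init__(self, shape):
        self.shape = shape
        self._initial = None

    @property
    def initial(self):
        return self._initial

    @initial.setter
    def initial(self, value):
        if isinstance(value, ParamInit):
            # The ParamInit is called directly here, so only the evaluated
            # tensor (here: a flat list of values) is stored, not the ParamInit.
            value = value(self.shape)
        self._initial = value
```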
Now in `Conformer` and `Transformer` (`TransformerEncoder`, `TransformerDecoder`), we use `copy.deepcopy` on the layers/blocks. This effectively will copy the same `Parameter.initial` value for each layer. PyTorch actually has the same problem, as I described here: https://github.com/rwth-i6/returnn_common/issues/109#issuecomment-1268566479, https://github.com/pytorch/pytorch/issues/86274
A potential solution is to not call the `ParamInit` directly in the `initial` setter but to delay it to some later point. Then a `deepcopy` would actually only copy the `ParamInit` object but not the tensor, and the `ParamInit` would get called independently for each `Parameter` copy, so this would solve it. It's only a bit unclear when exactly it should be called. It could be in `prepare_for_config_serialization`, but I'm not sure. That might be an unexpected side effect when serializing the model.
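A minimal sketch of that delayed variant, again using the toy `ParamInit`/`Glorot` classes from above; the name `resolve_initial` and the point where it gets called are just placeholders for exactly the open question:

```python
import copy


class LazyInitParameter:
    """Toy variant where the ParamInit call is delayed (reuses ParamInit from above)."""

    def __init__(self, shape):
        self.shape = shape
        self._initial = None  # may hold a ParamInit object or already-evaluated values

    @property
    def initial(self):
        return self._initial

    @initial.setter
    def initial(self, value):
        self._initial = value  # no ParamInit call here anymore

    def resolve_initial(self):
        # Hypothetical later trigger point (e.g. prepare_for_config_serialization);
        # only here the ParamInit actually gets called.
        if isinstance(self._initial, ParamInit):
            self._initial = self._initial(self.shape)
        return self._initial


weight = LazyInitParameter((2, 3))
weight.initial = Glorot()  # just stored, not evaluated
copies = [copy.deepcopy(weight) for _ in range(2)]
values = [c.resolve_initial() for c in copies]
assert values[0] != values[1]  # each copy gets its own independent draw
```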