Closed: Atticus1806 closed this issue 1 year ago
Sorry, I missed the meeting. What was the reasoning on this? This changes the code style away from standard PyTorch code style. Why do you want that? I think this is a bad idea.
Also, you now don't allow a plain callable anymore, like the functions from torch.nn.functional. Why would you want that change?
It caused parameters to be shared in all places where LayerNorm was used.
> Sorry, I missed the meeting. What was the reasoning on this?
Expecting an nn.Module can lead to bugs, because the Module instance would only be created once when setting up the config and then be passed to every layer that uses it, instead of each layer having an independent instance of the module. It caused parameters to be shared in all places where LayerNorm was used.
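For illustration, a minimal sketch of how the sharing happens (the Block class and sizes here are made up, not our actual code):

```python
import torch.nn as nn

# One LayerNorm instance is created when the config is set up ...
shared_norm = nn.LayerNorm(16)

class Block(nn.Module):
    def __init__(self, norm: nn.Module):
        super().__init__()
        self.norm = norm  # ... and every Block stores that same instance

    def forward(self, x):
        return self.norm(x)

blocks = nn.ModuleList([Block(shared_norm) for _ in range(3)])

# All blocks point at the very same weight tensor, i.e. shared parameters.
assert blocks[0].norm.weight is blocks[2].norm.weight
print(len(list(blocks.parameters())))  # 2 (one weight, one bias), not 6
```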
Yes, people usually do a deepcopy somewhere to avoid this. E.g. see the official PyTorch Transformer implementation.
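Roughly the pattern the official implementation uses, sketched here with a get_clones helper modeled on it (treat the names as illustrative):

```python
import copy
import torch.nn as nn

def get_clones(module: nn.Module, n: int) -> nn.ModuleList:
    # deepcopy gives each clone its own, independent parameters
    return nn.ModuleList([copy.deepcopy(module) for _ in range(n)])

norm = nn.LayerNorm(16)
norms = get_clones(norm, 3)
assert norms[0].weight is not norms[1].weight  # no sharing anymore
```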
> This changes the code style away from standard PyTorch code style. Why do you want that? I think this is a bad idea.
What is standard PyTorch code style? As I understand it, standard would mean you just define the modules as-is. We decided against that for the sake of modularity, but now we have seen that the way we did it is not a good approach. You mention the PyTorch Transformer, but there LayerNorm is hard-coded!
https://github.com/pytorch/pytorch/blob/main/torch/nn/modules/transformer.py#L503
Only the final norm of the encoder/decoder is configurable, and that appears only once. For the layers themselves you pass layer_norm_eps as an argument, so this is clearly something we decided against.
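For reference, this is about all the norm configuration the stock layer exposes (the numbers are just placeholders):

```python
import torch.nn as nn

# Only the epsilon of the hard-coded LayerNorm is configurable per layer.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, layer_norm_eps=1e-5)
```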
I'm not speaking about the norm in the PyTorch Transformer; I'm speaking about the generic code style convention of passing an instance of an nn.Module around, rather than such a factory function. This is what you see, for example, in the PyTorch Transformer: in TransformerEncoder the encoder_layer is passed as-is, and copies of it are then made with deepcopy so that you don't end up with shared parameters. I actually explained in the last meeting, when we decided on this code style, that you need to do a deepcopy somewhere. I guess this was forgotten.
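Concretely, what I mean (dimensions are placeholders):

```python
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)
# TransformerEncoder deep-copies encoder_layer num_layers times internally,
# so the stacked layers do not share parameters.
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)
assert encoder.layers[0].linear1.weight is not encoder.layers[1].linear1.weight
```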
This generic code style convention of passing an instance of an nn.Module around is something you see quite often in PyTorch code. That said, I was checking a bit how this problem is usually avoided, and I found that either there is a deepcopy somewhere, or at the layer level people make sure to create new instances rather than reusing the shared one.
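For comparison, the layer-level variant with a factory callable would look roughly like this (norm_factory and Block are hypothetical names, not taken from our code):

```python
import torch.nn as nn

def norm_factory(dim: int) -> nn.Module:
    return nn.LayerNorm(dim)

class Block(nn.Module):
    def __init__(self, dim: int, make_norm=norm_factory):
        super().__init__()
        self.norm = make_norm(dim)  # a fresh instance per layer, no deepcopy needed

    def forward(self, x):
        return self.norm(x)

blocks = nn.ModuleList([Block(16) for _ in range(3)])
assert blocks[0].norm.weight is not blocks[1].norm.weight
```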
Ah, your example was meant for copying the layer. Well, since you can pass any callable through deepcopy, I guess we can actually do that (for the norm), provided that any nn.Module is generally safe to deepcopy.
So shall we discuss this on Thursday?
As agreed on today.