pytorch / vision

Datasets, Transforms and Models specific to Computer Vision
https://pytorch.org/vision
BSD 3-Clause "New" or "Revised" License

Allow Activation Layer of MLP In SwinTransformer to be customizable #6558

Open oke-aditya opened 2 years ago

oke-aditya commented 2 years ago

🚀 The feature

Currently we do

self.mlp = MLP(dim, [int(dim * mlp_ratio), dim], activation_layer=nn.GELU, inplace=None, dropout=dropout)

But according to the original implementation, it is:

https://github.com/microsoft/Swin-Transformer/blob/afeb877fba1139dfbc186276983af2abb02c2196/models/swin_transformer.py#L197


    def __init__(self, dim, input_resolution, num_heads, window_size=7, shift_size=0,
                 mlp_ratio=4., qkv_bias=True, qk_scale=None, drop=0., attn_drop=0., drop_path=0.,
                 act_layer=nn.GELU, norm_layer=nn.LayerNorm,
                 fused_window_process=False):

Motivation, pitch

I think it might be better to align with the original implementation by adding an extra init param activation_layer and allowing that flexibility.
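
A minimal sketch of what this could look like, assuming the current torchvision.ops.MLP signature; the class TinySwinMLPBlock is hypothetical and only covers the norm + MLP part of the block, not the full SwinTransformerBlock:

    import torch
    import torch.nn as nn
    from torchvision.ops import MLP

    class TinySwinMLPBlock(nn.Module):
        """Hypothetical illustration: only the norm + MLP residual path of a Swin block."""

        def __init__(self, dim, mlp_ratio=4.0, dropout=0.0, activation_layer=nn.GELU):
            super().__init__()
            self.norm = nn.LayerNorm(dim)
            # Defaulting activation_layer to nn.GELU keeps today's behavior, so the
            # new parameter would be backwards compatible.
            self.mlp = MLP(dim, [int(dim * mlp_ratio), dim],
                           activation_layer=activation_layer, inplace=None, dropout=dropout)

        def forward(self, x):
            return x + self.mlp(self.norm(x))

    # Default call matches current behavior; a different activation can be injected if needed.
    default_block = TinySwinMLPBlock(dim=96)
    custom_block = TinySwinMLPBlock(dim=96, activation_layer=nn.SiLU)
    out = custom_block(torch.randn(2, 49, 96))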

Alternatives

No response

Additional context

No response

datumbox commented 2 years ago

@oke-aditya Thanks for the report. I believe this simplification is intentional.

Research repositories often contain lots of extra configurations used in ablation experiments. Before adding them to TorchVision, we typically simplify them by keeping only the configurations that are relevant for the canonical variants. In the original paper (section 3.1) we can see that the MLP is supposed to use a GELU activation. This remains constant across all variants, which is why we don't offer a configuration for it. Have you spotted any variant that makes use of a different activation? If yes, could you provide some references?

oke-aditya commented 2 years ago

Hi @datumbox, I went over the Swin Transformer paper and could not find any place where the MLP needs an activation other than nn.GELU. But I suppose we could still allow this flexibility.

If you look at the original implementation, it does. The Hugging Face implementation also has this flexibility: https://github.com/huggingface/transformers/blob/main/src/transformers/models/swin/modeling_swin.py#L564 (see also https://github.com/huggingface/transformers/blob/main/src/transformers/activations.py#L144).

It's probably a good simplification, but we could add this flexibility easily without breaking backwards compatibility.

datumbox commented 2 years ago

@oke-aditya Thanks for the references.

What are the specific use-cases we are trying to address? In which cases would we want different versions of the activation function? I'm asking because quite often, reference implementations make the decision to have a specific part of the model configurable to make it possible to run ablation experiments. While porting models, we usually strip the configurations that were originally only used for ablation experiments in order to simplify the code. There are literally hundreds of things we could make configurable in each model, but we try to keep the architecture design simple and as close to the paper as possible.

Of course, there are cases where we do introduce configurations even for things that the paper doesn't really account for. Good examples are being able to change normalization layers or supporting dilation for Object Detection and Segmentation backbones. If there is a specific use-case where something other than GELU is necessary, then it's worth considering. But if not, I recommend waiting until there is a specific need to address. Users who just want to experiment with different activation functions can use a model surgery approach, similar to how we replace in-place ReLUs on Quantization (see here).
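
For completeness, here is a hedged sketch of that model-surgery approach on an already-built torchvision Swin model. replace_activation is a helper written here purely for illustration (it is not a torchvision API), and nn.SiLU is just an arbitrary stand-in for GELU:

    import torch.nn as nn
    from torchvision.models import swin_t

    def replace_activation(module: nn.Module, old_cls: type, new_factory) -> None:
        """Recursively swap every instance of old_cls for a freshly constructed module."""
        for name, child in module.named_children():
            if isinstance(child, old_cls):
                setattr(module, name, new_factory())
            else:
                replace_activation(child, old_cls, new_factory)

    model = swin_t()
    replace_activation(model, nn.GELU, nn.SiLU)  # e.g. try SiLU instead of GELU in an experiment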

Happy to talk more about this. Let me know your thoughts.

oke-aditya commented 2 years ago

While this was just an inconsistency I spotted compared with other implementations, it has opened up more questions than expected.

What are the specific use-cases we are trying to address?

I think ablation studies, or ease of modification. This is just a small inconsistency, nothing special as such.

quite often, reference implementations make the decision to have a specific part of the model configurable to make it possible to run ablation experiments

So yes, they do this to keep the model configurable and extendable. It may be because the authors themselves were still figuring things out. And I agree that many blocks could be made configurable depending on customization needs and requirements.

Other than GELU, I cannot find any use case as of now, but I'm not sure; we can wait for researchers :)