microsoft / Cream

This is a collection of our NAS and Vision Transformer work.

transformations in MiniViT paper #224

Open gudrb opened 4 months ago

gudrb commented 4 months ago

Hello, I have a question about the transformations in the MiniViT paper.

I could find the first transformation (implemented in the MiniAttention class) in the code: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L104

However, I couldn't find the second transformation in the code (which should be before or inside the MLP in the MiniBlock class): https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L137

Could you please let me know where the second transformation is?

wkcn commented 4 months ago

Hi @gudrb, thanks for your interest in our work!

In Mini-DeiT, the transformation for the MLP is the relative position encoding: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L117

In Mini-Swin, the transformation for the MLP is the depth-wise convolution layer: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-Swin/models/swin_transformer_minivit.py#L275
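
For intuition, here is a minimal sketch of the Mini-Swin-style transformation before the MLP: a depth-wise convolution over the tokens viewed as a 2-D grid. The class and argument names below are illustrative, not the repo's.

```python
import torch.nn as nn

class DepthwiseTokenConv(nn.Module):
    """Sketch of a per-block transformation for the MLP: a depth-wise 3x3
    convolution applied to the token sequence reshaped into an H x W map."""
    def __init__(self, dim):
        super().__init__()
        # groups=dim makes the convolution depth-wise (one filter per channel),
        # so the transformation stays lightweight compared to the shared MLP.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                          # tokens: (batch, H*W, channels)
        x = x.transpose(1, 2).reshape(B, C, H, W)  # back to a spatial map
        x = self.dwconv(x)                         # unshared, per-block weights
        return x.flatten(2).transpose(1, 2)        # back to (batch, H*W, channels)
```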

gudrb commented 4 months ago

From the MiniViT paper:

We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.

-> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?

wkcn commented 4 months ago

> From the MiniViT paper:
>
> We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.
>
> -> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?

Yes. I need to correct my earlier statement: there is no transformation for the FFN in Mini-DeiT, and iRPE is applied only to the key. https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L97

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_deit_models.py#L17
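
In other words, the key-side iRPE shows up as an extra positional bias added to the query-key logits before the softmax. A hedged sketch of that computation (rpe_k below stands in for the actual iRPE module; it is not the repo's exact interface):

```python
import torch

def attention_with_key_rpe(q, k, v, rpe_k=None):
    """q, k, v: (B, num_heads, N, head_dim).
    rpe_k, if given, maps q to a (B, num_heads, N, N) relative-position bias."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale   # content term q . k
    if rpe_k is not None:
        attn = attn + rpe_k(q)                 # positional term from iRPE on the key
    attn = attn.softmax(dim=-1)
    return attn @ v
```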

gudrb commented 2 days ago

Hello,

I have a question regarding the implementation of layer normalization in the MiniViT paper and the corresponding code. Specifically, I am referring to how layer normalization is applied between transformer blocks.

In the MiniViT paper, it is mentioned that layer normalization between transformer blocks is not shared, and I believe the code reflects this. However, I am confused about how the RepeatedModuleList applies layer normalization multiple times and how it ensures that the normalizations are not shared.

Here is the relevant code snippet for the MiniBlock class: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L144

Thank you.

wkcn commented 1 day ago

Hi @gudrb,

The following code creates a list of LayerNorm modules, whose length is repeated_times: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L145-L146

RepeatedModuleList selects the self._repeated_id-th LayerNorm for the forward pass.

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L28-L29

In RepeatedMiniBlock, _repeated_id is updated on each repeat. Therefore, each LayerNorm, convolution and RPE instance is executed only once, while the other (weight-shared) modules are executed multiple times.

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L174-L180
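
To make the sharing pattern concrete, here is a minimal sketch of the idea (simplified, with illustrative class names rather than the repo's): the heavy weights live in one shared block that is executed on every repeat, while a RepeatedModuleList-style container keeps one LayerNorm per repeat and forwards through the one selected by the current repeat index.

```python
import torch.nn as nn

class RepeatedNorms(nn.Module):
    """One LayerNorm per repeat; the forward pass uses only the LayerNorm
    selected by _repeated_id (sketch of the RepeatedModuleList idea)."""
    def __init__(self, dim, repeated_times):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(repeated_times)])
        self._repeated_id = 0

    def forward(self, x):
        return self.norms[self._repeated_id](x)

class SharedBlockRepeater(nn.Module):
    """Runs the same weight-shared block several times, switching to a
    different (unshared) LayerNorm on each repeat."""
    def __init__(self, block, dim, repeated_times):
        super().__init__()
        self.block = block                      # shared weights, executed every repeat
        self.norms = RepeatedNorms(dim, repeated_times)
        self.repeated_times = repeated_times

    def forward(self, x):
        for i in range(self.repeated_times):
            self.norms._repeated_id = i         # this repeat's own LayerNorm
            x = x + self.block(self.norms(x))   # pre-norm residual, as in ViT blocks
        return x
```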

gudrb commented 1 day ago

Hello,

Thank you for your kind reply.

I noticed that the Relative Position Encoding (RPE) is applied only to the key. In the MiniViT paper, I couldn't see it applied explicitly in the equations.

(screenshot of the paper's attention equations) Does this mean that K^T_m already has the relative position encoding applied (using the piecewise function, product method, contextual mode, and unshared)?

Thank you!

wkcn commented 9 hours ago

Hi @gudrb, here is where the weight transformation is applied.

https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L103-L109
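
For readers without the code open, that kind of attention-side transformation can be pictured as a small learnable mixing of the attention maps across heads, so blocks that share the same attention weights can still behave slightly differently. This is a hedged sketch of the idea with illustrative names, not the repo's exact implementation:

```python
import torch.nn as nn

class HeadMix(nn.Module):
    """Sketch of an attention-side weight transformation: a learnable
    linear mixing of the per-head attention maps."""
    def __init__(self, num_heads):
        super().__init__()
        self.mix = nn.Linear(num_heads, num_heads)

    def forward(self, attn):
        # attn: (B, num_heads, N, N); mix along the head dimension
        attn = attn.permute(0, 2, 3, 1)    # (B, N, N, num_heads)
        attn = self.mix(attn)
        return attn.permute(0, 3, 1, 2)    # back to (B, num_heads, N, N)
```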

gudrb commented 8 hours ago

(screenshot of the paper's attention equations) In the equations provided in the MiniViT paper, is K_m^T actually representing (K'_m + r_m)^T, where r_m are trainable positional embeddings? In the code, iRPE is used, but it is not explicitly shown in the equations from the paper. Could you confirm if this interpretation is correct?

wkcn commented 8 hours ago

In Equation 7, we omit the relative position encoding. iRPE is applied only in Mini-DeiT.
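
One hedged way to make the key-side iRPE explicit in the attention logits (my notation, not the paper's) is as an additive bias B_m collecting the contextual product terms between each query and the trainable relative position embeddings:

```latex
\mathrm{Attn}(Q_m, K_m, V_m)
  = \mathrm{softmax}\!\left(\frac{Q_m K_m^{\top} + B_m}{\sqrt{d}}\right) V_m,
\qquad
(B_m)_{ij} = q_{m,i}\, r_{\mathrm{rel}(i,j)}^{\top}
```

Here r_rel(i,j) denotes the trainable relative position embedding for the bucketed offset between tokens i and j; dropping B_m recovers the plain attention written in the paper.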