gudrb opened this issue 4 months ago
Hi @gudrb, thanks for your attention to our work!
In Mini-DeiT, the transformation for the MLP is the relative position encoding: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L117
In Mini-Swin, the transformation for the MLP is the depth-wise convolution layer: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-Swin/models/swin_transformer_minivit.py#L275
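For intuition, here is a minimal sketch of how a depth-wise convolution can act as the transformation before the MLP, as in Mini-Swin. The module and shape names below are illustrative, not the exact repository code:

```python
import torch
import torch.nn as nn

class DWConvTransform(nn.Module):
    """Illustrative depth-wise conv transformation applied to the token
    sequence before the MLP (a sketch of the Mini-Swin idea)."""
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        # groups=dim makes the convolution depth-wise: one filter per channel.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size,
                                padding=kernel_size // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N == H * W
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # tokens -> 2D feature map
        x = self.dwconv(x)                          # per-channel spatial mixing
        return x.flatten(2).transpose(1, 2)         # back to (B, N, C)

x = torch.randn(2, 49, 96)          # B=2, 7x7 tokens, C=96
y = DWConvTransform(96)(x, 7, 7)    # same shape as the input
```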
From the MiniViT paper:
> We make several modifications on DeiT: First, we remove the [class] token. The model is attached with a global average pooling layer and a fully-connected layer for image classification. We also utilize relative position encoding to introduce inductive bias to boost the model convergence [52,59]. Finally, based on our observation that transformation for FFN only brings limited performance gains in DeiT, we remove the block to speed up both training and inference.

-> Does this mean that in the Mini-DeiT model, iRPE is utilized (for the value), and the MLP transformation is removed, leaving only the attention transformation?
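For concreteness, the head change described in the quote (no [class] token; global average pooling followed by a fully-connected classifier) corresponds to a sketch like this, with illustrative names and sizes:

```python
import torch
import torch.nn as nn

# Sketch of the classification head from the quote: no [class] token;
# tokens are globally average-pooled, then classified by a linear layer.
# Names and sizes are illustrative, not the repository code.
embed_dim, num_classes = 192, 1000
head = nn.Linear(embed_dim, num_classes)

tokens = torch.randn(2, 196, embed_dim)   # (B, N, C), no [class] token prepended
pooled = tokens.mean(dim=1)               # global average pooling over tokens
logits = head(pooled)                     # (B, num_classes)
```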
Yes. I correct my statement: there is no transformation for FFN in Mini-DeiT, and iRPE is utilized only for the key. https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L97
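A rough sketch of what "iRPE on the key only" means for the attention logits (a simplification of the contextual product mode; `rel_k` and its shape are stand-ins, not the real bucketed implementation):

```python
import torch
import torch.nn.functional as F

# Rough sketch: relative position encoding applied only on the key side.
# The logits get an extra term from queries dotted with relative encodings;
# nothing is added on the value side.
B, H, N, d = 2, 3, 49, 32
q = torch.randn(B, H, N, d)
k = torch.randn(B, H, N, d)
v = torch.randn(B, H, N, d)
rel_k = torch.randn(N, N, d)  # one encoding per (query, key) relative offset

logits = (q @ k.transpose(-2, -1)) * d ** -0.5                  # content term
logits += torch.einsum('bhid,ijd->bhij', q, rel_k) * d ** -0.5  # key-side RPE term
attn = F.softmax(logits, dim=-1)
out = attn @ v  # the values are untouched by the position encoding
```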
Hello,
I have a question regarding the implementation of layer normalization in the MiniViT paper and the corresponding code. Specifically, I am referring to how layer normalization is applied between transformer blocks.
In the MiniViT paper, it is mentioned that layer normalization between transformer blocks is not shared, and I believe the code reflects this. However, I am confused about how `RepeatedModuleList` applies layer normalization multiple times and how it ensures that the normalizations are not shared.
Here is the relevant code snippet for the `MiniBlock` class: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L144
Thank you.
Hi @gudrb,
The following code creates a list of LayerNorm modules, where the number of LayerNorm modules is `repeated_times`:
https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L145-L146
`RepeatedModuleList` will select the `self._repeated_id`-th LayerNorm to forward. In `RepeatedMiniBlock`, `_repeated_id` is updated. Therefore, each LayerNorm, conv, and RPE is executed once, while the other (shared) modules are executed multiple times.
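A minimal sketch of this mechanism, under my reading of the code (the class names mirror the repository, but the bodies are simplified):

```python
import torch
import torch.nn as nn

class RepeatedModuleList(nn.ModuleList):
    """Holds `repeated_times` copies of a module (e.g. LayerNorm) and
    forwards through the copy selected by `_repeated_id`. Simplified sketch."""
    def __init__(self, repeated_times, module_cls, *args, **kwargs):
        super().__init__(module_cls(*args, **kwargs) for _ in range(repeated_times))
        self._repeated_id = 0

    def forward(self, x):
        return self[self._repeated_id](x)

class RepeatedMiniBlock(nn.Module):
    """Runs the shared weights `repeated_times` times, bumping `_repeated_id`
    so every pass uses its own (unshared) LayerNorm."""
    def __init__(self, dim, repeated_times):
        super().__init__()
        self.repeated_times = repeated_times
        self.norm = RepeatedModuleList(repeated_times, nn.LayerNorm, dim)
        self.shared_mlp = nn.Linear(dim, dim)  # stand-in for the shared weights

    def forward(self, x):
        for i in range(self.repeated_times):
            self.norm._repeated_id = i             # pick the i-th unshared LayerNorm
            x = x + self.shared_mlp(self.norm(x))  # shared weights reused each pass
        return x

x = torch.randn(2, 49, 96)
y = RepeatedMiniBlock(96, repeated_times=2)(x)  # same shape as the input
```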
Hello,
Thank you for your kind reply.
I noticed that Relative Position Encoding (RPE) is applied only to the key. In the MiniViT paper, I couldn't see it applied explicitly in the equations.
Does this mean that K_m^T already represents the image with the relative position applied (using the piecewise function, product method, contextual mode, and unshared)?
Thank you!
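For reference, my transcription of the piecewise bucketing function g(x) from the iRPE paper is below; the hyper-parameters alpha, beta, gamma and their values are illustrative, so please check the iRPE code for the exact form:

```python
import math

def piecewise_index(x, alpha=8, beta=16, gamma=32):
    """Piecewise bucketing of a relative distance x, as I read it from the
    iRPE paper: identity near zero, logarithmic (clipped) farther out."""
    if abs(x) <= alpha:
        return x  # short distances keep their exact offset
    # long distances are compressed logarithmically and clipped at beta
    sign = 1 if x > 0 else -1
    idx = alpha + math.log(abs(x) / alpha) / math.log(gamma / alpha) * (beta - alpha)
    return sign * min(beta, round(idx))

print([piecewise_index(x) for x in (-40, -10, -3, 0, 3, 10, 40)])
```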
Hi @gudrb, here is the application of the weight transformation.
In the equations provided in the MiniViT paper, is K_m^T actually representing (K'_m + r_m)^T, where r are trainable positional encodings? In the code, iRPE is used, but this is not explicitly shown in the equations from the paper. Could you confirm whether this interpretation is correct?
In Equation 7, we ignore the relative position encoding; iRPE is applied only in Mini-DeiT.
Hello, I have a question about the transformations in the MiniViT paper.
I could find the first transformation (implemented in the `MiniAttention` class) in the code: https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L104
However, I couldn't find the second transformation in the code (it should be before or inside the MLP in the `MiniBlock` class): https://github.com/microsoft/Cream/blob/4a13c4091e78f9abd2160e7e01c02e48c1cf8fb9/MiniViT/Mini-DeiT/mini_vision_transformer.py#L137
Could you please let me know where the second transformation is?
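For context, my rough understanding of that first transformation in `MiniAttention` is that the attention maps are recombined across heads by a small learned linear layer before being applied to the values. The sketch below is only an illustration of that idea under my reading, not the repository code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustration of an attention-map transformation: after softmax, the
# per-head attention maps are linearly mixed across the head dimension.
# This is my reading of the MiniAttention transformation, not the exact code.
B, H, N, d = 2, 3, 49, 32
q, k, v = (torch.randn(B, H, N, d) for _ in range(3))

attn = F.softmax((q @ k.transpose(-2, -1)) * d ** -0.5, dim=-1)  # (B, H, N, N)

head_mix = nn.Linear(H, H)  # learned mixing across the H heads
attn = head_mix(attn.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)    # mix head dim

out = attn @ v  # (B, H, N, d)
```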