mlpen / Nystromformer


Influence of the "conv_kernel_size" within the proposed Nystrom Attention #5

Open PkuRainBow opened 3 years ago

PkuRainBow commented 3 years ago

Congrats on your great work!

I am verifying your method on vision tasks and have a small question about the influence of the "conv_kernel_size" of the 2D group convolution in your implementation; I notice that you choose relatively large values such as 35.

In vision tasks, applying a convolution with such a large kernel size is typically meant to ensure a larger receptive field, yet the proposed Nystrom attention already has the capability to model long-range context, just like the original Multi-Head Attention. In short, I am a bit confused about the motivation for such a design.

Another important concern: _should we set num_landmarks equal to the feature map width, given that image feature maps have a grid structure?_

It would be great if you could share your advice on the influence of this parameter!

https://github.com/mlpen/Nystromformer/blob/effde255e6b38282840d0b4b620002579b32e4a2/code/attention_nystrom.py#L23-L28

https://github.com/mlpen/Nystromformer/blob/2bcc280c8cc3ab834e0c5ead2520a5872adbe348/LRA/code/lra_config.py#L46-L52
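For reference, here is a minimal sketch of the kind of grouped-convolution skip connection being asked about. It is illustrative only, not the repository's exact code; the tensor shapes and the `nn.Conv2d` arguments below are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the repo's exact code): a depthwise/grouped 2D conv
# applied to the value tensor and added to the attention output, so each head
# only mixes information within a local window of neighboring tokens.
num_head, head_dim, seq_len, conv_kernel_size = 8, 64, 1024, 35

conv = nn.Conv2d(
    in_channels=num_head,
    out_channels=num_head,
    kernel_size=(conv_kernel_size, 1),   # slides along the sequence axis only
    padding=(conv_kernel_size // 2, 0),  # keeps the sequence length unchanged
    bias=False,
    groups=num_head,                     # one filter per head, no cross-head mixing
)

V = torch.randn(2, num_head, seq_len, head_dim)        # (batch, heads, tokens, dim)
attn_out = torch.randn(2, num_head, seq_len, head_dim) # approximate attention output
out = attn_out + conv(V)                               # local skip connection over V
print(out.shape)                                       # torch.Size([2, 8, 1024, 64])
```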

yyxiongzju commented 3 years ago

@PkuRainBow, thanks for your interest. There have been several deadlines recently, so sorry for the late reply.

The basic ideas behind this design are: (a) it helps with faster training; (b) it helps capture local details without requiring many landmarks on language modeling tasks. For vision tasks, we found that local details are not as important as they are in language modeling, so I recommend reducing the kernel size to a small number or dropping the convolution entirely. For example, I ran a trained T2T-ViT-t-14 on ImageNet for inference, without retraining, by replacing the self-attention in the T2T module with Nystromformer using kernel size = 0 and num_landmarks = 64, and it works pretty well: 78% top-1 accuracy. For Performer it is 73.7%, and for Linformer it is 65.3%.

With respect to setting num_landmarks, it may depend on your task. The more landmarks you use, the more accurately you approximate standard self-attention. Based on my experience, num_landmarks = 64 works well for image classification. If you want higher accuracy, you can try increasing num_landmarks, e.g. to 128.
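For reference, a minimal sketch of a landmark-based approximation along these lines, which makes the landmark-count trade-off concrete. It is illustrative only, not the repository's implementation: it uses segment-mean landmarks, an exact `torch.linalg.pinv` instead of the paper's iterative pseudoinverse, no masking, and it assumes the sequence length is divisible by `num_landmarks`.

```python
import torch

def nystrom_attention(Q, K, V, num_landmarks=64):
    # Simplified landmark-based attention approximation (sketch only).
    B, H, N, D = Q.shape
    scale = D ** -0.5
    m = num_landmarks
    # Segment-mean landmarks: average queries/keys over N // m contiguous chunks.
    Q_l = Q.reshape(B, H, m, N // m, D).mean(dim=-2)
    K_l = K.reshape(B, H, m, N // m, D).mean(dim=-2)

    kernel_1 = torch.softmax(Q @ K_l.transpose(-1, -2) * scale, dim=-1)    # (B, H, N, m)
    kernel_2 = torch.softmax(Q_l @ K_l.transpose(-1, -2) * scale, dim=-1)  # (B, H, m, m)
    kernel_3 = torch.softmax(Q_l @ K.transpose(-1, -2) * scale, dim=-1)    # (B, H, m, N)

    # More landmarks -> the m x m landmark kernel captures the full softmax
    # kernel more faithfully, at the cost of extra compute.
    return kernel_1 @ torch.linalg.pinv(kernel_2) @ (kernel_3 @ V)

Q = K = V = torch.randn(2, 8, 1024, 64)
out = nystrom_attention(Q, K, V, num_landmarks=64)  # (2, 8, 1024, 64)
```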

PkuRainBow commented 3 years ago

@yyxiongzju Thanks for your detailed explanation!

Based on your comments, we expect your method to be promising for vision transformer tasks, and we are wondering whether you have tried retraining with DeiT or T2T-ViT, replacing all the MHSA with the Nystrom scheme.

In fact, I have tried such a change, replacing all the MHSA with the Nystrom scheme (based on DeiT for ImageNet classification), but I find that the loss becomes NaN at an early stage of training. It would be great if you could share your comments!
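For reference, a hypothetical sketch of such a swap on a timm DeiT model. The `NystromSelfAttention` module below is a simplified stand-in (segment-mean landmarks, exact pseudoinverse, no conv branch), and the `blocks[i].attn` attribute layout is an assumption about timm's ViT/DeiT structure that may differ across versions. Note also that the token count (e.g. 197 for `deit_small_patch16_224`, 196 patches plus the class token) is not divisible by 64 landmarks, so a real implementation needs sequence padding or a different landmark count.

```python
import timm
import torch
import torch.nn as nn

class NystromSelfAttention(nn.Module):
    """Hypothetical drop-in replacement for timm's Attention module, using a
    segment-mean landmark approximation of softmax attention (sketch only)."""
    def __init__(self, dim, num_heads=8, num_landmarks=64):
        super().__init__()
        self.num_heads = num_heads
        self.num_landmarks = num_landmarks
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3, bias=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                   # each (B, H, N, d)
        m = self.num_landmarks
        # N must be divisible by m here; a real implementation pads the sequence.
        q_l = q.reshape(B, self.num_heads, m, N // m, -1).mean(dim=-2)
        k_l = k.reshape(B, self.num_heads, m, N // m, -1).mean(dim=-2)
        scale = self.head_dim ** -0.5
        k1 = torch.softmax(q @ k_l.transpose(-1, -2) * scale, dim=-1)
        k2 = torch.softmax(q_l @ k_l.transpose(-1, -2) * scale, dim=-1)
        k3 = torch.softmax(q_l @ k.transpose(-1, -2) * scale, dim=-1)
        out = k1 @ torch.linalg.pinv(k2) @ (k3 @ v)            # (B, H, N, d)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))

# Replace every MHSA module in a DeiT with the approximation above.
model = timm.create_model("deit_small_patch16_224", pretrained=False)
for block in model.blocks:
    dim = block.attn.qkv.in_features
    block.attn = NystromSelfAttention(dim,
                                      num_heads=block.attn.num_heads,
                                      num_landmarks=64)
```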

yyxiongzju commented 3 years ago

@PkuRainBow, I did not retrain DeiT or T2T-ViT with the Nystrom scheme, but I did try it on object detection and it works well. I have shared the code for using T2T-ViT with the Nystrom scheme, with all the MHSA replaced.

I saw the NaN issue reported in the original T2T-ViT GitHub repo. Can you run the code I shared and see whether it works for you?

PkuRainBow commented 3 years ago

@yyxiongzju Thanks for your reply! I will try the shared code soon.