google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.

Confusion on FlexiViT #94

Open zilunzhang opened 6 months ago

zilunzhang commented 6 months ago

Hi, thanks for sharing such great work! I have two questions regarding the paper.

  1. Since the PI-resize method does not introduce any learnable parameters, it should be compatible with any ViT model. Does that mean we can apply PI-resize in a zero-shot manner? If so, what is the point of training FlexiViT? I understand that, because the patch size can be (almost) arbitrary with PI-resize, we can transfer the knowledge of ViT-8 through distillation. But is there any difference between training a FlexiViT and applying PI-resize directly to a ViT-8 model (without any training)? In Figure 3, the authors mention that "Standard ViTs (ViT-16/ViT-30) are not flexible", but there they "simply resize the patch embedding weights ω and the position embeddings π with bilinear interpolation", not with PI-resize. (To make this concrete, I include a rough sketch of my understanding of PI-resize after question 2 below.)

  2. Will the weights of FlexiCLIP be released at some point?
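
For reference, here is a minimal sketch of how I understand PI-resize from the paper: build the linear operator B that performs bilinear resizing of a flattened patch, then map the patch-embedding weights ω through the pseudo-inverse of Bᵀ, so that the resized patches produce (approximately) the same tokens. The function names `pi_resize_patch_embed` and `_resize_matrix` are my own for illustration, not the big_vision API:

```python
import jax
import jax.numpy as jnp
import numpy as np


def _resize_matrix(old_size, new_size):
  """Builds B such that resize(x) == B @ x.reshape(-1) for bilinear resize."""
  cols = []
  for i in range(old_size[0] * old_size[1]):
    # Push each basis vector through the resize op to get one column of B.
    basis = jnp.zeros(old_size).at[np.unravel_index(i, old_size)].set(1.0)
    cols.append(jax.image.resize(basis, new_size, method="bilinear").reshape(-1))
  return jnp.stack(cols, axis=1)  # shape (new_h * new_w, old_h * old_w)


def pi_resize_patch_embed(w, new_size):
  """PI-resize of patch-embedding weights w with shape (h, w, c, d)."""
  h, wd, c, d = w.shape
  B = _resize_matrix((h, wd), new_size)   # (P_new, P_old)
  P = jnp.linalg.pinv(B.T)                # (P_new, P_old), i.e. (B^T)^+
  w_new = P @ w.reshape(h * wd, c * d)    # least-squares: B^T @ w_new ≈ w
  return w_new.reshape(*new_size, c, d)


# e.g. adapt the 8x8 patch-embedding weights of a ViT-8 to 16x16 patches,
# with no retraining:
# w16 = pi_resize_patch_embed(w8, (16, 16))
```

This is exactly why I am asking question 1: the mapping above has no learnable parameters, so it seems it could be applied to any pretrained ViT as-is.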

Thanks, I am really looking forward to the answers!

Best,

Zilun