google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Apache License 2.0

Question About Listed ViT Models in the configs/proj/flexivit/README.md #31

Closed mhamzaerol closed 1 year ago

mhamzaerol commented 1 year ago

Hello,

First of all, thank you very much for releasing so many helpful materials and code samples for the interesting FlexiViT work.

When I went through the paper, the models referred to as ViT-B-16 and ViT-B-30 seem to be the baseline ViT models trained with fixed patch sizes (16 and 30, respectively). Accordingly, their positional embedding grid sizes should be 15 and 8 if I am not wrong (img_size divided by patch_size). However, when I downloaded and loaded the .npz files of these models from the README, I found that the patch size and the positional embedding grid size were 32 and 7, which matches the setup of the FlexiViT-B model described in the paper but not that of the baseline ViT models (given my understanding).
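For context, this is roughly the check I ran (the filename below is a placeholder for the downloaded checkpoint, and the key filter is just my guess at the relevant parameter names):

```python
import numpy as np

# Placeholder filename; I used one of the .npz checkpoints linked in
# configs/proj/flexivit/README.md.
ckpt = np.load("vit_b16.npz")
for name in ckpt.files:
  if "embedding" in name or "pos" in name:
    print(name, ckpt[name].shape)

# I expected a patch-embedding kernel with a 16x16 (or 30x30) spatial extent
# and a 15x15 (or 8x8) position grid, but the shapes instead corresponded to
# patch size 32 and a 7x7 grid.
```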

Thus, I was curious whether the links point to the wrong models, or whether I misunderstood the setup described in the paper for these models.

Could you please help me with this matter? Thanks!

lucasb-eyer commented 1 year ago

Hi, thanks for your interest and the question!

You almost got it. For simplicity/uniformity of implementation, we also used the "underlying" patch and posemb sizes of 32 and 7 for the baseline models. Figures 17 (b) and (c) in the appendix show that this change has absolutely no effect on the results, even for regular (non-flexi) ViT models.

So, for the patch embeddings, you can simply resize them to 16 and 30 at load time with PI-resize, and for the position embeddings, resize them the usual way at load time, i.e. with (bi)linear interpolation. The code does both here: https://github.com/google-research/big_vision/blob/main/big_vision/models/proj/flexi/vit.py#L198-L206
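For illustration, here is a minimal NumPy/JAX sketch of those two steps. The function names and the (1, h, w, d) posemb layout are my own for this example (the checkpoint may store the position embedding flattened as (1, h*w, d), in which case reshape it first); the linked vit.py is the authoritative implementation:

```python
import jax
import numpy as np

def _resize(x, shape):
  # Bilinear resize via jax.image.resize, returned as a NumPy array.
  return np.asarray(jax.image.resize(x, shape, method="bilinear"))

def _resize_matrix(old_hw, new_hw):
  # Matrix B such that B @ patch.reshape(-1) == resize(patch).reshape(-1),
  # built by resizing each one-hot basis "image".
  cols = []
  for i in range(int(np.prod(old_hw))):
    basis = np.zeros(old_hw, dtype=np.float32)
    basis[np.unravel_index(i, old_hw)] = 1.0
    cols.append(_resize(basis, new_hw).reshape(-1))
  return np.stack(cols, axis=1)  # (new_h*new_w, old_h*old_w)

def pi_resize_patch_embed(kernel, new_hw):
  # PI-resize a patch-embedding kernel of shape (h, w, c_in, c_out):
  # new_kernel = pinv(B.T) @ old_kernel, so tokens computed on resized
  # patches match tokens computed on the original patches.
  old_hw = kernel.shape[:2]
  b_pinv = np.linalg.pinv(_resize_matrix(old_hw, new_hw).T)
  flat = kernel.reshape(int(np.prod(old_hw)), -1)  # (h*w, c_in*c_out)
  return (b_pinv @ flat).reshape(*new_hw, *kernel.shape[2:])

def resize_posemb(posemb, new_hw):
  # Plain bilinear resize for position embeddings laid out as (1, h, w, d).
  return _resize(posemb, (1, *new_hw, posemb.shape[-1]))

# Example with ViT-B shapes (zero arrays, just to show the shapes involved):
# stored 32px kernel and 7x7 posemb grid, resized for a /16 model at 240px
# input, i.e. a 15x15 position grid.
kernel16 = pi_resize_patch_embed(np.zeros((32, 32, 3, 768), np.float32), (16, 16))
posemb15 = resize_posemb(np.zeros((1, 7, 7, 768), np.float32), (15, 15))
print(kernel16.shape, posemb15.shape)  # (16, 16, 3, 768) (1, 15, 15, 768)
```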

To be clear, I did not go and double-check the checkpoints just now (though I believe I verified them when originally uploading them), so do let me know if they somehow don't work.

mhamzaerol commented 1 year ago

Hi,

After carefully checking the relevant parts of the paper and the code you pointed to, it all makes sense to me now. Thank you very much for the response.