google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.

question about FlexiViT #30

Closed KimWu1994 closed 9 months ago

KimWu1994 commented 1 year ago

FlexiViT is a very imaginative work. I have also been bothered by the problem of flexible patch sizes. I want to know how PI-resize from Section 3.4 is implemented in the code, and how PI-resize is optimized during training.

lucasb-eyer commented 1 year ago

Hi, thanks for your interest!

The implementation of PI-resize during training is here: https://github.com/google-research/big_vision/blob/main/big_vision/models/proj/flexi/vit.py#L30-L75

In words: PI-resize does not introduce any new trainable parameters. You define a learnable parameter for the patch embedding just like in regular ViT: pick any patch size (it doesn't really matter which; we use 32x32), so allocate a 32x32x3x[model-dim] buffer. Then, before passing it to the conv operation for patch embedding, multiply it by the PI-resize matrix. That matrix can be computed analytically once at the start and is not trained; see the code pointer above.
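For concreteness, here is a minimal sketch of that idea (not the exact big_vision code; the helper names `compute_pi_resize_matrix` and `resample_patch_embed` are made up for illustration). Following Section 3.4 of the paper, it recovers the linear resize map B by resizing each basis vector of the old patch grid, then maps the learned kernel through pinv(Bᵀ):

```python
import jax
import jax.numpy as jnp
import numpy as np


def compute_pi_resize_matrix(old_hw, new_hw, method="bilinear"):
  """Returns pinv(B^T), where B is the linear map of `jax.image.resize`.

  Hypothetical helper for illustration; the real implementation lives in
  big_vision/models/proj/flexi/vit.py.
  """
  # Recover B column by column: resize each basis vector of the old grid.
  cols = []
  for i in range(np.prod(old_hw)):
    basis = np.zeros(old_hw)
    basis[np.unravel_index(i, old_hw)] = 1.0
    resized = jax.image.resize(jnp.asarray(basis), new_hw, method)
    cols.append(np.asarray(resized).reshape(-1))
  b_mat = np.stack(cols, axis=1)  # [new_h*new_w, old_h*old_w]
  # PI-resize: w_new = pinv(B^T) @ w_old. Computed once, never trained.
  return jnp.asarray(np.linalg.pinv(b_mat.T))  # [new, old]


def resample_patch_embed(kernel, new_hw, method="bilinear"):
  """Applies PI-resize to an [h, w, in_ch, out_ch] patch-embedding kernel."""
  h, w, _, _ = kernel.shape
  pinv_mat = compute_pi_resize_matrix((h, w), new_hw, method)

  def resample_one(k):  # k: one [h, w] slice per (in_ch, out_ch) pair
    return (pinv_mat @ k.reshape(-1)).reshape(new_hw)

  # Map over the two channel axes.
  v_resample = jax.vmap(jax.vmap(resample_one, in_axes=2, out_axes=2),
                        in_axes=3, out_axes=3)
  return v_resample(kernel)


# Usage sketch: the 32x32 kernel is the trainable parameter; gradients flow
# through the fixed PI-resize matmul back into it.
params = jnp.zeros((32, 32, 3, 768))  # learnable patch-embedding buffer
kernel_16 = resample_patch_embed(params, (16, 16))
# ... feed kernel_16 to the patch-embedding conv with stride 16 ...
```

In practice you would precompute the pinv matrix once per candidate patch size at startup (the `np.linalg.pinv` call should not sit inside a jitted training step), and at each step sample a patch size and pick the matching matrix.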

I'm not sure what loss you mean: there is no need to change whatever loss you are using when "flexifying" your training loop.