google-research / big_vision

Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Apache License 2.0

Reproducing results for FlexiViT #77

Closed: liguopeng0923 closed this issue 5 months ago

liguopeng0923 commented 7 months ago

Hi,

I would like to know how to reproduce the results shown in the teaser figure of FlexiViT.

[image: FlexiViT teaser figure]

In the figure, an image is split into a 2×2 grid of patches, and the accuracy is 84.4%.

Best, Guopeng.

akolesnikoff commented 7 months ago

The teaser pic is just an illustration of the overall idea, showing how to train a single model with different patch sizes.

I believe the actual patch sizes are 24 and 4 respectively, which corresponds to many more tokens.
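To make the "many more tokens" point concrete, here is a minimal sketch of the arithmetic. The 240×240 input resolution is an assumption (it is not stated in this thread), and `num_tokens` is a hypothetical helper, not part of big_vision.

```python
def num_tokens(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches (tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size evenly")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# Assumption: 240x240 inputs; the thread does not state the resolution.
IMAGE_SIZE = 240

for patch_size in (24, 4):
    tokens = num_tokens(IMAGE_SIZE, patch_size)
    print(f"patch size {patch_size}: {tokens} tokens")

# For comparison, a literal 2x2 split as drawn in the teaser gives only 4 tokens.
print(f"2x2 split: {num_tokens(IMAGE_SIZE, IMAGE_SIZE // 2)} tokens")
```

Under that assumed resolution, patch size 24 gives a 10×10 grid and patch size 4 gives a 60×60 grid, so the drawn 2×2 split is far coarser than anything actually evaluated.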

liguopeng0923 commented 7 months ago

Fine, thanks.

Could you provide models trained with the standard ImageNet normalization, i.e. mean = [0.485, 0.456, 0.406] and std = [0.229, 0.224, 0.225]?

This is very important for our later work.
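For reference, a minimal sketch of what this normalization does per pixel, and of how already-normalized inputs could be mapped to a different input range without retraining. The [-1, 1] target range is an assumption about what the released checkpoints expect and is not confirmed in this thread; both function names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def imagenet_normalize(pixel):
    """Normalize one RGB pixel with values in [0, 255] using ImageNet mean/std."""
    return tuple(
        (c / 255.0 - m) / s
        for c, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD)
    )


def imagenet_to_symmetric_range(normalized):
    """Undo ImageNet normalization, then rescale [0, 1] to [-1, 1].

    Assumption: the target model expects inputs in [-1, 1].
    """
    return tuple(
        (v * s + m) * 2.0 - 1.0
        for v, m, s in zip(normalized, IMAGENET_MEAN, IMAGENET_STD)
    )


pixel = (0, 128, 255)
normalized = imagenet_normalize(pixel)
print(imagenet_to_symmetric_range(normalized))
```

Because both normalizations are per-channel affine maps of the same pixel values, converting between them is exact up to floating-point error, so an input pipeline can be adapted in front of an existing checkpoint.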