Open rwightman opened 1 year ago
Going through the list, I tried several flavors of ViT/Swin, including lower-capacity Swin models (these don't fare as well as 384, even with interpolated pos emb). Overall, the sporadic instability of ViT at very high resolutions (#9) is a bottleneck for trying out these models, but any feedback on these configurations will be helpful later on.
There are many possible vision architectures to try other than the Donut choice of Swin v1 or the common choice of vanilla (or modified) ViT. We should make an effort to explore the options as we run experiments.
NOTE: exact weight instances TBD
vit_base_patch16_224.augreg_in21k
vit_base_patch16_clip_224.datacompxl
eva02_base_patch14_224.mim_in22k
convnext_base.clip_laiona_augreg_320
convnext_base.fb_in22k
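For anyone wanting to try these, here is a minimal sketch of how the candidates above might be enumerated and instantiated. It assumes the names follow timm's `architecture.pretrained_tag` convention and that `timm` is installed. `CANDIDATES`, `split_name`, and `build_encoder` are hypothetical helper names for illustration, not part of this repo.

```python
# Candidate encoders copied from the list above; exact weight instances are still TBD.
CANDIDATES = [
    "vit_base_patch16_224.augreg_in21k",
    "vit_base_patch16_clip_224.datacompxl",
    "eva02_base_patch14_224.mim_in22k",
    "convnext_base.clip_laiona_augreg_320",
    "convnext_base.fb_in22k",
]


def split_name(name):
    """Split an 'architecture.pretrained_tag' name into (architecture, tag)."""
    arch, _, tag = name.partition(".")
    return arch, tag


def build_encoder(name, pretrained=False, **kwargs):
    """Create a headless encoder with timm (hypothetical helper).

    timm is imported lazily so the listing below works without it installed.
    num_classes=0 removes the classifier head; pretrained=True downloads weights.
    """
    import timm  # assumption: installed via `pip install timm`

    return timm.create_model(name, pretrained=pretrained, num_classes=0, **kwargs)


if __name__ == "__main__":
    # Print each candidate as an architecture / pretrained-tag pair.
    for name in CANDIDATES:
        arch, tag = split_name(name)
        print(f"{arch:28s} {tag}")
```

Calling `build_encoder(name, pretrained=True)` for each entry should give a feature extractor to swap into an experiment. Resolution handling differs between the ViT-style and ConvNeXt models, so any input-size arguments are left to `**kwargs` rather than assumed here.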
Possibly