huggingface / open-muse

Open reproduction of MUSE for fast text2image generation.
https://huggingface.co/openMUSE
Apache License 2.0

GC-VIT #62

Closed isamu-isozaki closed 1 year ago

isamu-isozaki commented 1 year ago

Add NVIDIA's new SOTA ViT (GC-ViT). From here. The original code is under a non-commercial license, but the timm variant linked above is available for us to use.

isamu-isozaki commented 1 year ago

This seems very similar to MaxViT. I think it is a similar idea with different architecture/conv choices and hyperparameters. However, its performance doesn't seem to differ much from MaxViT's.

On ImageNet-1k, this model reaches a classification accuracy of 85.6% with 201M params, while a 212M-param MaxViT reaches 85.17%.
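To make the comparison concrete, here is a minimal sketch that just does the arithmetic on the two figures quoted above. The dictionary layout and the `compare` helper are hypothetical names for illustration only. The numbers come from this comment, not from a fresh benchmark.

```python
# Figures quoted in the comment above: ImageNet-1k top-1 accuracy (%)
# and parameter count in millions. Nothing here is re-measured.
models = {
    "GC-ViT": {"top1": 85.6, "params_m": 201},
    "MaxViT": {"top1": 85.17, "params_m": 212},
}


def compare(a, b):
    """Return (accuracy gap in points, parameter gap in millions) for a minus b."""
    acc_gap = round(models[a]["top1"] - models[b]["top1"], 2)
    param_gap = models[a]["params_m"] - models[b]["params_m"]
    return acc_gap, param_gap


acc_gap, param_gap = compare("GC-ViT", "MaxViT")
print(f"GC-ViT vs MaxViT: {acc_gap:+.2f} points top-1, {param_gap:+d}M params")
```

The gap is under half a point at a similar parameter budget, which is why the two models look roughly equivalent at 224 resolution.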

Also, this architecture doesn't seem to have been tested at image sizes above 224^2, whereas MaxViT can reach an accuracy of 86.7% at resolution 512.

Overall, I don't think there's any urgency to add this for now.

isamu-isozaki commented 1 year ago

I'll close this for now, as MaxViT might be enough.