Closed: isamu-isozaki closed this issue 1 year ago
This seems very similar to MaxViT: a similar idea, but with different architecture/convolution choices and hyperparameters. However, its performance doesn't differ much from MaxViT's.
On ImageNet-1K, this model reaches 85.6% classification accuracy with 201M parameters, while a 212M-parameter MaxViT reaches 85.17%.
Also, this architecture doesn't seem to have been tested at image sizes above 224^2, whereas MaxViT can reach 86.7% accuracy at resolution 512.
Overall, I don't think there's any urgency to add this for now.
I'll close this for now, as MaxViT might be enough.
Add in NVIDIA's new SOTA ViT. From here. The original code is under a non-commercial license, but the timm variant linked above is available for us.
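For reference, a minimal sketch of loading a pretrained model from timm and sanity-checking a forward pass. The model name below is an assumed placeholder (a MaxViT checkpoint that ships with timm); it would need to be swapped for the timm name of the NVIDIA ViT variant linked above once that is confirmed.

```python
# Sketch only: loads a pretrained timm model and runs a dummy forward pass.
# "maxvit_base_tf_224" is an assumed example name, not the NVIDIA ViT from the issue.
import torch
import timm

model_name = "maxvit_base_tf_224"  # placeholder; replace with the linked NVIDIA ViT variant
model = timm.create_model(model_name, pretrained=True)
model.eval()

# Dummy 224x224 RGB batch just to verify the output shape.
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(dummy)
print(logits.shape)  # expected: torch.Size([1, 1000])
```

To see which variants timm actually exposes, `timm.list_models("maxvit*")` (or the equivalent pattern for the NVIDIA model) lists the registered names.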