layer6ai-labs / dgm-eval

Codebase for the evaluation of deep generative models, as presented in *Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models*

ConvNeXt: v1 vs v2 #2

KohakuBlueleaf closed this issue 1 year ago

KohakuBlueleaf commented 1 year ago

ConvNeXt V2 introduces FCMAE self-supervised pretraining and gains roughly 0.5~1.5% top-1 accuracy over V1. I'm surprised that ConvNeXt performs so poorly at this kind of task, and I'm wondering whether ConvNeXt V2 gives better results.

Would it be possible to provide some data related to ConvNeXt V2 (or a tutorial on how to measure its performance with the same data/method you used for the other models)?

georgestein commented 1 year ago

Thanks for the question!

We did not specifically study ConvNeXt V2 in our work, but as you point out, ConvNeXt V1 performed quite poorly at aligning with human evaluation. The GradCAM visualizations also show that it focuses on portions of the image that contain ImageNet classes rather than taking a more holistic view.

We did use the timm implementation of ConvNeXt, which also includes V2 models (https://github.com/layer6ai-labs/dgm-eval/blob/master/dgm_eval/models/convnext.py), so changing to V2 (e.g. https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/convnext.py#L689) is straightforward. We release all generated datasets and the human error rate for each generative model, which is sufficient to perform an equivalent analysis of ConvNeXt V2.
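For example, here is a minimal sketch of pulling pooled ConvNeXt V2 features through timm outside of the dgm-eval encoder wrapper; the checkpoint name is an assumption (any `convnextv2_*` variant available in timm should work), and this is not the exact configuration used in the repo:

```python
# Minimal sketch (not part of dgm-eval): extract pooled ConvNeXt-V2 features with timm.
import timm
import torch

model = timm.create_model(
    "convnextv2_large.fcmae_ft_in22k_in1k",  # assumed V2 checkpoint name
    pretrained=True,
    num_classes=0,  # num_classes=0 makes timm return pooled features instead of logits
)
model.eval()

# Preprocessing that matches the checkpoint's training configuration
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

# For real images, apply `transform` to each PIL image and stack into a batch;
# here a random tensor of the expected shape stands in for a preprocessed batch.
batch = torch.randn(2, *data_config["input_size"])
with torch.no_grad():
    features = model(batch)  # shape: (2, feature_dim), usable as a representation space for FD
print(features.shape)
```

Feeding the resulting features for real and generated images into the FD computation would then give an FD_ConvNeXt-V2 analogous to the FD_ConvNeXt reported in the paper.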

While all data and code are available, your suggestion of a tutorial for measuring performance is a good one! I will look into putting one together, and/or providing data related to ConvNeXt V2.

georgestein commented 1 year ago

Here is a quick look at the relation between human error rate and FD_ConvNeXt-V2, in the same format as Figure 4 of our paper.

While there are some differences between ConvNeXt V1 and ConvNeXt V2, FD in both representation spaces does not strongly correlate with human evaluation.

[Figure: human error rate vs. FD_ConvNeXt-V2, in the same format as Figure 4 of the paper]