Closed · KohakuBlueleaf closed this issue 1 year ago
Thanks for the question!
We did not specifically study ConvNeXtV2 in our work, but as you point out, ConvNeXtV1 aligned quite poorly with human evaluation. The GradCAM visualizations also show that it focuses on portions of the image that contain ImageNet classes rather than taking a more holistic view.
We did use the timm implementation of ConvNeXt, which also includes the V2 models (https://github.com/layer6ai-labs/dgm-eval/blob/master/dgm_eval/models/convnext.py), so switching to V2 (e.g. https://github.com/huggingface/pytorch-image-models/blob/main/timm/models/convnext.py#L689) is straightforward. We release all generated datasets and the human error rate for each generative model, which is sufficient to perform an equivalent analysis of ConvNeXtV2.
While all the data and code are available, your suggestion of a tutorial for measuring performance is a good one! I will look into putting one together and/or providing data related to ConvNeXtV2.
Here is a quick look at the relation between human error rate and FD_ConvNeXt-V2, in the same format as Figure 4 of our paper.
While there are some differences between ConvNeXt-V1 and ConvNeXt-V2, FD in neither representation space correlates strongly with human evaluation.
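For readers wanting to reproduce this kind of comparison, a minimal sketch of the Fréchet distance between two feature sets is below. This is the standard FD formula (fit a Gaussian to each set of features, then compare means and covariances), not the exact dgm-eval implementation; the function name is my own.

```python
# Sketch of the Frechet distance between two sets of encoder features:
#   FD = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
# This is the standard formula, not the dgm-eval code verbatim.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_a, feats_b):
    """FD between Gaussian fits of two (n_samples, n_features) arrays."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    covmean = sqrtm(cov_a @ cov_b)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 8))
print(frechet_distance(x, x))  # ~0 for identical feature sets
```

Plugging in features from any timm encoder (V1 or V2) gives the corresponding FD score to plot against the released human error rates.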
ConvNeXt V2 introduces FCMAE self-supervised pretraining and gains about 0.5~1.5% top-1 accuracy. I'm surprised that ConvNeXt performs so badly at this kind of task, and I'm wondering if ConvNeXtV2 gets better results?
Is it possible to provide some data related to ConvNeXtV2 (or a tutorial on how to measure its performance with the same data/method you used for the other models)?