ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

lisosia commented 1 year ago

In a nutshell

ConvNeXt with support for pre-training via MAE (masked autoencoder).

Paper link

https://arxiv.org/abs/2301.00808

Authors / affiliations

KAIST, Meta AI and New York University

Submission date (yyyy/MM/dd)

2023/01

Overview

Makes MAE-style pre-training possible for CNNs.

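The convolution must not see the masked regions, and this is achieved with sparse convolution (sparse conv). In terms of behavior, it simply convolves without adding the masked positions into the sum.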
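The behavior described above can be sketched in a few lines. This is only an illustration in plain Python, not the sparse-convolution library used in the actual release: the function name `masked_conv2d`, the 3x3 sum kernel, and the toy 4x4 input are all hypothetical. It assumes a single-channel image, a square odd-sized kernel, and zero padding at the border.

```python
def masked_conv2d(image, mask, kernel):
    """Convolve only the visible pixels of a single-channel 2D image.

    mask[i][j] is 1 for a visible position and 0 for a masked one.
    Masked inputs contribute nothing to any sum, and masked output
    positions stay zero (same-size output, zero padding at the border).
    """
    height, width = len(image), len(image[0])
    k = len(kernel)  # square kernel with odd size assumed
    pad = k // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            if not mask[i][j]:
                continue  # no feature is computed at a masked position
            total = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    inside = 0 <= y < height and 0 <= x < width
                    if inside and mask[y][x]:
                        total += kernel[di][dj] * image[y][x]
            out[i][j] = total
    return out


image = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
mask = [[1, 1, 0, 0] for _ in range(4)]   # hide the right half of the image
kernel = [[1.0] * 3 for _ in range(3)]    # plain 3x3 sum filter

out = masked_conv2d(image, mask, kernel)
print(out[1][1])  # sums only the six visible neighbors
```

Changing any pixel under the mask leaves `out` unchanged, which is the property the pre-training needs: no information leaks from the masked regions into the encoder features.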

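This alone does not work well, but adding a GRN (Global Response Normalization) layer to normalize the features prevents feature collapse (a state in which the feature maps are saturated or dead), and then it works.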

Model structure: a CNN encoder followed by a small decoder that is used only during pre-training
feature collapse
GRN
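A minimal sketch of what a GRN layer computes, following the three steps as I understand them from the paper: a per-channel L2 norm over all spatial positions, division by the mean of those norms across channels, and a learnable scale and shift with a residual connection. It is written in plain Python on nested lists for a single H x W x C feature map; the function name `grn` and the toy input are illustrative, and this is not the reference implementation.

```python
import math


def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization over one H x W x C feature map.

    x           : nested lists indexed as x[h][w][c]
    gamma, beta : per-channel scale and shift (lists of length C)
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])

    # 1) Global aggregation: L2 norm of each channel over all positions.
    gx = [
        math.sqrt(sum(x[h][w][c] ** 2
                      for h in range(height) for w in range(width)))
        for c in range(channels)
    ]

    # 2) Divisive normalization: each norm relative to the mean norm.
    mean_gx = sum(gx) / channels
    nx = [g / (mean_gx + eps) for g in gx]

    # 3) Calibration plus residual: gamma * (x * nx) + beta + x.
    return [
        [
            [gamma[c] * x[h][w][c] * nx[c] + beta[c] + x[h][w][c]
             for c in range(channels)]
            for w in range(width)
        ]
        for h in range(height)
    ]


# Toy 1 x 2 x 2 feature map: channel 0 is strong, channel 1 is weak.
x = [[[3.0, 0.3], [4.0, 0.4]]]
y = grn(x, gamma=[1.0, 1.0], beta=[0.0, 0.0])
print(y)
```

With `gamma` and `beta` both zero the function returns `x` unchanged, so the layer can start out as an identity mapping. With a non-zero `gamma`, channels whose global norm is above the channel average are amplified more than the others, which increases the contrast between channels.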

Novelty / differences

Method

Results

https://github.com/facebookresearch/ConvNeXt-V2/issues/9#issuecomment-1373354812

"For the experiments in our paper, we used JAX and TPU for pre-training (see some notes in appendix A.2). This pytorch release relies on an external library (MinkowskiEngine) for the Sparse Conv implementation, the cuda kernel is not optimized (thus the 50% GPU utilization) which makes pre-training slower."

lisosia commented 1 year ago

https://huggingface.co/timm/convnextv2_tiny.fcmae has a comparison of v1 and v2. Grepping for in22k_*in1k gives the rows below; v2 generally scores a higher top1 than the v1 model of the same size and resolution.

However, the speed measurements are for "eager model PyTorch 1.13 on RTX 3090 w/ AMP", so how the models compare when deployed with TensorRT or similar is unknown.

model                                     top1    top5    img_size  param_count  gmacs   macts   samples_per_sec  batch_size
convnextv2_huge.fcmae_ft_in22k_in1k_512   88.848  98.742  512       660.29       600.81  413.07  28.58            48
convnextv2_huge.fcmae_ft_in22k_in1k_384   88.668  98.738  384       660.29       337.96  232.35  50.56            64
convnextv2_large.fcmae_ft_in22k_in1k_384  88.196  98.532  384       197.96       101.1   126.74  128.94           128
convnext_xlarge.fb_in22k_ft_in1k_384      87.75   98.556  384       350.2        179.2   168.99  124.85           192
convnextv2_base.fcmae_ft_in22k_in1k_384   87.646  98.422  384       88.72        45.21   84.49   209.51           256
convnext_large.fb_in22k_ft_in1k_384       87.476  98.382  384       197.77       101.1   126.74  194.66           256
convnextv2_large.fcmae_ft_in22k_in1k      87.26   98.248  224       197.96       34.4    43.13   376.84           256
convnext_xlarge.fb_in22k_ft_in1k          87.002  98.208  224       350.2        60.98   57.5    368.01           256
convnext_base.fb_in22k_ft_in1k_384        86.796  98.264  384       88.59        45.21   84.49   366.54           256
convnextv2_base.fcmae_ft_in22k_in1k       86.74   98.022  224       88.72        15.38   28.75   624.23           256
convnext_large.fb_in22k_ft_in1k           86.636  98.028  224       197.77       34.4    43.13   581.43           256
convnext_base.fb_in22k_ft_in1k            85.822  97.866  224       88.59        15.38   28.75   1037.66          256
convnext_small.fb_in22k_ft_in1k_384       85.778  97.886  384       50.22        25.58   63.37   518.95           256
convnextv2_tiny.fcmae_ft_in22k_in1k_384   85.112  97.63   384       28.64        13.14   39.48   491.32           256
convnext_small.fb_in22k_ft_in1k           84.562  97.394  224       50.22        8.71    21.56   1478.29          256
convnext_tiny.fb_in22k_ft_in1k_384        84.084  97.14   384       28.59        13.14   39.48   862.95           256
convnextv2_tiny.fcmae_ft_in22k_in1k       83.894  96.964  224       28.64        4.47    13.44   1452.72          256
convnextv2_nano.fcmae_ft_in22k_in1k_384   83.37   96.742  384       15.62        7.22    24.61   801.72           256
convnext_tiny.fb_in22k_ft_in1k            82.898  96.616  224       28.59        4.47    13.44   2480.88          256
convnextv2_nano.fcmae_ft_in22k_in1k       82.03   96.166  224       15.62        2.46    8.37    2300.18          256
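To make the v1 vs v2 comparison concrete, the sketch below pairs models of the same size at 224px and reports the top1 gain and the throughput ratio. The numbers are copied from the table above; the helper names `size_of` and `compare_v1_v2` are hypothetical, and only the sizes that have both a v1 and a v2 row at 224px (tiny, base, large) are included.

```python
# (model, top1, samples_per_sec) copied from the 224px rows of the table.
ROWS = [
    ("convnextv2_large.fcmae_ft_in22k_in1k", 87.26, 376.84),
    ("convnext_large.fb_in22k_ft_in1k", 86.636, 581.43),
    ("convnextv2_base.fcmae_ft_in22k_in1k", 86.74, 624.23),
    ("convnext_base.fb_in22k_ft_in1k", 85.822, 1037.66),
    ("convnextv2_tiny.fcmae_ft_in22k_in1k", 83.894, 1452.72),
    ("convnext_tiny.fb_in22k_ft_in1k", 82.898, 2480.88),
]


def size_of(model_name):
    """Return the size tag ('tiny', 'base', ...) from a model name."""
    arch = model_name.split(".")[0]  # e.g. 'convnextv2_tiny'
    return arch.split("_")[1]


def compare_v1_v2(rows):
    """Pair rows by size; return {size: (top1 gain, throughput ratio)}."""
    v1, v2 = {}, {}
    for name, top1, speed in rows:
        target = v2 if name.startswith("convnextv2_") else v1
        target[size_of(name)] = (top1, speed)
    return {
        size: (v2[size][0] - v1[size][0], v2[size][1] / v1[size][1])
        for size in v2
        if size in v1
    }


for size, (gain, ratio) in compare_v1_v2(ROWS).items():
    print(f"{size}: top1 {gain:+.3f}, throughput x{ratio:.2f}")
```

On these rows v2 gains roughly 0.6 to 1.0 points of top1 while running at about 0.6x to 0.65x the v1 throughput, so the accuracy improvement comes with a noticeable speed cost in this eager-mode benchmark.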

lisosia commented 1 year ago

https://github.com/facebookresearch/ConvNeXt-V2

"This project is released under the MIT license except ImageNet pre-trained and fine-tuned models which are licensed under a CC-BY-NC. Please see the LICENSE file for more information."

Note that the weights are licensed for non-commercial use only.
