keyu-tian / SparK

[ICLR'23 Spotlight🔥] The first successful BERT/MAE-style pretraining on any convolutional network; PyTorch impl. of "Designing BERT for Convolutional Networks: Sparse and Hierarchical Masked Modeling"
https://arxiv.org/abs/2301.03580
MIT License
1.41k stars 82 forks

Self-supervised pretraining results on small models #59

Closed · leoxxxxxD closed 9 months ago

leoxxxxxD commented 10 months ago

We tried self-supervised pretraining on small models, and the results were worse than supervised training. The Gold-YOLO paper reaches a similar conclusion: the smaller the model, the smaller the gain. What is your view on applying SparK to small models?

keyu-tian commented 10 months ago
  1. Very small models (or models with special operators) may not benefit much from masked modeling, because their supervised pretraining may still be underfitting, leaving little room for the advantage of self-supervision to show.
  2. It is also possible that the supervised and self-supervised checkpoints have different optimal hyperparameters when finetuning on the specific downstream task.
  3. It may also depend on the type of downstream task.
leoxxxxxD commented 9 months ago

One more question: when pretraining small models with self-supervision, the loss sometimes jumps up suddenly. Have you run into anything similar?

```
"cur_ep": "28/1600", "last_L": 0.5888837552416637
"cur_ep": "29/1600", "last_L": 0.5947143313255203
"cur_ep": "30/1600", "last_L": 0.8912972437866619
"cur_ep": "31/1600", "last_L": 0.6176579332809323
"cur_ep": "32/1600", "last_L": 0.5972666382733802
"cur_ep": "33/1600", "last_L": 0.5940532513269771
"cur_ep": "34/1600", "last_L": 0.5805482207277741
```

keyu-tian commented 9 months ago

I don't recall seeing this. Are you using fp16? My guess is that the batch size or learning rate may be too large.

leoxxxxxD commented 9 months ago

We are not using fp16. The batch size is around 1000, smaller than the default 4096, and the learning rate is the one computed by your code.

leoxxxxxD commented 9 months ago

Could you release the training log of your ResNet-50?

keyu-tian commented 9 months ago

If your dataset is significantly smaller than ImageNet, a batch size of 1000 may be too large for it.

Here is the per-iteration loss during our 1600-epoch ResNet-50 pretraining: [loss curve image]

leoxxxxxD commented 9 months ago

Have you compared the results of 400 or 800 epochs against 1600?

keyu-tian commented 9 months ago

See the ablation section of our paper. You might also try --base_lr=1e-4; our default of 2e-4 may be too large for your dataset.
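
For reference, a minimal sketch of the linear scaling rule that repos like SparK commonly use to turn --base_lr and the global batch size into the actual learning rate; the reference batch size of 256 is an assumption here, not read from the repo's code:

```python
# Hedged sketch: linear lr scaling, assuming a reference batch size of 256.
# The exact formula and constant may differ from SparK's actual arg parsing.
def scaled_lr(base_lr: float, global_batch_size: int, ref_batch_size: int = 256) -> float:
    """Scale the base learning rate linearly with the global batch size."""
    return base_lr * global_batch_size / ref_batch_size

print(scaled_lr(2e-4, 4096))  # default setup in this thread -> 3.2e-3
print(scaled_lr(2e-4, 1000))  # user's batch size            -> 7.8125e-4
print(scaled_lr(1e-4, 1000))  # with the suggested base_lr   -> 3.90625e-4
```

Under this rule, the smaller batch size already yields a proportionally smaller effective learning rate, which is why lowering --base_lr further is only a suggestion to try, not a certain fix.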

leoxxxxxD commented 9 months ago

The dataset is the same ImageNet; only the model is a MobileNet-level network. Would you still suggest lowering the learning rate?

keyu-tian commented 9 months ago

I'm not sure; you could try tuning it both up and down. Also, if the network contains special operators, you may need to define their sparse forms manually, because https://github.com/keyu-tian/SparK/blob/main/pretrain/encoder.py#L39-L110 only defines sparse forms for conv2d, maxpooling, avgpooling, bn2d, syncbn, and layernorm. For example, a linear layer in the middle of the network also needs a sparse form: the input contains zeros at masked positions, and after the linear layer 0 + bias becomes non-zero, so the output must be masked back to zero. A sketch of this idea follows below.
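
To illustrate the point about linear layers, here is a minimal, hypothetical sketch of such a sparse form. It is not SparK's actual implementation: the real encoder.py tracks the active mask globally, while here `active` is passed explicitly to keep the example self-contained.

```python
import torch
import torch.nn as nn

class SparseLinear(nn.Linear):
    # Hypothetical sketch, not SparK's actual code: a Linear whose output is
    # re-zeroed at masked positions. Without this, an all-zero (masked) input
    # row would come out as `bias`, leaking non-zero values into regions that
    # are supposed to stay masked.
    def forward(self, x: torch.Tensor, active: torch.Tensor) -> torch.Tensor:
        # x:      (B, N, C_in) token features, zeros at masked positions
        # active: (B, N, 1)    binary mask, 1 = visible token, 0 = masked
        out = super().forward(x)  # at masked positions: 0 @ W.T + b == b
        return out * active       # mask the output back to zero

# usage sketch
B, N, C_in, C_out = 2, 196, 64, 128
active = (torch.rand(B, N, 1) > 0.6).float()  # keep ~40% of tokens visible
x = torch.randn(B, N, C_in) * active          # simulate masked input
y = SparseLinear(C_in, C_out)(x, active)
assert (y[(active == 0).expand_as(y)] == 0).all()  # masked outputs stay zero
```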

puyiwen commented 1 week ago

@leoxxxxxD Hi, I have a small model like MobileNet. Did you reach a conclusion on whether small models are suitable for training with self-supervised methods such as MAE? Thank you!