hunto / DIST_KD

Official implementation of paper "Knowledge Distillation from A Stronger Teacher", NeurIPS 2022
Apache License 2.0

Strategy B3 #12

Open june6423 opened 3 months ago

june6423 commented 3 months ago

Greetings!

I read your paper with great interest and am trying to reproduce some of your experiments.

I want to reproduce your vanilla KD setting with strategies B1, B2, and B3 from your DIST_KD paper.

I found the B1 and B2 strategies in your strategies folder, but I couldn't find a B3 setting.

configs/strategies/deit/deit_tiny.yaml looks like it could be B3, but I'm not sure, hence my question.

Could you share the B3 setting for vanilla KD with temperature 4?

hunto commented 3 months ago

Hi @june6423 ,

Our B3 experiment on Swin Transformer was run with the original Swin-Transformer training code, so there is no B3 config in this repo.

Alternatively, if you want to implement B3 in this repo, the strategy is similar to deit_tiny; you can use the following config for KD (T=4):

aa: rand-m9-mstd0.5
batch_size: 128 # per GPU; x 8 GPUs = total batch size 1024
color_jitter: 0.4
decay_by_epoch: false
decay_epochs: 3
decay_rate: 0.967
# dropout
drop: 0.0
drop_path_rate: 0.2

epochs: 300
log_interval: 50
lr: 1.e-3
min_lr: 5.0e-06
model_ema: False
model_ema_decay: 0.999
momentum: 0.9
opt: adamw
opt_betas: null
opt_eps: 1.0e-08
clip_grad_norm: true
clip_grad_max_norm: 5.0

interpolation: 'bicubic'

# random erase
remode: pixel
reprob: 0.25

# mixup
mixup: 0.8
cutmix: 1.0
mixup_prob: 1.0
mixup_switch_prob: 0.5
mixup_mode: 'batch'

sched: cosine
seed: 42
warmup_epochs: 20
warmup_lr: 5.e-7
weight_decay: 0.04
workers: 16

# kd
kd: 'kd'
ori_loss_weight: 1.
kd_loss_weight: 1.
teacher_model: 'timm_swin_large_patch4_window7_224'
teacher_pretrained: True
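
For readers unfamiliar with how `ori_loss_weight`, `kd_loss_weight`, and T=4 interact, here is a minimal sketch of a vanilla KD objective. It assumes the common Hinton-style formulation (KL divergence between temperature-softened teacher and student distributions, scaled by T^2); the repo's actual `kd` loss may differ in details. The function names are hypothetical, the temperature is passed explicitly because the config above has no temperature key, and hard labels are used for simplicity even though the config enables mixup/cutmix, which would produce soft labels.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    lse = m + math.log(sum(math.exp(z - m) for z in scaled))
    return [z - lse for z in scaled]


def cross_entropy(student_logits, label):
    """Standard cross-entropy against a hard label (no mixup/cutmix here)."""
    return -log_softmax(student_logits)[label]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Assumed vanilla KD term: T^2 * KL(teacher_T || student_T)."""
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_p_t, log_p_s))
    return kl * temperature ** 2


def total_loss(student_logits, teacher_logits, label,
               ori_loss_weight=1.0, kd_loss_weight=1.0, temperature=4.0):
    """Weighted sum mirroring ori_loss_weight and kd_loss_weight in the config."""
    return (ori_loss_weight * cross_entropy(student_logits, label)
            + kd_loss_weight * kd_loss(student_logits, teacher_logits, temperature))


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    teacher = [3.0, 1.0, -2.0]
    print(total_loss(student, teacher, label=0))
```

With both weights set to 1, as in the config, the cross-entropy and distillation terms contribute equally. The KD term is zero when the student logits match the teacher logits.
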

june6423 commented 3 months ago

Thanks a lot!

Now I want to reproduce the results of other KD methods, including RKD and CRD (I am working on Table 5 of your DIST_KD paper, CIFAR-100).

However, I could not find the training configs or the code for training from scratch and for the other KD methods.

I am using image_classification_sota at version d9662f7.

Is there already published code for these settings, or should I implement them myself?

Thanks for your effort.