Question about the hyper-parameters used in other KD methods on different cases

megvii-research / mdistiller

The official implementation of [CVPR2022] Decoupled Knowledge Distillation https://arxiv.org/abs/2203.08679 and [ICCV2023] DOT: A Distillation-Oriented Trainer https://openaccess.thecvf.com/content/ICCV2023/papers/Zhao_DOT_A_Distillation-Oriented_Trainer_ICCV_2023_paper.pdf


Question about the hyper-parameters used in other KD methods on different cases #34

Open ZaberKo opened 1 year ago

ZaberKo commented 1 year ago

First of all, thank you for the excellent work. We are currently attempting to reproduce the performance of various KD methods, including FitNet, RKD, CRD, ReviewKD, and others, as detailed in the DKD paper. We have a question regarding the hyperparameters used on CIFAR-100 for the different KD methods. Specifically, we are curious about the values used across different teachers and students for these KD methods (except DKD). Would you mind posting these hyperparameters🥰?

Zzzzz1 commented 1 year ago

The hyperparameters should be the same between different teacher-student pairs. We simply reported the results in CRD's original paper.
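One way to read this reply, as a minimal sketch: each KD method has a single set of hyperparameters, and that set is reused unchanged for every teacher-student pair. Everything below is an assumption for illustration. The key names are hypothetical, the pair names are examples, and the values are left as `None` because the thread does not state them. It is not code from this repo.

```python
from itertools import product

# Hypothetical per-method settings. The thread does not give the actual
# values, so None is a placeholder to be filled in from each method's source.
METHOD_HPARAMS = {
    "FitNet": {"feat_loss_weight": None},
    "RKD": {"distance_weight": None, "angle_weight": None},
    "CRD": {"crd_loss_weight": None},
}

# Illustrative CIFAR-100 teacher-student pairs (names are examples only).
PAIRS = [
    ("resnet56", "resnet20"),
    ("wrn_40_2", "wrn_16_2"),
    ("resnet32x4", "shufflenet_v1"),
]


def build_runs(method_hparams, pairs):
    """Return one run per (method, pair); hparams depend only on the method."""
    runs = []
    for (method, hparams), (teacher, student) in product(method_hparams.items(), pairs):
        runs.append({
            "method": method,
            "teacher": teacher,
            "student": student,
            "hparams": dict(hparams),  # copy, so runs do not share one dict
        })
    return runs


runs = build_runs(METHOD_HPARAMS, PAIRS)
```

Under this reading, only the method decides the hyperparameters; the teacher and student names never enter the lookup.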

ZaberKo commented 1 year ago

> The hyperparameters should be the same between different teacher-student pairs. We simply reported the results in CRD's original paper.

@Zzzzz1 Thanks for the reply, this is very helpful. I have another question. Given that the OFD performance on ShuffleNet is reported in the paper, why are the ShuffleNet pieces needed for OFD (e.g., get_bn_before_relu) not implemented in this repo? Would you mind explaining whether there are any concerns with it?

Zzzzz1 commented 1 year ago

Sorry about that. We didn't test the code with all pairs, so the ShuffleNet get_bn_before_relu function for OFD is missing. We will fix that.

ufestkc commented 1 year ago

> First of all, thank you for the excellent work. We are currently attempting to reproduce the performance of various KD methods, including FitNet, RKD, CRD, ReviewKD, and others, as detailed in the DKD paper. We have a question regarding the hyperparameters used on CIFAR-100 for the different KD methods. Specifically, we are curious about the values used across different teachers and students for these KD methods (except DKD). Would you mind posting these hyperparameters🥰?

Please let me know if you now have the values of these hyperparameters. Also, the author replied, 'The hyperparameters should be the same between different teacher-student pairs.' Does that mean the same set of hyperparameters is used for all experiments?