open-mmlab / mmrazor

OpenMMLab Model Compression Toolbox and Benchmark.
https://mmrazor.readthedocs.io/en/latest/
Apache License 2.0

When performing knowledge distillation (e.g., using the CWD algorithm), how do I determine the distillation modules for the two models? #197

Open lb-hit opened 2 years ago

lb-hit commented 2 years ago

For example, the teacher model is Faster R-CNN and the student model is YOLOv3. Where can I find out what modules the models have? When I write a random module name, I get a KeyError.

lb-hit commented 2 years ago

[two screenshots attached]

HIT-cwh commented 2 years ago

Hi! It may be a little hard to use Faster R-CNN as a teacher to distill a YOLOv3 student. To the best of my knowledge, there is no available algorithm that can handle this case, since the number and output shapes of the FPN layers differ, and so do the classification scores from the detection heads. Maybe you can first try using Faster R-CNN R101 as a teacher to distill a Faster R-CNN R50 student and see what happens.
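For reference, the shape mismatch is easy to check directly. A minimal sketch, assuming an MMDetection 2.x environment and the stock configs shipped with mmdet (the config paths and dummy input sizes here are illustrative):

```python
import torch
from mmcv import Config
from mmdet.models import build_detector

def neck_shapes(config_path, img_size=(1, 3, 800, 800)):
    """Build a detector and return the shapes of its multi-scale neck features."""
    model = build_detector(Config.fromfile(config_path).model)
    model.eval()
    with torch.no_grad():
        feats = model.extract_feat(torch.randn(img_size))
    return [tuple(f.shape) for f in feats]

# Matching level counts and channel widths are what make R101 -> R50 easy:
print(neck_shapes('configs/faster_rcnn/faster_rcnn_r101_fpn_1x_coco.py'))
print(neck_shapes('configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'))
# whereas YOLOv3's neck emits a different number of levels with different widths:
print(neck_shapes('configs/yolo/yolov3_d53_mstrain-608_273e_coco.py', (1, 3, 608, 608)))
```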

lb-hit commented 2 years ago

@HIT-cwh Thank you for your answer! Seeing the HIT in your ID, I have a feeling we may be alumni. I noticed in the original CWD paper that the teacher and student networks can have different structures; it's just that different teachers were chosen for the one-stage and two-stage detectors. From what you say, am I to understand that distillation cannot be done between one-stage and two-stage networks? Another question: when using the same Faster R-CNN structure for distillation, as you suggested, how do I determine which modules to distill? Is there a script file in MMClassification for viewing them?

HIT-cwh commented 2 years ago

It's possible to distill between one-stage and two-stage networks, ref to https://arxiv.org/abs/2207.02039. For example, with a MaskRCNN-Swin detector as the teacher, ResNet-50 based RetinaNet and FCOS achieve 41.5% and 43.9% mAP on COCO2017 in this paper. However, the architecture differences between yolov3 and frcnn are too large and we haven't found an algorithm to work with yet. Typically, the distillation is usually conducted among multi-scale intermediate features such as FPN features.
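As for finding which modules those features come from (this also explains the KeyError above): the valid names are exactly the dotted paths that PyTorch's `named_modules()` reports, so you can simply print them. A quick sketch, under the same assumptions as the snippet above:

```python
from mmcv import Config
from mmdet.models import build_detector

model = build_detector(
    Config.fromfile('configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py').model)

# Every name printed here is a valid module path; anything else raises a KeyError.
for name, _ in model.named_modules():
    if name.startswith('neck'):
        print(name)  # e.g. neck.fpn_convs.0.conv, neck.fpn_convs.1.conv, ...
```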

lb-hit commented 2 years ago

@HIT-cwh Thank you for your answer. The config file needs to be filled in with the names of the distillation modules for the teacher and the student, so how is the name of the module corresponding to the intermediate features determined? Also, I would like to ask a question about model search: MMRazor provides several search methods, such as DARTS, DetNAS, SPOS, and AutoSlim. Are any of them applicable to compressing the object-detection model DETR?
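On the config side, the dotted names from `named_modules()` are what go into the distiller's `components` list. A minimal sketch modeled on the CWD configs in this repo (mmrazor 0.x style; the module names, `tau`, and `loss_weight` values here are illustrative, not tuned):

```python
# Distill one FPN level of a Faster R-CNN R101 teacher into an R50 student.
algorithm = dict(
    type='GeneralDistill',
    architecture=dict(
        type='MMDetArchitecture',
        model={{_base_.model}},  # the student, inherited from the base config
    ),
    distiller=dict(
        type='SingleTeacherDistiller',
        teacher=dict(
            type='mmdet.FasterRCNN',
            # ... full teacher definition, usually initialized from a checkpoint
        ),
        teacher_trainable=False,
        components=[
            dict(
                student_module='neck.fpn_convs.3.conv',
                teacher_module='neck.fpn_convs.3.conv',
                losses=[
                    dict(
                        type='ChannelWiseDivergence',
                        name='loss_cwd_fpn3',
                        tau=1,
                        loss_weight=5,
                    ),
                ]),
            # ... one dict per additional FPN level to distill
        ]),
)
```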

rainyNighti commented 1 year ago

Hello, I also encountered the same problem. When the teacher and student models are different, how do I replace these two intermediate modules, or where in the code should I modify them? [screenshot attached]
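For anyone landing here: broadly speaking, the distiller resolves `student_module`/`teacher_module` by their dotted paths and records their outputs with forward hooks, which is also why a name that doesn't appear in `named_modules()` fails with a KeyError. A stripped-down illustration of that mechanism in plain PyTorch (not mmrazor's actual code; `student` and `teacher` are assumed to be detectors built as in the snippets above):

```python
import torch.nn as nn

def record_output(model: nn.Module, module_name: str, store: dict):
    """Register a forward hook that stores the output of the named submodule."""
    module = dict(model.named_modules())[module_name]  # KeyError if misspelled

    def hook(_module, _inputs, output):
        store[module_name] = output

    return module.register_forward_hook(hook)

# Capture the same FPN level from both models during their forward passes,
# then feed the (student, teacher) feature pair into the distillation loss.
student_feats, teacher_feats = {}, {}
record_output(student, 'neck.fpn_convs.3.conv', student_feats)
record_output(teacher, 'neck.fpn_convs.3.conv', teacher_feats)
```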