Several questions about model training

omglet1 commented 10 months ago

Hi, @hunto. Thanks for answering my previous questions (https://github.com/hunto/DiffKD/issues/3). Your work is very meaningful and could bring new ideas to knowledge distillation. This led me to try adapting your code to other computer vision tasks, such as human pose estimation. However, I still have many questions about your paper and model, and they have caused me quite a few problems.

My questions are as follows:

1. When the diffusion model processes teacher or student features, does it apply any preprocessing such as normalization to them? I ask because some diffusion model code I have seen preprocesses the model input, but I have not found anything similar in your code or paper.
2. In your paper, you state that a 1x1 convolution is used to make the number of channels of the student features match that of the latent teacher features, and I found this operation in your code. However, after the diffusion model denoises the student features, should we use another 1x1 convolution to restore the original number of student feature channels, or should we modify the settings of the student head directly?
3. Although my implementation does denoise the student features, the noise adaptive matching module does not seem to work. Its output γ does not decrease during training: it reaches roughly 1 within the first epoch and stays at 1 thereafter. Have you encountered a similar situation? Does this mean that the effectiveness of the module may depend on the task or dataset?

BeiDaoya commented 6 months ago

Hello, how can I tell whether the student features have been denoised successfully?

hunto commented 6 months ago

@omglet1,

1. I think it works both with and without normalization, since the original features (images) in diffusion are not required to be Gaussian.
2. The transformed student features are used only for distillation; the original dimensions in the student are not changed.
3. The value of gamma depends on your task and models. If the teacher and student features have a similar amount of noise, gamma should be close to 1, since no additional noise needs to be added.
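
To make the second and third points concrete, here is a minimal sketch in plain Python (no framework, so it runs without dependencies). It is not taken from the DiffKD code: the helper names `project_channels` and `mix_noise` are hypothetical, and the mixing rule `gamma * z + (1 - gamma) * noise` is an assumption chosen to match the description above, where gamma close to 1 means no additional noise is added. A feature is represented as a list of channel values at a single spatial location.

```python
import random


def project_channels(feature, weight):
    """Apply a 1x1 convolution at one spatial location.

    A 1x1 convolution is a linear map over channels, so at a single
    location it reduces to a matrix-vector product.
    feature: list of C_in values; weight: C_out rows of C_in values.
    """
    return [sum(w * x for w, x in zip(row, feature)) for row in weight]


def mix_noise(z_stu, gamma, rng):
    """Assumed noise-adaptive mixing: gamma * z + (1 - gamma) * eps."""
    return [gamma * z + (1.0 - gamma) * rng.gauss(0.0, 1.0) for z in z_stu]


rng = random.Random(0)

# Student feature with 2 channels; the student head keeps using this as-is.
student_feature = [0.5, -1.0]

# Hypothetical 1x1 projection to 3 channels, used only in the distillation branch.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z_stu = project_channels(student_feature, weight)

# gamma == 1 adds no noise: the mixed feature equals the projected feature.
unchanged = mix_noise(z_stu, 1.0, rng)

# gamma < 1 blends Gaussian noise into the feature before denoising.
noisier = mix_noise(z_stu, 0.7, rng)
```

Under these assumptions, `student_feature` is never modified, so the student head needs no extra 1x1 convolution, and a gamma that stays at 1 simply means the mixing step passes the projected feature through unchanged.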

hunto commented 6 months ago

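> Hello, how can I tell whether the student features have been denoised successfully?

@BeiDaoya You can refer to the visualizations in the paper: visualize the original student features, the denoised student features, and the teacher features, and compare how similar they are.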
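
As a numeric complement to the visual comparison, one could also compare each student feature with the teacher feature using cosine similarity. The sketch below is only an illustration and not something the paper or code prescribes: the feature values are toy numbers, `cosine_similarity` is a hypothetical helper, and it assumes the features have already been flattened to the same length (for example, after the channel projection).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy flattened features; in practice these would come from the models.
teacher = [1.0, 2.0, 3.0, 4.0]
student_raw = [1.5, 1.0, 3.5, 2.5]
student_denoised = [1.1, 1.9, 3.0, 3.8]

sim_raw = cosine_similarity(student_raw, teacher)
sim_denoised = cosine_similarity(student_denoised, teacher)
print(f"raw student vs teacher:      {sim_raw:.4f}")
print(f"denoised student vs teacher: {sim_denoised:.4f}")
```

If denoising helps, the denoised student should score higher against the teacher than the raw student does, averaged over many samples rather than a single one.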
