@HubertHirac Thank you for your attention and interest! We are pleased to provide you with our .yaml configuration for your implementation. You can access the configuration files through this link: Google Drive
Here are some key details regarding the configuration: we take the `layer_result` from the last 8 layers and then average them. Please feel free to let us know if you need any further clarification or assistance.
@P1ping Thank you very much for your reply! I have tried your recipe, but there is still some gap. Do you have your reproduced HuBERT Base teacher model? It would be very helpful for me in closing the gap.
@HubertHirac The HuBERT teacher model is uploaded here. You can try distillation with it.
@P1ping I'm sorry, but I still have questions. When I load the weights of the provided lighthubert_stage1.pt as the initial weights and the above teacher model as the teacher, the distillation L1 loss is high. I have some questions:
@HubertHirac Hi there! Thank you for your questions. Here are the answers:
The student's final representations are computed from the last layer's output (after the final layer norm) and are further projected by a linear layer. The teacher's representations are computed as follows: a) the representations of the last 8 layers are taken, and instance norm is applied to each of them; b) the average of the 8 layers is then taken, resulting in a single target sequence. Instance norm is only used before averaging.
Both lighthubert_stage1.pt and lighthubert_stage2.pt are pre-trained using the distillation process.
In the training phase, `final_proj` and `label_embs_concat` are not used. However, `layer_pred_heads.11.weight` is used to project the final representations.
I hope this clarifies your questions. Let me know if you need any further information!
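Putting the points above together, here is a rough sketch of the two computations in plain PyTorch. This is only an illustration under the stated assumptions: the function and variable names are not from the actual codebase, and the exact axis over which instance norm is applied is a guess.

```python
import torch
import torch.nn.functional as F


def teacher_target(layer_results: list) -> torch.Tensor:
    """Build the distillation target from the teacher's last 8 layers.

    layer_results: per-layer hidden states, each of shape (batch, time, dim).
    Each of the last 8 layers is instance-normalized (here over the time axis,
    treating the feature dimension as channels -- one plausible reading), and
    the normalized layers are then averaged into a single target sequence.
    """
    normed = [
        F.instance_norm(h.float().transpose(1, 2)).transpose(1, 2)
        for h in layer_results[-8:]
    ]
    return torch.stack(normed, dim=0).mean(dim=0)


def student_prediction(final_hidden: torch.Tensor, proj: torch.nn.Linear) -> torch.Tensor:
    """Project the student's final-layer output (taken after the final layer norm).

    `proj` is a stand-in for the `layer_pred_heads.11` linear layer mentioned above.
    """
    return proj(final_hidden)
```

The student's projected output at the masked positions would then be regressed onto this target, which matches the loss described further down in the thread.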
@P1ping Hi, thanks for your reply! I have two more details to confirm:
@HubertHirac Hi~ Here are the choices we made:
For calculating the loss, we directly use `loss = F.l1_loss(x.float(), y.float(), reduction="mean")`, where `x` and `y` are the student's and teacher's representations at the masked positions.
During the training phase, we set normalize=True, which, as you mentioned, can cause a mismatch for the teacher model. However, we believe that this mismatch has only minor effects. This is because a normalized waveform can be seen as a special case of an unnormalized waveform.
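For concreteness, here is a minimal sketch of that loss computation under the setup described above (again only an illustration; the mask handling and the tensor names are assumptions, not code from the repository):

```python
import torch
import torch.nn.functional as F


def masked_distill_loss(student: torch.Tensor,
                        teacher: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Mean L1 loss between student and teacher representations at masked frames.

    student, teacher: (batch, time, dim) sequences -- the student output already
    projected, and the teacher target already instance-normed and averaged, as
    described earlier in the thread.
    mask: (batch, time) boolean tensor marking the masked positions.
    """
    x = student[mask]  # (num_masked_frames, dim)
    y = teacher[mask]
    return F.l1_loss(x.float(), y.float(), reduction="mean")
```

As for `normalize=True`: in fairseq this means the raw waveform is layer-normalized before being fed to the model (to the best of my understanding), which is the mismatch referred to above.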
@P1ping Thank you for your reply! With your suggestion, I now get a loss closer to yours, but the loss still descends differently. In your Stage 1 checkpoint the best loss is about 0.35, while in my training run it reached 0.005... This results in a 7.34 WER on the SUPERB ASR downstream task. I'd like to know whether this is due to dropout or a fairseq version problem. Have you used any data augmentation method? Have you used dropout between the encoder and the final linear layer? And can you still find the commit id of your fairseq base code, or the date you pulled it? That would be useful for better reproduction. Thank you!
@HubertHirac Here is the Stage 1 best loss against the number of updates:

| num_updates | best_loss |
| ----------- | --------- |
| 400000 | 0.343 |
| 389462 | 0.344 |
| 369163 | 0.354 |
| 367228 | 0.354 |
| 292798 | 0.355 |
Question 1. Have you used any data augmentation method? Answer: No
Question 2. Have you used dropout between encoder and the final linear layer? Answer: No
Question 3. And can you still find the commit id of your fairseq base codes or pull date of it? Answer: hash:41847528fbcc13e901259207e3ca0ef8ddbb1573
Hi, with your help I have reproduced the results of lighthubert stage1 and stage2 (with ASR WER of 5.73 and 8.7). Thank you for your kind help! So I will close this issue here.
Hi, I'm trying to reproduce lighthubert_stage1 and lighthubert_small, but got a big performance gap... Could you please share more details of your training process (such as the learning rate, scheduler, or loss function code) for the stage 1 and stage 2 training?
Thank you very much