mechanicalsea / lighthubert

LightHuBERT: Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT

Reproduction of LightHubert #6

Closed HubertHirac closed 1 year ago

HubertHirac commented 1 year ago

Hi, I'm trying to reproduce lighthubert_stage1 and lighthubert_small, but I'm seeing a big performance gap... Could you please share more details of your training process (such as the learning rate, scheduler, or loss function code) for stage 1 and stage 2 training?

Thank you very much

P1ping commented 1 year ago

@HubertHirac Thank you for your attention and interest! We are pleased to provide you with our .yaml configuration for your implementation. You can access the configuration files through this link: Google Drive

Here are some key details regarding the configuration:

Please feel free to let us know if you need any further clarification or assistance.

HubertHirac commented 1 year ago

@P1ping Thank you very much for your reply! I have tried your recipe, but there are still some gaps. Do you have your reproduced HuBERT base teacher model? It would help me a lot in closing the gaps.

P1ping commented 1 year ago

@HubertHirac The HuBERT teacher model is uploaded here. You can try distillation with it.

HubertHirac commented 1 year ago

@P1ping I'm sorry, but I still have questions. When I load the provided lighthubert_stage1.pt weights as initialization and use the above teacher model as the teacher, the distillation L1 loss is high. I have some questions:

  1. Is the L1 loss computed between the output of the final linear layer (layer_output) and the average of the last k teacher layers? And are the teacher layer outputs taken after instance norm (as with group_norm_target_layer=True in data2vec)?
  2. Are lighthubert_stage1.pt and lighthubert_small.pt pre-trained models, or have they been fine-tuned?
  3. I found that final_proj and label_embs_concat are also in the released checkpoints; are they used in the training phase?
P1ping commented 1 year ago

@HubertHirac Hi there! Thank you for your questions. Here are the answers:

  1. For the student, the final representations are taken from the last layer's output (after the final layer norm) and then projected by a linear layer. As for the teacher's representations, they are computed as follows: a) the outputs of the last 8 layers are taken, and instance norm is applied to each of them; b) these 8 normalized layers are then averaged, yielding a single target sequence. Instance norm is only applied before averaging (see the sketch after this list).

  2. Both lighthubert_stage1.pt and lighthubert_stage2.pt are pre-trained using the distillation process.

  3. In the training phase, final_proj and label_embs_concat are not used. However, layer_pred_heads.11.weight is utilized to project the final representations.
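
A minimal sketch of the teacher target computation described above, written here for illustration rather than taken from the released code (tensor shapes and the function name are assumptions):

    import torch
    import torch.nn.functional as F

    def build_teacher_target(layer_outputs, k=8):
        """Average the last k teacher layers after per-layer instance norm.

        layer_outputs: list of tensors, each of shape (B, T, C).
        Returns a (B, T, C) target for the L1 distillation loss.
        """
        normalized = []
        for x in layer_outputs[-k:]:
            # F.instance_norm expects (B, C, T): normalize each channel over time
            x = F.instance_norm(x.transpose(1, 2).float()).transpose(1, 2)
            normalized.append(x)
        # average the normalized layers into a single target sequence
        return torch.stack(normalized).mean(dim=0)

The student side would then be its last-layer output (after the final layer norm) passed through the linear projection mentioned above, with the loss computed only at the masked positions.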

I hope this clarifies your questions. Let me know if you need any further information!

HubertHirac commented 1 year ago

@P1ping Hi, thanks for your reply! I have two more details to confirm:

  1. Does the L1 loss have a scale factor? Is it like data2vec's:

         loss = F.l1_loss(x.float(), y.float(), reduction="none").sum(dim=-1)
         sample_size = loss.numel()
         loss = loss.sum() / math.sqrt(x.size(-1))

     Or just an L1 loss with no scaling:

         loss = F.l1_loss(x.float(), y.float())
         sample_size = loss.numel()
         loss = loss.sum()
  2. I found that the released teacher model uses normalize=False, while the LightHuBERT model is trained with normalize=True. Does this influence the final results?
P1ping commented 1 year ago

@HubertHirac Hi~ Here are the choices we made:

  1. For calculating the loss, we directly use loss = F.l1_loss(x.float(), y.float(), reduction="mean"), where x and y are the student and teacher representations at the masked positions (see the sketch after this list).

  2. During the training phase, we set normalize=True, which, as you mentioned, can cause a mismatch for the teacher model. However, we believe that this mismatch has only minor effects. This is because a normalized waveform can be seen as a special case of an unnormalized waveform.
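
A minimal sketch of this masked L1 objective, assuming a boolean frame mask (the name mask_indices is illustrative, not from the released code):

    import torch.nn.functional as F

    def masked_distill_loss(student_out, teacher_target, mask_indices):
        """L1 distillation loss over masked frames only.

        student_out, teacher_target: (B, T, C) representations.
        mask_indices: (B, T) boolean tensor marking masked frames.
        """
        x = student_out[mask_indices]      # (num_masked, C) student reps
        y = teacher_target[mask_indices]   # (num_masked, C) teacher targets
        return F.l1_loss(x.float(), y.float(), reduction="mean")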

HubertHirac commented 1 year ago

@P1ping Thank you for your reply! With your suggestions, my loss is now closer to yours, but the loss still descends differently. In your Stage 1 checkpoint the best loss is about 0.35, while in my training it went down to 0.005... This leads to 7.34 WER on the SUPERB ASR downstream task. I'd like to know whether this is due to dropout or to the fairseq version. Have you used any data augmentation method? Have you used dropout between the encoder and the final linear layer? And can you still find the commit id of your fairseq base code, or the date it was pulled? That would help me reproduce the results. Thank you!

mechanicalsea commented 1 year ago

@HubertHirac Here is the Stage 1 best loss at different numbers of updates (num_updates: best_loss):

  400000: 0.343
  389462: 0.344
  369163: 0.354
  367228: 0.354
  292798: 0.355

Question 1. Have you used any data augmentation method? Answer: No

Question 2. Have you used dropout between encoder and the final linear layer? Answer: No

Question 3. Can you still find the commit id of your fairseq base code or the date it was pulled? Answer: hash 41847528fbcc13e901259207e3ca0ef8ddbb1573

HubertHirac commented 1 year ago

Hi, with your help I have reproduced the results of LightHuBERT stage 1 and stage 2 (with ASR WER 5.73 and 8.7). Thank you for your kind help! I'm closing this issue.