Closed. didadida-r closed this issue 1 year ago.
It seems that you have some params that do not receive grad. This will cause issues in PyTorch DDP training.
I am also doing distillation with conv-emformer and everything works well on my side. I will make a PR on conv_emformer_transducer_stateless2 today and will keep you updated.
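For reference, one quick way to spot such parameters (a generic PyTorch sketch, not code from the recipe) is to run a single forward/backward pass and list the parameters whose .grad is still None; as a temporary workaround, DDP can also be constructed with find_unused_parameters=True, at some speed cost:
```python
import torch

# Generic sketch: after one forward/backward pass on a single GPU,
# list the trainable parameters that never received a gradient.
def params_without_grad(model: torch.nn.Module):
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]

# loss.backward()
# print("Params with no grad:", params_without_grad(model))

# Workaround (slower): let DDP tolerate unused parameters.
# model = torch.nn.parallel.DistributedDataParallel(
#     model, device_ids=[rank], find_unused_parameters=True
# )
```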
Thanks, after removing the dummy params the training procedure is okay now. Still, looking forward to your PR.
Please have a look at #808
thanks
@marcoyang1998 I am trying to apply the HuBERT distillation to a Chinese dataset. Unlike LibriSpeech, the result is worse than the baseline without HuBERT, and the codebook_loss cannot be optimized; it just oscillates around 90, while the total_loss can be optimized. It seems that the codebook loss is harmful when applied to a Chinese dataset.
Do you have any advice, or have you ever tried applying HuBERT to a Chinese dataset?
It seems that the codebook loss is harmful when applied to a Chinese dataset.
I assume you are using the fine-tuned version of the Fairseq HuBERT model? The fine-tuning data is English, so there could be a large domain mismatch; that could be why it didn't work in your experiments.
Do you have any advice, or have you ever tried applying HuBERT to a Chinese dataset?
We've tried to apply MVQ KD on Chinese datasets. However, we were using a Chinese HuBERT model (https://github.com/TencentGameMate/chinese_speech_pretrain). This gives us some gain on WenetSpeech.
Another thing you could try is to use an un-fine-tuned version of the original Fairseq HuBERT. You may find the download link here: https://dl.fbaipublicfiles.com/hubert/hubert_xtralarge_ll60k.pt
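Roughly, the checkpoint can be loaded through fairseq and an intermediate transformer layer's output used as the teacher embedding for codebook-index extraction. A minimal sketch, assuming fairseq is installed; the path, the dummy input, and the output_layer value are placeholders, and the exact loading code in the recipe may differ:
```python
import torch
from fairseq import checkpoint_utils

# Sketch only: load the un-fine-tuned HuBERT and take an intermediate layer's output.
ckpt_path = "hubert_xtralarge_ll60k.pt"  # downloaded from the link above
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
hubert = models[0].eval().cuda()

wav = torch.randn(1, 16000 * 5).cuda()  # placeholder: 5 s of 16 kHz audio
with torch.no_grad():
    # output_layer is a recipe hyper-parameter, not a fixed value.
    feats, _ = hubert.extract_features(
        source=wav, padding_mask=None, mask=False, output_layer=36
    )
print(feats.shape)  # (1, T, embedding_dim); codebook indexes are extracted from these
```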
I think the codebook loss normally stays quite large; I don't think it will get close to 0. This is because there are many codebooks, and they have relatively high entropy.
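A back-of-the-envelope illustration of why the absolute value stays large (both numbers below are assumptions, not recipe facts: the loss is treated as a sum of per-codebook cross-entropies, and each codebook is assumed to have 256 entries):
```python
import math

num_codebooks = 16   # the number used in training (see the stats later in the thread)
codebook_size = 256  # assumed number of entries per codebook

# Cross-entropy of a predictor that knows nothing (uniform over the entries),
# summed over all codebooks. This is roughly where the loss starts:
print(num_codebooks * math.log(codebook_size))  # ~88.7

# So a loss that stays near 90 means the student has learned almost nothing
# about the codebook indexes, while a well-trained student only reduces each
# term to the codebooks' residual entropy, so the total stays far from 0.
```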
I am using the un-fine-tuned version of the original HuBERT, but it seems that the codebook loss cannot be optimized in the same direction as the total RNN-T loss in the cross-lingual setting. The CER degrades from 0.1 to 0.8 on different test sets.
"This gives us some gain on WenetSpeech.": does that mean (a) using the WenetSpeech HuBERT model for the WenetSpeech ASR task, or (b) using the WenetSpeech HuBERT model for the AISHELL ASR task?
Hi Dan,
In the LibriSpeech result (https://huggingface.co/GuoLiyong/stateless6_baseline_vs_disstillation/tensorboard?scroll=1#scalars), the codebook loss decreases from 100 to 20.
In my cross-lingual result, the codebook loss only decreases very slowly, from 100 to 90.
I am using the un-fine-tuned version of the original HuBERT
So you are using codebook indexes extracted on your own?
"This gives us some gain on WenetSpeech.", it's that mean a. using 'wenetspeech hubert model' to 'wenetspeech asr task', or b. using 'wenetspeech hubert model' to 'aishell asr task'?
I mean (a): using the WenetSpeech HuBERT model for the WenetSpeech ASR task. We didn't test it on AISHELL.
The codebook loss normally stays at a quite high value. Here are some stats for your reference:
Number of Codebooks: 16
Codebook loss: 100 -> 30 (after 3 epochs)
We recommend using a delta in streaming MVQ KD to address the limited future context of the streaming student model. You may find the implementation here: https://github.com/k2-fsa/icefall/pull/808/files#diff-06da385f1213f88bce406acabd71a1284016b9b21f4a790a36e2897bedba2e8c.
We also recommend using a smaller scale for the codebook loss. We normally set codebook-loss-scale to 0.005 and 0.01 for the 32- and 16-codebook setups, respectively.
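Conceptually (this is only a sketch of the idea, not the code in the PR; it assumes the codebook indexes come as an (N, T, num_codebooks) tensor and that frames set to an ignore index are excluded from the loss), the delta delays the teacher's codebook-index targets by a few frames, so the prediction for the teacher's frame t is made from a later student frame and therefore effectively sees some future context:
```python
import torch

# Sketch only, not the PR implementation.
def delay_codebook_targets(
    cb_indexes: torch.Tensor,  # (N, T, num_codebooks) teacher codebook indexes
    delta: int,
    ignore_index: int = -100,  # assumed value excluded from the loss
) -> torch.Tensor:
    """Shift the teacher targets `delta` frames later in time, so the student's
    prediction at frame t is matched against the teacher's frame t - delta."""
    if delta == 0:
        return cb_indexes
    shifted = torch.full_like(cb_indexes, ignore_index)
    shifted[:, delta:, :] = cb_indexes[:, :-delta, :]
    return shifted
```
Whether the shift is applied to the targets or to the student's embeddings is an implementation detail; see the PR diff above for the actual code.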
thanks.
Number of Codebooks: 16
Here, is 16 the number of codebooks used in extraction or in training? I ask because I notice that in the LibriSpeech recipe, num_of_codebook is 8 during extraction and 16 during training.
In one of our earlier experiments, there was a problem (a kind of bug) in the codebook index extraction that resulted in the indexes having much lower entropy than they should. After fixing this, the decrease in codebook loss during training became much smaller (though perhaps not as small as 100 -> 90).
Is 16 the number of codebooks used in extraction or in training?
Training. The HuBERT model has double the encoder frame rate (relative to the student encoder), so we concatenate every two codebook "frames", which doubles the number of codebooks in training.
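As a small illustration of that frame-rate conversion (the shapes are assumptions based on the numbers above: 8 codebooks at the HuBERT rate, 16 after pairing adjacent frames):
```python
import torch

# Sketch: pair adjacent teacher frames so the codebook targets match the
# student encoder's (halved) frame rate.
cb = torch.randint(0, 256, (100, 8))  # 100 teacher frames, 8 codebooks (extraction)
cb = cb[: (cb.size(0) // 2) * 2]      # drop a trailing odd frame, if any
cb = cb.reshape(-1, 16)               # 50 student frames, 16 codebooks (training)
print(cb.shape)                       # torch.Size([50, 16])
```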
Just curious, how are your distillation experiments going? Would you mind sharing your results? I am quite interested in cross-language distillation.
Hi, I am using HuBERT distillation with conv-emformer, but it crashes when training starts. Here is the detailed log.