k2-fsa / icefall

https://k2-fsa.github.io/icefall/
Apache License 2.0

Distill #803

Closed didadida-r closed 1 year ago

didadida-r commented 1 year ago

Hi, I am using HuBERT distillation with a conv-emformer, but it crashes when training starts. Here is the detailed log.

WARNING:root:MonoCut acacda0e-6344-11ec-9812-525400127f85-3182: possibly mismatched duration between cut (3.1s) and temporal array in custom field 'codebook_indexes' (num_frames=154 * frame_shift=0.02 == duration=3.08).
WARNING:root:MonoCut 070655c4-6341-11ec-9b1e-525400cda5d2-10014: possibly mismatched duration between cut (4.16s) and temporal array in custom field 'codebook_indexes' (num_frames=207 * frame_shift=0.02 == duration=4.14).
WARNING:root:MonoCut 931c769a-5b38-11eb-a30b-525400ee5896-2688: possibly mismatched duration between cut (4.16s) and temporal array in custom field 'codebook_indexes' (num_frames=207 * frame_shift=0.02 == duration=4.14).
WARNING:root:MonoCut d392920a-e976-11eb-b047-525400f4a99f-4728: possibly mismatched duration between cut (4.16s) and temporal array in custom field 'codebook_indexes' (num_frames=207 * frame_shift=0.02 == duration=4.14).
2023-01-03 11:30:49,484 INFO [train.py:978] (1/2) Epoch 1, batch 0, loss[loss=4.594, simple_loss=9.189, pruned_loss=9.556, codebook_loss=93.57, over 1631.00 frames. utt_duration=363.1 frames, utt_pad_proportion=0.01862, over 18.00 utterances.], tot_loss[loss=4.594, simple_loss=9.189, pruned_loss=9.556, codebook_loss=93.57, over 1631.00 frames. utt_duration=363.1 frames, utt_pad_proportion=0.01862, over 18.00 utterances.], batch size: 18, lr: 3.00e-03
2023-01-03 11:30:49,485 INFO [train.py:978] (0/2) Epoch 1, batch 0, loss[loss=4.541, simple_loss=9.082, pruned_loss=9.423, codebook_loss=94.85, over 1672.00 frames. utt_duration=836.8 frames, utt_pad_proportion=0.02703, over 8.00 utterances.], tot_loss[loss=4.541, simple_loss=9.082, pruned_loss=9.423, codebook_loss=94.85, over 1672.00 frames. utt_duration=836.8 frames, utt_pad_proportion=0.02703, over 8.00 utterances.], batch size: 8, lr: 3.00e-03
WARNING:root:MonoCut 183d501e-9172-11eb-ab41-525400f66387-18006: possibly mismatched duration between cut (7.32s) and temporal array in custom field 'codebook_indexes' (num_frames=365 * frame_shift=0.02 == duration=7.3).
WARNING:root:MonoCut c267e594-e975-11eb-8124-525400f4a99f-3584: possibly mismatched duration between cut (14.48s) and temporal array in custom field 'codebook_indexes' (num_frames=723 * frame_shift=0.02 == duration=14.46).
WARNING:root:MonoCut 100b075c-f9c8-11ea-8ab3-525400a90b5a-14137: possibly mismatched duration between cut (10.76s) and temporal array in custom field 'codebook_indexes' (num_frames=537 * frame_shift=0.02 == duration=10.74).
concat_successive_codebook_indexes torch.Size([8, 212, 256]) torch.Size([8, 212, 16])
encoder_out torch.Size([8, 212, 256])
middle_layer_output torch.Size([8, 212, 256])
tensor(158591.0469, device='cuda:0', grad_fn=<CheckpointFunctionBackward>)
concat_successive_codebook_indexes torch.Size([18, 89, 256]) torch.Size([18, 89, 16])
encoder_out torch.Size([18, 89, 256])
middle_layer_output torch.Size([18, 89, 256])
tensor(152618.7812, device='cuda:1', grad_fn=<CheckpointFunctionBackward>)
Traceback (most recent call last):
  File "./conv_emformer_transducer_stateless2_hubert/train.py", line 1269, in <module>
    main()
  File "./conv_emformer_transducer_stateless2_hubert/train.py", line 1260, in main
    mp.spawn(run, args=(world_size, args), nprocs=world_size, join=True)
  File "/home/test/.conda/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/test/.conda/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "/home/test/.conda/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/test/.conda/envs/k2/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/test/code/asr/k2_asr/train/egs/ntes_dev/conv_emformer_transducer_stateless2_hubert/train.py", line 1168, in run
    train_one_epoch(
  File "/home/test/code/asr/k2_asr/train/egs/ntes_dev/conv_emformer_transducer_stateless2_hubert/train.py", line 918, in train_one_epoch
    loss, loss_info = compute_loss(
  File "/home/test/code/asr/k2_asr/train/egs/ntes_dev/conv_emformer_transducer_stateless2_hubert/train.py", line 769, in compute_loss
    simple_loss, pruned_loss, codebook_loss = model(
  File "/home/test/.conda/envs/k2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/test/.conda/envs/k2/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 994, in forward
    if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by 
making sure all `forward` function outputs participate in calculating loss. 
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 286 287 288 289 290 291
 In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
marcoyang1998 commented 1 year ago

It seems that you have some parameters that do not receive gradients. This will cause issues in PyTorch DDP training.

I am also doing distillation with conv-emformer and everything works well on my side. I will make a PR on conv_emformer_transducer_stateless2 today and I will keep you updated.
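
For reference, a minimal sketch of the two workarounds the error message itself suggests (not icefall-specific; model and rank are placeholders for your DDP setup):

import torch
import torch.nn as nn

# Option 1: wrap the model with unused-parameter detection enabled,
# so DDP tolerates parameters that produce no gradient.
model = nn.parallel.DistributedDataParallel(
    model,
    device_ids=[rank],
    find_unused_parameters=True,
)

# Option 2: keep the DDP wrapping unchanged and locate the culprit parameters:
#   export TORCH_DISTRIBUTED_DEBUG=DETAIL
# or check manually after loss.backward():
for name, p in model.named_parameters():
    if p.requires_grad and p.grad is None:
        print("no grad:", name)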

didadida-r commented 1 year ago

Thanks, after removing the dummy params the training procedure is okay now. Still looking forward to your PR.

marcoyang1998 commented 1 year ago

Please have a look at #808

didadida-r commented 1 year ago

thanks

didadida-r commented 1 year ago

@marcoyang1998 I am trying to apply the HuBERT distillation to a Chinese dataset. Unlike LibriSpeech, the result is worse than the baseline without HuBERT, and the codebook_loss cannot be optimized; it simply oscillates around 90, while the total_loss can be optimized. It seems that the codebook loss is harmful when applied to a Chinese dataset.

Do you have any advice, or have you ever tried applying HuBERT to a Chinese dataset?

marcoyang1998 commented 1 year ago

It seems that the codebook loss is harmful when applied to a Chinese dataset.

I assume you are using the fine-tuned version of the Fairseq HuBERT model? The fine-tuning data is English, so there could be a large domain mismatch; that could be the reason why it didn't work in your experiments.

Do you have any advice, or have you ever tried applying HuBERT to a Chinese dataset?

We've tried to apply MVQ KD on Chinese datasets. However, we were using a Chinese HuBERT model (https://github.com/TencentGameMate/chinese_speech_pretrain). This gives us some gain on WenetSpeech.

Another thing you could try is an un-fine-tuned version of the original Fairseq HuBERT. You can find the download link here: https://dl.fbaipublicfiles.com/hubert/hubert_xtralarge_ll60k.pt
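
In case it helps, here is a minimal sketch of loading that checkpoint with fairseq for feature extraction (this is not the icefall extraction script; the waveform and the choice of output layer are placeholders):

import torch
from fairseq import checkpoint_utils

ckpt = "hubert_xtralarge_ll60k.pt"  # the checkpoint linked above
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt])
hubert = models[0].eval()

wav = torch.randn(1, 16000)  # placeholder: 1 s of 16 kHz audio
with torch.no_grad():
    # extract_features returns (features, padding_mask); which transformer
    # layer to take as the teacher representation is an assumption here
    feats, _ = hubert.extract_features(source=wav, padding_mask=None, output_layer=36)
print(feats.shape)  # (1, num_frames, feature_dim)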

danpovey commented 1 year ago

I think the codebook loss normally stays quite large; it will not get close to 0. This is because there are many codebooks, and they have relatively high entropy.
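
As a rough back-of-the-envelope check (assuming the codebook loss is a sum of per-codebook cross-entropies and that each codebook has 256 entries, neither of which is stated in this thread), a uniform prediction already costs roughly the starting value seen in the log above:

import math

num_codebooks = 16   # matches the (..., 16) codebook-index shape in the log above
codebook_size = 256  # assumption: default codebook size
chance_level = num_codebooks * math.log(codebook_size)
print(chance_level)  # ~88.7 nats per frame for a uniform prediction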

didadida-r commented 1 year ago

It seems that the codebook loss is harmful when applied to a Chinese dataset.

I assume you are using the fine-tuned version of the Fairseq HuBERT model? The fine-tuning data is English, so there could be a large domain mismatch; that could be the reason why it didn't work in your experiments.

Do you have any advice, or have you ever tried applying HuBERT to a Chinese dataset?

We've tried to apply MVQ KD on Chinese datasets. However, we were using a Chinese HuBERT model (https://github.com/TencentGameMate/chinese_speech_pretrain). This gives us some gain on WenetSpeech.

Another thing you could try is an un-fine-tuned version of the original Fairseq HuBERT. You can find the download link here: https://dl.fbaipublicfiles.com/hubert/hubert_xtralarge_ll60k.pt

I am using the un-fine-tuned version of the original HuBERT, but it seems that the codebook loss cannot be optimized in the same direction as the total RNN-T loss in the cross-lingual setting. The CER degrades from 0.1 to 0.8 on different test sets.

"This gives us some gain on WenetSpeech.", it's that mean a. using 'wenetspeech hubert model' to 'wenetspeech asr task', or b. using 'wenetspeech hubert model' to 'aishell asr task'?

didadida-r commented 1 year ago

I think the codebook loss normally stays quite large; it will not get close to 0. This is because there are many codebooks, and they have relatively high entropy.

Hi Dan,

In the LibriSpeech result (https://huggingface.co/GuoLiyong/stateless6_baseline_vs_disstillation/tensorboard?scroll=1#scalars), the codebook loss decreases from 100 to 20.

In my cross-lingual result, the codebook loss only decreases very slowly, from 100 to 90.

marcoyang1998 commented 1 year ago

I am using the un-fine-tuned version of the original HuBERT

So you are using codebook indexes extracted on your own?

"This gives us some gain on WenetSpeech.", it's that mean a. using 'wenetspeech hubert model' to 'wenetspeech asr task', or b. using 'wenetspeech hubert model' to 'aishell asr task'?

I mean the first statement. We didn't test it on AISHELL.

The codebook loss normally stays at quite a high value. Here are some stats for your reference: Number of Codebooks: 16; Codebook loss: 100 -> 30 (after 3 epochs).

We recommend using a delta in streaming MVQ KD to address the limited future context of the streaming student model. You can find the implementation here: https://github.com/k2-fsa/icefall/pull/808/files#diff-06da385f1213f88bce406acabd71a1284016b9b21f4a790a36e2897bedba2e8c.

We also recommend using a smaller scale for the codebook loss. We normally set codebook-loss-scale to 0.005 and 0.01 for the 32- and 16-codebook setups, respectively.
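
For illustration only, a sketch of how such a scale enters the total loss; the values below are dummies and the real recipe may also weight or warm up the other terms:

import torch

# dummy stand-ins for the three losses returned by the model's forward()
simple_loss = torch.tensor(9.0)
pruned_loss = torch.tensor(9.5)
codebook_loss = torch.tensor(93.0)

codebook_loss_scale = 0.01  # 16-codebook setup, per the recommendation above
# the scale simply down-weights the codebook term relative to the transducer losses
loss = simple_loss + pruned_loss + codebook_loss_scale * codebook_loss
print(loss)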

didadida-r commented 1 year ago

thanks.

  1. Yeah, I recomputed the codebook indexes on my own data.
  2. Regarding "Number of Codebooks: 16": is 16 the number of codebooks used during extraction or during training? I ask because I noticed that in the LibriSpeech recipe, num_of_codebook is 8 during extraction and 16 during training.
  3. I will try adding the delta version and tuning the loss scale.
danpovey commented 1 year ago

I think the codebook loss normally stays quite large; it will not get close to 0. This is because there are many codebooks, and they have relatively high entropy.

Hi Dan,

In the LibriSpeech result (https://huggingface.co/GuoLiyong/stateless6_baseline_vs_disstillation/tensorboard?scroll=1#scalars), the codebook loss decreases from 100 to 20.

In my cross-lingual result, the codebook loss only decreases very slowly, from 100 to 90.

In one of our earlier experiments, there was a problem (a kind of bug) in the codebook index extraction that resulted in the indexes having much lower entropy than they should have. After fixing this, the change in codebook loss became much smaller (but perhaps not as small as 100 -> 90).

marcoyang1998 commented 1 year ago

Is 16 the number of codebooks used during extraction or during training?

Training. The HuBERT teacher has double the frame rate of the student encoder, so we concatenate every two codebook "frames", which doubles the number of codebooks during training.
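
A minimal sketch of that frame-pairing idea (the actual helper in the recipe, concat_successive_codebook_indexes, may handle padding and odd lengths differently):

import torch

def pair_codebook_frames(codebook_indexes: torch.Tensor) -> torch.Tensor:
    # codebook_indexes: (N, T, C) at the HuBERT frame rate, e.g. C = 8
    N, T, C = codebook_indexes.shape
    T = T // 2 * 2                      # drop a trailing odd frame, if any
    x = codebook_indexes[:, :T, :]
    # pair every two successive frames: halve the frame rate, double the codebooks
    return x.reshape(N, T // 2, 2 * C)  # e.g. (N, 2T', 8) -> (N, T', 16)

idx = torch.randint(0, 256, (8, 424, 8))
print(pair_codebook_frames(idx).shape)  # torch.Size([8, 212, 16])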

marcoyang1998 commented 1 year ago

Just curious, how are your distillation experiments going? Would you mind sharing your results? I am quite interested in cross-language distillation.