microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing

Speech2C "Inf detected in output" while training #22

Closed Sreyan88 closed 1 year ago

Sreyan88 commented 1 year ago

Hello!

Thank You for the great work again! I tried to train Speech2C and got this error after 49 epochs:

[2022-11-07 00:33:16,340][fairseq.nan_detector][WARNING] - Inf detected in output of , shape: torch.Size([1464, 505]), forward

Some training details: dataset: libri 360; k-means trained on: libri 100.

config: https://drive.google.com/file/d/1Ms5m-cuTrv43xsntHBdM_PEWaXtGGMOR/view?usp=sharing
hydra_log: https://drive.google.com/file/d/1HWvXqUGhNU-LnKNRj52HAbXPR-GqOVBU/view?usp=sharing

Could you please let me know if this has ever happened in your training setup, or if you know where I am going wrong?

Thank You!

Ajyy commented 1 year ago

Hi, thanks for your attention.

This is a gradient overflow problem. I think the reason may be that your max_token is too small: the default max_token is about 2800k, and we usually use 16 GPUs for training. You may need to increase max_token, set update_freq to a larger value, or use more GPUs for training.
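For context, the warning above is the symptom of this overflow: with fp16 training, activations blow up to inf and fairseq's nan_detector flags them in the forward output. A minimal PyTorch sketch of that kind of check, using an illustrative forward hook rather than fairseq's actual nan_detector code:

```python
import torch
import torch.nn as nn

def make_finite_check(name):
    """Forward hook that warns when a module's output contains inf/nan."""
    def hook(module, inputs, output):
        if torch.is_tensor(output) and not torch.isfinite(output).all():
            print(f"Inf detected in output of {name}, shape: {tuple(output.shape)}")
    return hook

# Illustrative module standing in for the model's final projection.
layer = nn.Linear(505, 505)
layer.register_forward_hook(make_finite_check("final_proj"))

x = torch.randn(1464, 505)
x[0, 0] = float("inf")   # simulate an activation that overflowed earlier
_ = layer(x)             # prints: Inf detected in output of final_proj, shape: (1464, 505)
```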

Sreyan88 commented 1 year ago

Thank You so much for your reply.

1) I see your default max_token in the config is 1400k, which is the same as the original HuBERT. Mine is slightly lower at 1000k. Do you think even this small decrease makes a difference?

2) I see you haven't set any update_freq explicitly in your config. What should I set it to?

3) Another change from your default setup is that I am using mfcc labels instead of the HuBERT intermediate-layer labels used in the original Speech2C. I am not sure if that makes a big difference. However, I would like to ask which quantizer setup you trained your model on: (a) HuBERT trained for only the 1st iteration (on mfcc), or (b) HuBERT trained for both iterations (mfcc + intermediate)?

Thank You again!

Ajyy commented 1 year ago

Hi, the batch size is the number of GPUs x max_token x update_freq. So the default setting of Speech2C is 32 x 1400k x 1, which is equal to 16 x 2800k x 1.

Back to your problem: you are using only 4 GPUs with a max_token of 1000k, so your batch size is 4 x 1000k x 1, which is much smaller than the default. You may need to adjust these three values to make the batch size large enough, which should alleviate the gradient overflow problem.
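To make the arithmetic concrete, here is a small sketch using the numbers from this thread (the helper function is purely illustrative):

```python
def effective_batch_tokens(num_gpus, max_tokens, update_freq=1):
    """Effective batch size in tokens: GPUs x max_token x update_freq."""
    return num_gpus * max_tokens * update_freq

default_setup = effective_batch_tokens(32, 1_400_000)   # 44.8M tokens (== 16 x 2800k x 1)
current_setup = effective_batch_tokens(4, 1_000_000)    # 4.0M tokens, roughly 11x smaller

# One way to roughly match the default on 4 GPUs: keep max_token at 1400k
# and raise update_freq to 8, since 4 x 1400k x 8 == 32 x 1400k x 1.
matched_setup = effective_batch_tokens(4, 1_400_000, update_freq=8)
assert matched_setup == default_setup
```

In fairseq's hydra configs these knobs usually correspond to dataset.max_tokens, optimization.update_freq, and distributed_training.distributed_world_size, though the exact keys may differ in your setup, so check your config.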

For the third question, the quantizer setup of our model is "(b) HuBERT trained for both iterations (mfcc + intermediate)".

Hope this can help you.

Sreyan88 commented 1 year ago

Thank You for your reply. I am also guessing the quantizer might have an effect (beyond the batch size), since your added decoder might rely on strong quantizer cues: the original HuBERT was trained for two iterations, while Speech2C starts from a strong quantizer and trains only once.

I am also curious why keeping distributed_world_size at 32 did not throw any error, even though I had only 4 GPUs in my system.

I will investigate these and update you here! Please keep this issue open till then!