microsoft / SpeechT5

Unified-Modal Speech-Text Pre-Training for Spoken Language Processing
MIT License

pretrain loss #56

Open MarsMeng1994 opened 1 year ago

MarsMeng1994 commented 1 year ago

Excuse me, what value should my pre-training loss reach before I can start fine-tuning TTS?

[screenshot: pre-training loss log]

I found that my fine-tuned TTS model can generate a mel-spectrogram, but it differs greatly from the original mel-spectrogram.

[screenshot: generated vs. original mel-spectrograms]

Is this because the BART loss is too high?

mechanicalsea commented 1 year ago

As mentioned in the SpeechT5 paper: "We pre-train the proposed SpeechT5 model on 32 V100 GPUs with a batch size of around 90s samples per GPU for speech and 12k tokens per GPU for text and set the update frequency to 2 for 500k steps." So keep pre-training. For TTS fine-tuning, pre-training without $\mathcal{L}_{mlm}^s$ is more suitable, because, as mentioned in the paper, "The proposed SpeechT5 trained without $\mathcal{L}_{mlm}^s$ is considered because the bidirectional masked prediction loss is proposed to help the encoder learn to encode the speech signal, and this variant achieves superior Naturalness, as shown in Table 13 (in Appendix D)."
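
For a rough sense of the scale those numbers imply, here is a quick back-of-the-envelope sketch (my own illustration, not code from the repo; it assumes the per-GPU figures quoted above and even packing across GPUs):

```python
# Rough scale of one optimizer update during SpeechT5 pre-training,
# using the numbers quoted from the paper (assumption: batches pack evenly).
n_gpus = 32
speech_sec_per_gpu = 90       # ~90 s of speech per GPU per batch
text_tokens_per_gpu = 12_000  # ~12k text tokens per GPU per batch
update_freq = 2               # gradients accumulated over 2 batches

speech_per_update = n_gpus * speech_sec_per_gpu * update_freq  # seconds of speech
text_per_update = n_gpus * text_tokens_per_gpu * update_freq   # text tokens

print(f"~{speech_per_update / 3600:.1f} hours of speech per update")  # ~1.6 h
print(f"~{text_per_update:,} text tokens per update")                 # ~768,000
```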

MarsMeng1994 commented 1 year ago

Thanks for the reply. Does num_updates in the log mean steps? If so, each 100 steps takes 2 hours in the screenshot, so pre-training would take 10,000 hours? Also, can I use an English pre-trained model to fine-tune a model for another language? Would that work?

mechanicalsea commented 1 year ago

10,000 hours seems too long. In practice, pre-training on 32 V100 GPUs took around one week, so pre-training on multiple GPUs is recommended. Fine-tuning on other languages is possible by replacing the English vocabulary with the fine-tuning vocabulary, but it introduces a language mismatch between pre-training and fine-tuning, which may reduce the benefit of the pre-training.
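
To put numbers on that, a rough estimate (assuming num_updates counts optimizer steps, the observed rate of 2 hours per 100 steps, and idealized linear scaling across GPUs, which real multi-GPU training only approximates):

```python
# Wall-clock estimate for 500k updates, assuming num_updates == optimizer steps
# and (optimistically) linear scaling with the number of GPUs.
total_updates = 500_000
hours_per_100_updates = 2  # observed rate on the single-GPU setup above

single_gpu_hours = total_updates / 100 * hours_per_100_updates  # 10,000 h
print(f"1 GPU : ~{single_gpu_hours:,.0f} hours (~{single_gpu_hours / 24 / 365:.1f} years)")

for n_gpus in (8, 32):
    hours = single_gpu_hours / n_gpus
    print(f"{n_gpus} GPUs: ~{hours:,.0f} hours (~{hours / 24:.0f} days)")
# 32 GPUs gives ~313 h (~13 days), the same order of magnitude as the
# "around one week" reported for the paper's 32-GPU setup.
```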

MarsMeng1994 commented 1 year ago

Thanks for the reply, I will try to use more GPUs. One more question: during pre-training, num_workers is 0. Why not set it to a higher number, as in TTS fine-tuning?

[screenshot: pre-training config with num_workers=0]

Can I set it to a higher number to accelerate pre-training?

When I set num_workers=1, I get an error like: RuntimeError: unable to mmap 408 bytes from file : Cannot allocate
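
For reference, this kind of mmap / "Cannot allocate" failure in DataLoader worker processes is often caused by PyTorch's default file_descriptor tensor-sharing strategy hitting the process's open-file limit; a common general-PyTorch workaround (not something from the SpeechT5 recipes) looks roughly like this:

```python
# Common workaround for "unable to mmap ... Cannot allocate memory" raised by
# DataLoader worker processes: the default "file_descriptor" sharing strategy
# keeps one open file descriptor per shared tensor and can exhaust the ulimit.
# Switching to "file_system" avoids that (at the cost of temp files in /dev/shm).
import torch.multiprocessing as mp

mp.set_sharing_strategy("file_system")

# Alternatively, raise the open-file soft limit for the current process:
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```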