As mentioned in the SpeechT5 paper: "We pre-train the proposed SpeechT5 model on 32 V100 GPUs with a batch size of around 90s samples per GPU for speech and 12k tokens per GPU for text and set the update frequency to 2 for 500k steps." So, keep pre-training. For TTS fine-tuning, the model pre-trained without $\mathcal{L}_{mlm}^s$ is more suitable, because as mentioned in the paper: "The proposed SpeechT5 trained without $\mathcal{L}_{mlm}^s$ is considered because the bidirectional masked prediction loss is proposed to help the encoder learn to encode the speech signal, and this variant achieves superior Naturalness, as shown in Table 13 (in Appendix D)."
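As a rough illustration (not from the thread) of how that recipe maps onto fairseq training flags, a sketch along these lines could be a starting point. The `--user-dir`, `--task`, and `--criterion` values and the paths are assumptions, and the batching options ("around 90s of speech per GPU", "12k text tokens per GPU") are repo-specific and omitted here, so the repository's own pre-training script should be treated as the source of truth.

```python
# Hedged sketch: launching SpeechT5-style pre-training with fairseq using the
# schedule quoted from the paper (update frequency 2, 500k updates, 32 GPUs).
# The user-dir/task/criterion names and paths are placeholders/assumptions.
import subprocess

DATA_ROOT = "/path/to/pretrain/data"      # placeholder
SAVE_DIR = "/path/to/checkpoints"         # placeholder

cmd = [
    "fairseq-train", DATA_ROOT,
    "--save-dir", SAVE_DIR,
    "--user-dir", "SpeechT5/speecht5",    # assumption: repo module path
    "--task", "speecht5",                 # assumption: task name in the repo
    "--criterion", "speecht5",            # assumption: criterion name in the repo
    "--update-freq", "2",                 # update frequency 2 (paper)
    "--max-update", "500000",             # 500k steps (paper)
    "--distributed-world-size", "32",     # 32 GPUs (paper)
    "--fp16",
]
subprocess.run(cmd, check=True)
```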
Thanks for the reply. Does `num_updates` in the log mean the step count? If so, it takes about 2 hours per 100 steps in the picture, so pre-training would take around 10,000 hours? Also, can I use an English pre-trained model to fine-tune a model for another language? Would that work?
10000 hours seems so long. Actually, pre-training on the 32 V100 GPUs cost around one week. So pre-training using multiple gpu is recommended. The fine-tuning on the other languages is available by replace the English vocabulary to the fine-tuned vocabulary, but it causes language mismatch between pre-training and fine-tuning, which may influence the performance of the pre-training method.
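As a quick sanity check of the numbers discussed above (assuming `num_updates` counts optimizer steps and the observed throughput stays roughly constant), the gap between the single-GPU estimate and the paper's reported wall clock looks like this:

```python
# Back-of-the-envelope comparison of the timings mentioned in this thread.
steps_total = 500_000                 # pre-training updates from the paper
observed_hours_per_step = 2 / 100     # ~2 h per 100 steps in the user's log

single_gpu_hours = steps_total * observed_hours_per_step        # 10,000 h
print(f"at the observed speed: {single_gpu_hours:,.0f} h "
      f"(~{single_gpu_hours / 24:.0f} days)")

# The paper reports roughly one week on 32 V100 GPUs for the same schedule.
paper_hours = 7 * 24
print(f"paper (32 x V100): ~{paper_hours} h, i.e. roughly "
      f"{single_gpu_hours / paper_hours:.0f}x faster than the observed setup")
```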
Thanks for the reply, I will try to use more GPUs. One more question: during pre-training, num_workers is 0. Why not set it to a higher number, as in TTS fine-tuning?
Can I set it to a higher number to speed up pre-training?
When I set num_workers=1, I get an error like: RuntimeError: unable to mmap 408 bytes from file : Cannot allocate
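That `unable to mmap` error with DataLoader workers is a generic PyTorch multiprocessing symptom rather than anything SpeechT5-specific. One commonly suggested workaround (an assumption that it applies to this exact case) is to switch the tensor-sharing strategy and raise the open-file limit before the dataloaders are built:

```python
# Hedged sketch: common mitigation for mmap/"cannot allocate" errors raised by
# DataLoader worker processes. Whether it fixes this particular setup is an
# assumption; it only changes how tensors are shared between processes.
import resource
import torch.multiprocessing as mp

# Use the file_system sharing strategy instead of the default file_descriptor one.
mp.set_sharing_strategy("file_system")

# Raise the soft open-file limit up to the hard limit (Linux).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```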
Excuse me, what value should my pre-training loss reach before I can start fine-tuning TTS?
I found that my fine-tuned TTS model can generate a mel-spectrogram, but it differs from the original mel-spectrogram a lot.
Is this because the bart loss is too high?
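For what it's worth, a minimal sketch for comparing the generated and reference mel-spectrograms directly (assuming both are saved as `(n_mels, n_frames)` NumPy arrays; the file names below are placeholders):

```python
# Hedged sketch: side-by-side visual comparison plus a rough frame-wise L1
# distance between a generated and a reference mel-spectrogram.
import numpy as np
import matplotlib.pyplot as plt

gen = np.load("generated_mel.npy")    # placeholder path
ref = np.load("reference_mel.npy")    # placeholder path

fig, axes = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
for ax, mel, title in zip(axes, (gen, ref), ("generated", "reference")):
    im = ax.imshow(mel, origin="lower", aspect="auto")
    ax.set_title(title)
    ax.set_ylabel("mel bin")
    fig.colorbar(im, ax=ax)
axes[-1].set_xlabel("frame")
plt.tight_layout()
plt.show()

# Mean absolute difference over the overlapping frames: a crude number to
# track during fine-tuning (lower means closer to the reference).
n = min(gen.shape[1], ref.shape[1])
print("mean |gen - ref|:", np.abs(gen[:, :n] - ref[:, :n]).mean())
```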