jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Training on WenetSpeech couldn't converge #28

Open dyyoungg opened 2 months ago

dyyoungg commented 2 months ago

I compared two experimental data setups:

- Setting 1: WenetSpeech only (Chinese)
- Setting 2: WenetSpeech + GigaSpeech (about 1:1, Chinese + English)

It's interesting that the loss in setting 1 doesn't decrease normally (blue curve in the image below), while setting 2, mixed with English, converges normally. Have you observed this phenomenon in your experiments?

[Figure: training loss curves for the two settings]
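
For concreteness, here is a minimal sketch of how the two setups could be built, assuming the training data is specified as plain-text filelists of audio paths (one path per line); all filenames are placeholders:

```python
import random

def read_list(path):
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip()]

wenet = read_list("wenetspeech_train.txt")  # Chinese clips (placeholder name)
giga = read_list("gigaspeech_train.txt")    # English clips (placeholder name)

# Setting 1: WenetSpeech only.
with open("setting1_train.txt", "w") as f:
    f.write("\n".join(wenet) + "\n")

# Setting 2: ~1:1 Chinese/English mix (1:1 by clip count; matching by
# hours would need per-file durations), shuffled so batches interleave
# the two languages.
n = min(len(wenet), len(giga))
mixed = random.sample(wenet, n) + random.sample(giga, n)
random.shuffle(mixed)
with open("setting2_train.txt", "w") as f:
    f.write("\n".join(mixed) + "\n")
```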
jishengpeng commented 2 months ago

This situation is somewhat unusual. You may use a small amount of Chinese data (approximately 500 hours) to verify whether this issue always arises when the model is trained on purely Chinese data.
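
A minimal sketch of that verification step, assuming the same plain-text filelist format and using `soundfile` to read clip durations from file headers without decoding audio; the filenames and the 500-hour target are placeholders:

```python
import random
import soundfile as sf

TARGET_HOURS = 500.0

with open("wenetspeech_train.txt") as f:
    paths = [ln.strip() for ln in f if ln.strip()]
random.shuffle(paths)

# Accumulate randomly chosen clips until we reach ~500 hours.
subset, total_sec = [], 0.0
for p in paths:
    info = sf.info(p)
    total_sec += info.frames / info.samplerate
    subset.append(p)
    if total_sec >= TARGET_HOURS * 3600:
        break

print(f"kept {len(subset)} clips, {total_sec / 3600:.1f} h")
with open("wenetspeech_500h.txt", "w") as f:
    f.write("\n".join(subset) + "\n")
```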

wntg commented 2 months ago

I'm interested in Chinese-only training too. Do you have any further results?

boltzmann-Li commented 4 days ago

WenetSpeech could be too noisy; you may want to start with AIShell3 and then move to WenetSpeech4TTS.
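
One possible way to act on the noise concern before switching datasets is a crude per-clip SNR screen. This is only a sketch: the frame size and the 15 dB threshold are arbitrary assumptions to tune, and the filenames are placeholders.

```python
import numpy as np
import soundfile as sf

def rough_snr_db(path, frame=2048):
    """Rough SNR proxy: loudest vs. quietest frame energies."""
    x, _ = sf.read(path, dtype="float32", always_2d=False)
    if x.ndim > 1:
        x = x.mean(axis=1)  # downmix to mono
    if len(x) < frame:
        return 0.0  # too short to judge; treat as noisy
    n = (len(x) // frame) * frame
    frames = x[:n].reshape(-1, frame)
    energy = np.sort((frames ** 2).mean(axis=1) + 1e-12)
    noise = energy[: max(1, len(energy) // 10)].mean()    # quietest 10% ~ noise floor
    speech = energy[-max(1, len(energy) // 10):].mean()   # loudest 10% ~ speech
    return 10 * np.log10(speech / noise)

with open("wenetspeech_train.txt") as f:
    paths = [ln.strip() for ln in f if ln.strip()]

kept = [p for p in paths if rough_snr_db(p) > 15.0]
with open("wenetspeech_clean.txt", "w") as f:
    f.write("\n".join(kept) + "\n")
```

Clips shorter than one frame are dropped as noisy; a more careful approach would use a VAD or a DNSMOS-style quality model instead of this energy heuristic.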