jishengpeng / WavTokenizer

SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
MIT License

Performance in LLM-based-TTS #40

Open Liujingxiu23 opened 1 month ago

Liujingxiu23 commented 1 month ago

Has anyone trained this model and used it to train an LLM-based TTS? How is the performance? I mean both the reconstructed wav quality and the performance in zero-shot TTS.

jishengpeng commented 1 month ago

> Has anyone trained this model and used it to train an LLM-based TTS? How is the performance? I mean both the reconstructed wav quality and the performance in zero-shot TTS.

We found that, under fair comparison conditions, the speech synthesis quality of a single-layer WavTokenizer outperforms that of the 9-layer DAC in downstream autoregressive TTS models, with slight improvements in other text-to-speech aspects as well.
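One practical reason a single-codebook tokenizer helps a downstream autoregressive model is sequence length: the LM must emit one token per codebook per frame. A minimal sketch of that arithmetic, assuming WavTokenizer's reported ~40 frames/s with 1 codebook and a DAC configuration of roughly ~86 frames/s with 9 codebooks (these rates are assumptions taken from the respective papers' typical configurations, not measurements from this repo):

```python
# Rough sequence-length comparison for an autoregressive TTS model.
# The model must emit (frame_rate * duration) * n_codebooks discrete tokens.
# Rates below are assumed typical configurations, not measured here.

def tokens_for(duration_s: float, frame_rate: float, n_codebooks: int) -> int:
    """Total discrete tokens an AR model must generate for duration_s of audio."""
    return round(duration_s * frame_rate) * n_codebooks

duration = 10.0  # seconds of speech

wavtok = tokens_for(duration, frame_rate=40.0, n_codebooks=1)   # single-layer WavTokenizer
dac = tokens_for(duration, frame_rate=86.0, n_codebooks=9)      # assumed 9-layer DAC setup

print(f"WavTokenizer: {wavtok} tokens, DAC: {dac} tokens")
# The shorter sequence both speeds up AR decoding and avoids
# flattening/interleaving schemes needed for multi-codebook codecs.
```

Under these assumed rates, 10 s of audio is 400 tokens for WavTokenizer versus several thousand for a flattened 9-codebook codec, which is why multi-codebook systems typically need delay patterns or hierarchical decoding in the LM.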

Liujingxiu23 commented 1 month ago

Thank you for your reply!

But I also ran into the same problem as https://github.com/jishengpeng/WavTokenizer/issues/34: there are mispronunciations in the reconstructed wav, where phones can sound like similar phones. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this. Do you have any other ideas for solving this problem?

And do you think HiFi-GAN might be a better model for the decoder part?