Open Liujingxiu23 opened 1 month ago
Has anyone trained this model and used it to train an LLM-based TTS? How is the performance? I mean both waveform quality and zero-shot TTS performance.
We found that, under fair comparison conditions, a single-layer WavTokenizer outperforms the 9-layer DAC in speech synthesis quality when used in downstream autoregressive TTS models, with slight improvements in other text-to-speech aspects as well.
Thank you for your reply!
But I am also running into the same problem as https://github.com/jishengpeng/WavTokenizer/issues/34: there are mispronunciations in the reconstructed waveform, and phones may sound like similar phones. Training longer (3 epochs -> 5 epochs) does not seem to alleviate this. Do you have any other ideas on how to solve it?
Also, do you think HiFi-GAN might be a better model for the decoder part?
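For reference, here is the minimal reconstruction check I am using to listen for these phone substitutions. It is a sketch based on the repo's README inference example; the paths are placeholders, and method names such as `from_pretrained0802` and `encode_infer` are taken from the README and may differ across versions:

```python
# Minimal reconstruction check (paths are placeholders).
import torch
import torchaudio
from encoder.utils import convert_audio
from decoder.pretrained import WavTokenizer

device = torch.device("cpu")
config_path = "./configs/wavtokenizer_config.yaml"   # placeholder
model_path = "./wavtokenizer_checkpoint.ckpt"        # placeholder
audio_inpath = "./test_utterance.wav"                # placeholder
audio_outpath = "./test_utterance_recon.wav"         # placeholder

# Load the pretrained tokenizer (method name follows the README example).
wavtokenizer = WavTokenizer.from_pretrained0802(config_path, model_path).to(device)

# Load and resample the input to 24 kHz mono, as the released models expect.
wav, sr = torchaudio.load(audio_inpath)
wav = convert_audio(wav, sr, 24000, 1).to(device)

# Encode to discrete tokens, then decode back to a waveform.
bandwidth_id = torch.tensor([0])
features, discrete_code = wavtokenizer.encode_infer(wav, bandwidth_id=bandwidth_id)
audio_out = wavtokenizer.decode(features, bandwidth_id=bandwidth_id)

# Save the reconstruction and compare it with the original by ear,
# listening for the phone substitutions described above.
torchaudio.save(audio_outpath, audio_out.cpu(), sample_rate=24000,
                encoding="PCM_S", bits_per_sample=16)
```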