Thoughts on subsequent sound quality optimization

theodorblackbird / lina-speech

lina-speech : linear attention based text-to-speech


Thoughts on subsequent sound quality optimization #8

Open ScottishFold007 opened 1 month ago

ScottishFold007 commented 1 month ago

I trained some Chinese models with your work and noticed something: the emotional expressiveness is quite good, but high-frequency information is lacking. I think too much information is lost in the tokens, so something like flow matching (e.g., Matcha-TTS) is needed to compensate for part of it. What do you think?

theodorblackbird commented 1 month ago

Interesting, can you share samples? Vocos is probably not well suited here: I found that it generalizes poorly to languages other than English. Switching to a more robust representation and setting a higher bitrate would help significantly.
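
To make the bitrate point concrete: for a discrete codec, the token stream carries roughly frame rate × number of codebooks × log2(codebook size) bits per second. Below is a minimal sketch of that arithmetic. The function name and the two configurations are illustrative assumptions, not settings taken from lina-speech or Vocos.

```python
import math


def token_bitrate_bps(frame_rate_hz: float, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second carried by a stream of discrete codec tokens."""
    bits_per_token = math.log2(codebook_size)
    return frame_rate_hz * n_codebooks * bits_per_token


# Hypothetical configurations, for illustration only.
low = token_bitrate_bps(frame_rate_hz=75, n_codebooks=2, codebook_size=1024)
high = token_bitrate_bps(frame_rate_hz=75, n_codebooks=8, codebook_size=1024)

print(f"low:  {low / 1000:.1f} kbps")   # 1.5 kbps
print(f"high: {high / 1000:.1f} kbps")  # 6.0 kbps
```

With the frame rate and codebook size fixed, adding codebooks raises the bitrate linearly. Each extra codebook gives the decoder more detail to reconstruct from, which is one plausible way to recover some of the missing high-frequency content.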