Open rishikksh20 opened 6 months ago
@jasonppy Have you tried using Vocos for decoding instead of the Encodec decoder? For one, it upsamples the output to 24 kHz, which leads to clearer, crisper, better voice quality.
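For reference, swapping in Vocos is mostly a decode-side change. A minimal sketch, assuming the `vocos` package is installed; the model name and `codes_to_features`/`decode` calls follow the Vocos project README, so treat the exact API as an assumption:

```python
def decode_with_vocos(codes):
    """Decode EnCodec token codes to 24 kHz audio with the Vocos vocoder.

    Sketch only -- assumes `vocos` and `torch` are installed; the
    pretrained model name comes from the Vocos README.
    `codes` is the (n_codebooks, T) tensor of EnCodec token ids that the
    TTS model already produces; only the decoder is swapped.
    """
    import torch
    from vocos import Vocos

    vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")
    features = vocos.codes_to_features(codes)   # token ids -> continuous features
    bandwidth_id = torch.tensor([2])            # per the README, index 2 ~ 6 kbps (8 codebooks)
    return vocos.decode(features, bandwidth_id=bandwidth_id)
```

The generated token stream stays the same; only the waveform reconstruction changes, which is why this is a cheap experiment to try.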
Another suggestion that might improve the audio quality is to replace Encodec entirely with DAC, similar to Parler-TTS (https://github.com/huggingface/parler-tts/blob/main/parler_tts/dac_wrapper/configuration_dac.py). That yields 44.1 kHz audio at 8 kbps bandwidth.
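The bitrates quoted above follow from the residual-vector-quantization setup of each codec (frame rate × number of codebooks × bits per code). A quick back-of-the-envelope check, assuming the commonly cited configurations (EnCodec at 24 kHz: 75 frames/s, 8 codebooks of 1024 entries; DAC at 44.1 kHz: hop size 512, 9 codebooks of 1024 entries):

```python
import math

def rvq_bitrate_kbps(frame_rate_hz, n_codebooks, codebook_size):
    """Bitrate of a residual-vector-quantized token stream in kbps."""
    bits_per_code = math.log2(codebook_size)  # 1024 entries -> 10 bits per code
    return frame_rate_hz * n_codebooks * bits_per_code / 1000

# EnCodec @ 24 kHz: 75 frames/s * 8 codebooks * 10 bits = 6 kbps
encodec_kbps = rvq_bitrate_kbps(75, 8, 1024)

# DAC @ 44.1 kHz: 44100/512 ~ 86 frames/s * 9 codebooks * 10 bits ~ 7.75 kbps,
# which is why it is usually rounded to "8 kbps"
dac_kbps = rvq_bitrate_kbps(44100 / 512, 9, 1024)
```

So DAC spends only slightly more bits than EnCodec while reconstructing at nearly three times the sample rate, which is the appeal of the swap.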
Hi @rishikksh20, I really like your ideas. I strongly believe this model has great potential, but sadly the output audio quality is quite poor, being limited to 16000 Hz. Even with high-quality input audio, you have to accept that the output will be quite dirty. Were you able to do any training or testing on what you proposed?
If you don't mind, could you share a notebook (if any) with the code you used for the fine-tune you mentioned? I'd like to try a fine-tune too.
Thank you very much.
Hi @jasonppy, have you looked at Hugging Face's Data-Speech, 10k hours of clean, curated TTS data: https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated . I think training the 830M model on this data would result in excellent and robust samples. I am planning to do some multilingual training on a large dataset. I have fine-tuned the 330M model on 1k hours of multilingual data, and the good news is that it worked well and also preserves accents when we use multilingual lines for TTS.