jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

VoiceCraft with Parler-TTS's 10K hours speech data #105

Open rishikksh20 opened 6 months ago

rishikksh20 commented 6 months ago

Hi @jasonppy, have you looked at Hugging Face's Data-Speech, a 10k-hour clean, curated TTS dataset: https://huggingface.co/datasets/parler-tts/mls-eng-10k-tags_tagged_10k_generated ? I think training the 830M model on this data will yield excellent and robust samples. I am planning to do some multilingual training on a large dataset. I have fine-tuned the 330M model on 1k hours of multilingual data, and the good news is it worked well and also preserved accents when we used multilingual lines for TTS.

rishikksh20 commented 6 months ago

@jasonppy Have you tried using Vocos for the decoding step instead of the EnCodec decoder? For one, it upsamples the output to 24 kHz, which leads to clearer, crisper, better voice quality.

rishikksh20 commented 6 months ago

Another suggestion that might improve audio quality is to replace EnCodec entirely with DAC, similar to Parler-TTS (https://github.com/huggingface/parler-tts/blob/main/parler_tts/dac_wrapper/configuration_dac.py). It produces 44.1 kHz audio at 8 kbps bandwidth.
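The ~8 kbps figure follows from DAC's 44.1 kHz configuration. Assuming the values from the descript-audio-codec release (hop length 512, 9 codebooks of 1024 entries, i.e. 10 bits each), the arithmetic works out as:

```python
import math

# Assumed DAC 44.1 kHz config (descript-audio-codec release):
sample_rate = 44_100
hop_length = 512          # samples per codec frame
n_codebooks = 9
bits_per_code = int(math.log2(1024))  # 10 bits per codebook entry

frame_rate = sample_rate / hop_length           # ~86.13 frames per second
bitrate = frame_rate * n_codebooks * bits_per_code

print(round(bitrate))  # ~7752 bps, i.e. the ~8 kbps quoted above
```

The trade-off for VoiceCraft is sequence length: more codebooks per frame means more tokens for the language model to predict, so the quality gain comes at some cost in generation speed.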

Sweetapocalyps3 commented 5 months ago

Hi @rishikksh20, I really like your ideas. I strongly believe this model has great potential, but sadly the output audio quality is quite poor, being limited to only 16 kHz. Even with high-quality input audio, you have to accept that the output will be quite noisy. Were you able to do any training or testing on what you proposed?

If you don't mind, could you share a notebook (if you have one) with the code you used for the fine-tuning you mentioned? I'd like to try fine-tuning as well.

Thank you very much.