huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Some questions about preparing multilingual training from scratch #97

Open acul3 opened 1 month ago

acul3 commented 1 month ago

Congrats on releasing Parler-TTS v2, @sanchit-gandhi, @ylacombe, and everyone else involved!

I am trying to reproduce the training in a multilingual setting, and I have a few questions before I start:

  1. Is it necessary (or worth it) to change the text encoder to support multilingual speech? Parler-TTS uses Flan-T5; is there an encoder you would recommend starting with for multilingual training?

  2. What about EnCodec/DAC? I believe changing it is not really necessary, since EnCodec/DAC work at a low level (CMIIW). Or would I have to train or fine-tune a speech tokenizer such as FACodec or SpeechTokenizer?

  3. Will adding more noisy data make the output more robust?

I am planning to use 8k hours of data in my language (Malay) and mix it with ~20k hours of English data (subset of MLS English, GigaSpeech, and LibriTTS).
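To make the proposed mix concrete, here is a minimal sketch of the per-language sampling share implied by 8k hours of Malay and ~20k hours of English. The `sampling_probs` helper and its `alpha` smoothing exponent are hypothetical and not part of Parler-TTS; exponent smoothing is only one common way to upweight a smaller language and is not something this thread proposes.

```python
def sampling_probs(hours, alpha=1.0):
    """Per-language sampling probabilities proportional to hours ** alpha.

    alpha=1.0 samples in proportion to the raw hours;
    alpha<1.0 flattens the imbalance toward the smaller language.
    """
    weights = {lang: h ** alpha for lang, h in hours.items()}
    total = sum(weights.values())
    return {lang: w / total for lang, w in weights.items()}


# Hours taken from the plan above (the English figure is approximate).
hours = {"malay": 8_000, "english": 20_000}

natural = sampling_probs(hours)              # proportional to raw hours
smoothed = sampling_probs(hours, alpha=0.5)  # hypothetical smoothing choice

for lang in hours:
    print(f"{lang}: natural={natural[lang]:.3f} smoothed={smoothed[lang]:.3f}")
```

With raw proportions, Malay makes up 8/28 of the sampled audio, so a little under a third of each epoch; whether that share needs rebalancing is an empirical question.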

Thank you!

ylacombe commented 1 month ago

Hey @acul3, thanks for opening this issue!

Multilinguality is something we'll try to actively support in the coming weeks.

  1. IMO, it's not worth it. I would keep the description in English (and thus keep the text encoder fixed) and change the prompt tokenizer
  2. DAC can stay the same IMO, it should be agnostic to the language used
  3. As long as you also have clean data, I believe adding noisy data should be ok, this is what we actually do with Parler (most of the dataset is noisy, and 1K hours is really clean)

Let me know how your efforts go! I plan to write more extensively about multilingual fine-tuning in a few weeks.

acul3 commented 1 month ago

@ylacombe thanks for your reply and for confirming my points.

I think one challenge in training Parler is that we need the speaker's gender for each audio sample, because we need it to create the prompt (CMIIW).

My current approach is to train a gender classifier on labeled data in my language (Common Voice is a good start), then use it to label my unlabeled data (maybe keeping only predictions with a confidence of at least 0.9).

If you have other suggestions, feel free to share. Thanks once again!
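The confidence-threshold labeling step could look roughly like the sketch below. It assumes the gender classifier already returns a probability per class for each clip; the `pseudo_label` helper, the file names, and the probabilities are all made up for illustration.

```python
from typing import Dict, List, Optional


def pseudo_label(probs: Dict[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the most likely class if its probability reaches the threshold, else None."""
    label, confidence = max(probs.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else None


# Hypothetical classifier outputs for three unlabeled clips.
predictions = [
    ("clip_001.wav", {"male": 0.97, "female": 0.03}),
    ("clip_002.wav", {"male": 0.55, "female": 0.45}),
    ("clip_003.wav", {"male": 0.08, "female": 0.92}),
]

labeled: Dict[str, str] = {}
rejected: List[str] = []
for path, probs in predictions:
    label = pseudo_label(probs, threshold=0.9)
    if label is None:
        rejected.append(path)   # too uncertain: leave unlabeled or review manually
    else:
        labeled[path] = label

print(labeled)
print(rejected)
```

Clips that fall below the threshold are kept in a separate list rather than dropped silently, so they can be reviewed or re-scored later.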