jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

Finetuning #70

Open rlenain opened 5 months ago

rlenain commented 5 months ago

Hello, and thanks a lot for your great work.

I am finetuning your 830M model on custom data, but getting to a point where I am overfitting fairly quickly.

I was wondering, are there training details you recommend for finetuning? Maybe specific learning rate, or parts of the model to freeze? More generally, do you have an idea of how many hours would be required in order to not overfit with this size of model?

Thanks a lot!

jasonppy commented 5 months ago

Thanks!

I haven't experimented with finetuning much. You could use the 330M model instead. I tried finetuning on as little as the 550h LibriTTS and it didn't seem to overfit. I used the AdamW optimizer (directly supported by the codebase), lr=1e-5, max_num_tokens=20000. I haven't tried freezing part of the model, but that sounds very reasonable (LoRA is another option).
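Freezing by name prefix is straightforward to prototype. A minimal sketch of the idea, assuming hypothetical module names (inspect `model.named_parameters()` in the codebase for the real ones):

```python
def split_params(named_params, freeze_prefixes=("text_embedding",)):
    # Partition parameter names into frozen vs. trainable by prefix.
    # "text_embedding" is a hypothetical prefix, not VoiceCraft's actual
    # module name -- check model.named_parameters() to pick real ones.
    frozen, trainable = [], []
    for name, param in named_params:
        (frozen if name.startswith(freeze_prefixes) else trainable).append(name)
    return frozen, trainable

# The trainable subset would then be handed to the optimizer,
# e.g. torch.optim.AdamW(trainable_params, lr=1e-5).
params = [("text_embedding.weight", None), ("decoder.layers.0.attn.weight", None)]
frozen, trainable = split_params(params)
```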

danablend commented 5 months ago

> Thanks!
>
> I haven't experimented with finetuning much. You could use the 330M model instead. I tried finetuning on as little as the 550h LibriTTS and it didn't seem to overfit. I used the AdamW optimizer (directly supported by the codebase), lr=1e-5, max_num_tokens=20000. I haven't tried freezing part of the model, but that sounds very reasonable (LoRA is another option).

Hey! Curious whether you have an idea of which layers would be ideal to apply LoRA finetuning to, and if so, how many of them?

For example, would it be suitable to train LoRA layers on the first N and last N attention layers in the TransformerEncoder (with N found by trial and error or sweeping), or would a different approach better capture a new speaker's characteristics such as tonality and prosody?

One super useful case for LoRA would be the ability to quickly swap between finetuned adapters at runtime with less overhead than swapping a complete model :)

Thank you!
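Which layers to adapt is left open in this thread; purely to illustrate the mechanics (this is a generic sketch, not VoiceCraft's implementation), a dependency-free LoRA linear layer might look like:

```python
import random

def matvec(W, x):
    # Dense matrix-vector product over plain lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

class LoRALinear:
    def __init__(self, W, r=4, alpha=8, seed=0):
        # Frozen base weight W (d_out x d_in) plus trainable low-rank
        # factors: y = W x + (alpha / r) * B (A x).
        d_out, d_in = len(W), len(W[0])
        rnd = random.Random(seed)
        self.W = W
        self.scale = alpha / r
        self.A = [[rnd.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(r)]
        # B starts at zero, so at init the adapter is a no-op and
        # finetuning begins exactly at the pretrained model.
        self.B = [[0.0] * r for _ in range(d_out)]

    def __call__(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

# Swapping speakers at runtime then means swapping only the small
# per-layer (A, B) pairs, not the whole checkpoint.
layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]])
```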

rishikksh20 commented 4 months ago

Hi @jasonppy, I am finetuning the 330M TTS model on multilingual data; here is the TensorBoard image. I am finetuning on a single A6000 with max_num_tokens of 10k and gradient accumulation of 24. Do you think the curve looks good? And what is a good value for loss and top10?

jasonppy commented 4 months ago

Looks pretty good! I think it's worth generating a few samples.

I don't really know what a good value for loss and top10 is, as different data have different levels of difficulty. For English-only data, audiobooks usually get cb1 top10 of 60 or higher during training; GigaSpeech gets ~55 for cb1.

max_num_tokens is per GPU, so if you run it on 4 GPUs, the effective number of tokens consumed in one gradient update is 40k.
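The arithmetic in that last point, written out (using the 10k per-GPU setting from the comment above):

```python
def tokens_per_update(max_num_tokens, n_gpus=1, grad_accum=1):
    # Effective tokens per gradient update: the per-GPU token cap,
    # times the number of GPUs, times gradient-accumulation steps.
    return max_num_tokens * n_gpus * grad_accum

four_gpus = tokens_per_update(10_000, n_gpus=4)       # jasonppy's 40k example
single_a6000 = tokens_per_update(10_000, grad_accum=24)  # rishikksh20's setup
```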

rishikksh20 commented 4 months ago

@jasonppy Yes, at 52-55% it starts producing good voices. I'll also let you know that it works great with multilingual data. I finetuned it on data in 3 languages, and even when I combine phonemes from different languages into a single line it speaks them perfectly, like code-switching language models. Kudos for the great work.

rikabi89 commented 4 months ago

> @jasonppy Yes, at 52-55% it starts producing good voices. I'll also let you know that it works great with multilingual data. I finetuned it on data in 3 languages, and even when I combine phonemes from different languages into a single line it speaks them perfectly, like code-switching language models. Kudos for the great work.

Would you mind sharing how you set up a custom dataset for finetuning, e.g., any scripts?

jasonppy commented 4 months ago

> @jasonppy Yes, at 52-55% it starts producing good voices. I'll also let you know that it works great with multilingual data. I finetuned it on data in 3 languages, and even when I combine phonemes from different languages into a single line it speaks them perfectly, like code-switching language models. Kudos for the great work.

I plan to post an update covering the newly uploaded 830M TTS enhanced model, community efforts on Gradio and Replicate apps, multi-span editing, VRAM reduction, etc. Would you like to give me a pointer to your multilingual work? It would be great if I could also include it in the announcement.

rishikksh20 commented 4 months ago

Hi @jasonppy, for multilingual I don't do anything extra; I just rely on espeak-ng phonemes. I create the dataset with Phonemizer's language-specific phonemes, mix all the multilingual datasets, and merge the phoneme vocabs using this repo's pre-processing scripts. Then I simply update the phoneme embedding of the pretrained model and train. I trained on 3 different languages + accents (Hindi, English, and Indian English). Sometimes the prompt speaker's accent bleeds into the target despite the language switch, but most of the time it worked: if I passed a native English speaker's prompt and generated a Hindi transcript, it didn't carry the US accent over into Hindi or Indian English, and vice versa.
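The vocab-merging step described above can be sketched in a few lines; this is an illustration of the idea, not the repo's actual pre-processing script:

```python
def merge_phoneme_vocabs(*vocabs):
    # Union the per-language phoneme inventories into one symbol -> id map.
    # IPA symbols shared across languages collapse to a single id, so the
    # resized phoneme-embedding table has one row per unique symbol.
    symbols = sorted(set().union(*vocabs))
    return {sym: idx for idx, sym in enumerate(symbols)}

# Toy inventories for illustration (real ones come from the phonemizer output).
hindi = {"ə", "k", "a"}
english = {"k", "æ", "t"}
vocab = merge_phoneme_vocabs(hindi, english)
```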

rishikksh20 commented 4 months ago

But I think when we include lots of languages and accents it might not work as intended, because many IPA phonemes are shared between languages, so it might be necessary to pass a token along with the transcript to signal the accent and language for better pronunciation.
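That language/accent token could be as simple as a prefix on the phoneme sequence. A sketch with an illustrative token format (not something the repo implements):

```python
def tag_transcript(phonemes, lang, accent=None):
    # Prepend hypothetical control tokens so the model can disambiguate
    # IPA symbols shared across languages; the token format is made up
    # for illustration, and the new symbols would need embedding rows too.
    tokens = [f"<lang:{lang}>"]
    if accent is not None:
        tokens.append(f"<accent:{accent}>")
    return tokens + list(phonemes)
```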

rikabi89 commented 4 months ago

> Hello, and thanks a lot for your great work.
>
> I am finetuning your 830M model on custom data, but getting to a point where I am overfitting fairly quickly.
>
> I was wondering, are there training details you recommend for finetuning? Maybe specific learning rate, or parts of the model to freeze? More generally, do you have an idea of how many hours would be required in order to not overfit with this size of model?
>
> Thanks a lot!

Hi, would you mind sharing how you set up a custom dataset for finetuning, e.g., any scripts?

thivux commented 1 month ago

@rishikksh20 Hi, I'm curious about the duration of the dataset you used for multilingual finetuning. I am currently finetuning the model on 450 hours of Vietnamese + 115 hours of English, but the inference results are very sensitive to hyperparams (seed, stop_repetition, and sample_batch_size). Did you have a hard time tuning those hyperparams to get good results?
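One low-effort way to cope with that sensitivity is to audition a small grid over the three arguments named above; the candidate values here are arbitrary placeholders, not recommended settings:

```python
from itertools import product

def inference_grid(seeds=(1, 7, 42), stop_repetition=(-1, 2, 3),
                   sample_batch_size=(1, 2, 4)):
    # Enumerate every combo of the three sensitive inference args;
    # each dict would be passed to the inference script, and the
    # resulting audio auditioned by ear.
    return [{"seed": s, "stop_repetition": r, "sample_batch_size": b}
            for s, r, b in product(seeds, stop_repetition, sample_batch_size)]

grid = inference_grid()
```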

rishikksh20 commented 1 month ago

@thivux Yes, it's sensitive to the hyperparams, but with the right settings it gives good performance.

Me210400 commented 1 month ago

> Hi @jasonppy, I am finetuning the 330M TTS model on multilingual data; here is the TensorBoard image. I am finetuning on a single A6000 with max_num_tokens of 10k and gradient accumulation of 24. Do you think the curve looks good? And what is a good value for loss and top10?

How did you prepare your custom dataset?

Magauiya commented 1 week ago

Hi @rishikksh20! Thank you for sharing your insights on the multilingual scenario! Have you tried training from scratch and comparing it with the finetuned model? Also, how do you assess (with which metrics) the quality of your finetuned model?

rishikksh20 commented 1 week ago

Nope, I only finetuned the model; training from scratch would require too much data and compute. Listening to files generated from a diverse set of paragraphs is the only way I check quality.