jasonppy / VoiceCraft

Zero-Shot Speech Editing and Text-to-Speech in the Wild

I finetuned voicecraft on commonvoice-french, here are some of my findings/thoughts #154

Open zmy1116 opened 2 months ago

zmy1116 commented 2 months ago

Hello,

So I finetuned VoiceCraft on the French Common Voice dataset. It's quite exciting since it's my first time working on an LLM and on a full audio model (not just spectrogram -> classification, like image recognition)! I just want to share some of my thoughts/findings/questions here because I see many open issues about finetuning; hopefully @jasonppy can also provide some insights/suggestions!

data preparation

I already answered under this issue: https://github.com/jasonppy/VoiceCraft/issues/138. Again, I want to emphasize that while the algorithm itself is fairly involved and the VoiceCraft model is pretty hairy and intimidating, preparing finetuning data is really straightforward. Essentially you need to do the following:

I want to address an issue I found when generating French phonemes. VoiceCraft generates IPA phonemes using the phonemizer package; if you use the same piece of code to generate phonemes for your language, sometimes you will get this:

for sentence:  Il va ensuite se positionner sur le dos de la femelle et s'accoupler.
['i', 'l', '_', 'v', 'a', '_', 'ɑ', '̃', 's', 'y', 'i', 't', '_', 's', 'ə', '_', 'p', 'o', 'z', 'i', 's', 'j', 'ɔ', 'n', 'e', '_', 's', 'y', 'ʁ', '_', 'l', 'ə', '_', '(', 'en', ')', 'd', 'ɒ', 's', '(', 'fr', ')', '_', 'd', 'ə', '_', 'l', 'a', '_', 'f', 'ə', 'm', 'ɛ', 'l', '_', 'e', '_', 's', 'a', 'k', 'u', 'p', 'l', 'e', '.']

You can see that the phoneme set contains these (en) and (fr) markers; this is because the phonemizer thinks there is a language switch. Of course these are not true phoneme tokens. To remove them, set the flag:

text_tokenizer = TextTokenizer(language='fr-fr', language_switch='remove-flags')
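For reference, here is a minimal sketch of the same fix using the phonemizer package directly (which, as far as I understand, TextTokenizer wraps); the exact arguments here are my own illustration, not the repo's code:

```python
# A minimal sketch (my own, not from the repo) using the phonemizer package
# directly. The 'remove-flags' option strips the (en)/(fr) language-switch
# markers that espeak inserts when it thinks the language changes.
from phonemizer.backend import EspeakBackend

backend = EspeakBackend('fr-fr', language_switch='remove-flags')
sentence = "Il va ensuite se positionner sur le dos de la femelle et s'accoupler."
print(backend.phonemize([sentence]))
```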

training code related

If you go through steps/train_utils.py, you will see that training batches are not created with a fixed number of utterances. Instead, batches are built dynamically so that the total number of tokens per batch stays under a configured budget (around 10,000 tokens in my setup).

Once a batch is distributed to a GPU process, it is further split across multiple gradient-accumulation steps. However, THIS DID NOT WORK WELL ON MY GPU SETUP. I'm training on 8x L4, and for whatever reason I always get OOM errors even if I set the accumulation steps to a very high number. Therefore, I rewrote a portion of the sampler: instead of having a large batch of 10,000 tokens that is then split into 10+ small steps, I make the sampler directly produce batches with at most 1,000 tokens, and I do a gradient update every 10 batches. The difference between the two methods is that now I can control exactly how many tokens I process within a single step (see the sketch below). If you have smaller GPUs and encounter similar issues, you can do what I did.
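A rough sketch of the change (not my actual code; `lengths`, `max_tokens`, and `accum_steps` are illustrative names, and the real sampler in the repo has more bookkeeping):

```python
# Sketch: a batch sampler that caps the total token count per batch, combined
# with a plain gradient-accumulation loop over small batches.
import torch
from torch.utils.data import Sampler


class TokenBudgetBatchSampler(Sampler):
    """Yields lists of dataset indices whose summed lengths stay under a token budget."""

    def __init__(self, lengths, max_tokens=1000, shuffle=True):
        self.lengths = lengths          # number of tokens per utterance
        self.max_tokens = max_tokens    # hard cap per batch (per forward pass)
        self.shuffle = shuffle

    def __iter__(self):
        order = torch.randperm(len(self.lengths)).tolist() if self.shuffle else range(len(self.lengths))
        batch, total = [], 0
        for idx in order:
            n = self.lengths[idx]
            if batch and total + n > self.max_tokens:
                yield batch
                batch, total = [], 0
            batch.append(idx)
            total += n
        if batch:
            yield batch


def train(model, loader, optimizer, accum_steps=10):
    # Update weights every `accum_steps` small batches instead of splitting one
    # huge batch inside a single training step.
    optimizer.zero_grad()
    for i, batch in enumerate(loader):
        loss = model(batch) / accum_steps
        loss.backward()
        if (i + 1) % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```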

Training

One thing I think would be beneficial for people is if @jasonppy could put training curves in the paper or in the repository so we know what to expect. Since this is my first time training an LLM, I had no idea what to expect. My training curves after 5 days looked like the screenshot below. I saw a top-10 accuracy of 0.56 and thought this was horrible!! For the past two days I've been reviewing/validating the entire data generation/training process. Today I started to wonder what the actual loss/accuracy is when the model is trained on GigaSpeech, so I computed loss and accuracy on 4 GigaSpeech examples... it turns out the returned loss and accuracy are worse than the values I currently have.

Then I realized that you are not supposed to get super high accuracy in the first place, because there are an infinite number of ways to say a given sentence...

[training curve screenshot]
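For anyone else confused by the metric, here is my understanding of what the reported top-10 accuracy measures (a sketch, not the repo's exact metric code): the fraction of positions where the ground-truth codec token is among the model's 10 highest-probability predictions.

```python
# Sketch of a top-k accuracy computation over codec-token predictions
# (my own illustration, not the repo's implementation).
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 10) -> float:
    # logits: (num_positions, vocab_size), targets: (num_positions,)
    topk = logits.topk(k, dim=-1).indices            # (num_positions, k)
    hits = (topk == targets.unsqueeze(-1)).any(-1)   # (num_positions,)
    return hits.float().mean().item()
```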

how well it works

It works about as well as (and shares similar problems with) the model trained on English! Also, since the Common Voice French dataset has its own problems, I think that for a fully functional French model we probably need to curate a higher-quality dataset with more diverse intonation.

I guess the biggest problem now is that the tempo of the generated speech is not so realistic, especially the long pauses. I know the paper suggests generating multiple samples and picking the shortest one. I'm wondering if we can do the following:
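As a side note on the paper's suggestion above (sample several candidates and keep the shortest), here is a rough sketch of that trick; `generate_speech` and its arguments are placeholders, not the repo's actual inference API:

```python
# Rough sketch of "sample several, keep the shortest" to avoid unnaturally
# long pauses. `generate_speech` stands in for whatever inference call you use.
def pick_shortest(generate_speech, prompt, target_text, n_samples=5):
    candidates = [generate_speech(prompt, target_text) for _ in range(n_samples)]
    # Each candidate is assumed to be a 1-D waveform; the shortest one usually
    # has the fewest dead pauses.
    return min(candidates, key=len)
```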

Anyway, thanks again to @jasonppy for this work!

Revln9 commented 2 months ago

Any chance you can share that model? I'd love to try VoiceCraft with a different language ^^

Thanks for the feedback. It will for sure help a lot of people!