Hello,

So I fine-tuned VoiceCraft on the French Common Voice dataset. It's quite exciting since it's my first time working on an LLM and on a full audio model (not just spectrogram -> classification like image recognition)! I just want to share some of my thoughts/findings/questions here because I see many open issues about fine-tuning; hopefully @jasonppy can also provide some insights/suggestions!
data preparation
I already answered under this issue: https://github.com/jasonppy/VoiceCraft/issues/138. Again, I want to emphasize that while the algorithm itself is more involved and the VoiceCraft model is pretty hairy and intimidating, preparing fine-tuning data is really straightforward. Essentially you need to do the following:
generate the EnCodec codes for each audio file and save them
generate the phoneme sequence for each transcript and save it
modify the model's text embedding weights if the total number of phonemes exceeds the number the pretrained model uses:
the pretrained model uses 80 phonemes, but the embedding size is 101 and the last index is reserved for padding, so if your total phoneme count stays within 100 you don't need to do anything. Otherwise you need to expand this tensor (a sketch follows right after this list).
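To make the embedding-expansion step concrete, here is a minimal sketch in plain PyTorch. It assumes you have already loaded the pretrained text embedding as an `nn.Embedding`; the exact attribute path inside the VoiceCraft checkpoint may differ in your setup, so treat the names here as placeholders.

```python
import torch
import torch.nn as nn

def expand_text_embedding(old_emb: nn.Embedding, new_vocab_size: int) -> nn.Embedding:
    """Grow the phoneme/text embedding table when the new language needs more
    symbols than the pretrained checkpoint provides (80 used, 101 rows, last
    row reserved for padding)."""
    old_vocab_size, dim = old_emb.weight.shape
    if new_vocab_size <= old_vocab_size:
        return old_emb  # phoneme inventory still fits, nothing to do
    # new table: extra rows are randomly initialized, padding moves to the new last index
    new_emb = nn.Embedding(new_vocab_size, dim, padding_idx=new_vocab_size - 1)
    with torch.no_grad():
        # copy the pretrained rows, skipping the old padding row at the end
        new_emb.weight[: old_vocab_size - 1] = old_emb.weight[: old_vocab_size - 1]
    return new_emb
```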
I want to address an issue I found when generating French phonemes. VoiceCraft generates IPA phonemes using the phonemizer package; if you use the same piece of code to generate phonemes for your language, sometimes you will get this:
For the sentence: Il va ensuite se positionner sur le dos de la femelle et s'accoupler.
['i', 'l', '_', 'v', 'a', '_', 'ɑ', '̃', 's', 'y', 'i', 't', '_', 's', 'ə', '_', 'p', 'o', 'z', 'i', 's', 'j', 'ɔ', 'n', 'e', '_', 's', 'y', 'ʁ', '_', 'l', 'ə', '_', '(', 'en', ')', 'd', 'ɒ', 's', '(', 'fr', ')', '_', 'd', 'ə', '_', 'l', 'a', '_', 'f', 'ə', 'm', 'ɛ', 'l', '_', 'e', '_', 's', 'a', 'k', 'u', 'p', 'l', 'e', '.']
You see, the phoneme list contains this (en) and (fr); this is because the phonemizer thinks there is a language switch. Of course these are not real phoneme tokens. To remove them, set the appropriate language-switch flag when calling phonemizer (a hedged example is below).
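For reference, here is how I would call phonemizer with its language-switch option to drop those markers. This is the stock phonemizer API rather than the repo's own tokenizer wrapper, so adapt it to wherever the repo actually calls phonemize.

```python
from phonemizer import phonemize

sentence = "Il va ensuite se positionner sur le dos de la femelle et s'accoupler."

# language_switch controls what happens when espeak detects a switch:
#   "keep-flags" (default) keeps the (en)/(fr) markers,
#   "remove-flags" drops the markers but keeps the phonemes,
#   "remove-utterance" discards the whole sentence instead.
phones = phonemize(
    sentence,
    language="fr-fr",
    backend="espeak",
    language_switch="remove-flags",
)
print(phones)
```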
training code related
If you go through steps/train_utils.py, you will see that training batches are not created with fixed sizes. Batches are created such that:
each batch processes roughly max_token_num tokens
all sequences in a batch have roughly the same length.
Once a batch is distributed to a GPU process, it is further split into multiple gradient-accumulation steps. However, for whatever reason, THIS DID NOT WORK WELL ON MY GPU SETUP. I'm training on 8xL4, and I always got OOM errors even when I set the accumulation steps to a very high number. So I rewrote a portion of the sampler: instead of building a large batch of 10000 tokens and splitting it into 10+ small steps, I make the sampler directly produce batches of at most 1000 tokens and do a gradient update every 10 batches. The difference between the two methods is that I can now control exactly how many tokens I process in a single step (a rough sketch is below). If you have smaller GPUs and run into similar issues, you can do what I did.
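For anyone who wants to do the same, here is roughly what I mean; the class name and the way it plugs into a DataLoader are my own illustration, not the repo's code.

```python
from torch.utils.data import DataLoader

class TokenBudgetBatches:
    """Yield lists of dataset indices whose total token count stays under
    `max_tokens`, so one batch == one forward pass that is guaranteed to fit
    in GPU memory. Pass an instance as `batch_sampler=` to a DataLoader."""

    def __init__(self, lengths, max_tokens=1000):
        self.lengths = lengths          # token length of each example
        self.max_tokens = max_tokens

    def __iter__(self):
        # sort by length so sequences in a batch have similar lengths (less padding)
        order = sorted(range(len(self.lengths)), key=lambda i: self.lengths[i])
        batch, budget = [], 0
        for idx in order:
            if batch and budget + self.lengths[idx] > self.max_tokens:
                yield batch
                batch, budget = [], 0
            batch.append(idx)
            budget += self.lengths[idx]
        if batch:
            yield batch

# tiny demo: five examples with these token lengths, 1000-token budget per batch
print(list(TokenBudgetBatches([300, 800, 200, 950, 100], max_tokens=1000)))
# usage: loader = DataLoader(dataset, batch_sampler=TokenBudgetBatches(lengths, 1000))
```

Stepping the optimizer every 10 of these small batches gives roughly the same effective batch size as before, but the memory peak of each forward pass is bounded by max_tokens.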
Training
One thing I think would be beneficial for people is if @jasonppy could put training curves in the paper or in the repository so we know what to expect. Since this is my first time training an LLM, I had no idea what to expect. My training curve looks like the one below after 5 days. I saw a top-10 accuracy of 0.56 and thought this was horrible!! For the past two days I've been reviewing/validating the entire data generation/training process. Today I started to wonder what the actual loss/accuracy is when the model is trained on GigaSpeech, so I computed loss and accuracy on 4 GigaSpeech examples... and it turns out the returned loss and accuracy are worse than the values I currently have.
Then I realized that you are not supposed to get super high accuracy in the first place, because there are infinitely many ways to say a given sentence...
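For anyone else confused by the metric: assuming top-10 accuracy is computed the usual way (is the ground-truth codec token among the model's 10 highest logits at each position?), it looks roughly like the function below. This is my own illustration, not the repo's code.

```python
import torch

def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, k: int = 10) -> float:
    """logits: (num_positions, vocab_size), targets: (num_positions,)."""
    topk = logits.topk(k, dim=-1).indices                 # (num_positions, k)
    hits = (topk == targets.unsqueeze(-1)).any(dim=-1)    # True where the target is in the top k
    return hits.float().mean().item()

# tiny demo with random numbers
print(topk_accuracy(torch.randn(16, 2048), torch.randint(0, 2048, (16,))))
```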
how it works
It works about as well as (and shares similar problems with) the model trained on English! Also, since the Common Voice French dataset has its own problems, I think that for a fully functional French model we probably need to curate a higher-quality dataset with more diverse intonation.
I guess the biggest problem now is that the tempo of the generated speech is not very realistic, especially the long pauses. I know the paper suggests generating multiple samples and picking the shortest one. I'm wondering if we could do the following:
at generation time, set the silence-token logits to -inf if multiple silence tokens are generated one after another (a rough sketch of this is at the end of this list)... but then you need to know a priori which portions are not supposed to have long pauses... it could also be done as a post-process:
generate the utterance
run forced alignment to get word timestamps
find unnecessarily large gaps...
remove them? or restart generation from the gap?
it seems it would take quite a long time to do all of this,
at training time, instead of putting in one mask token, why don't we put in multiple tokens that represent the tempo and intonation of the masked portion... so that at inference time these can be used as control tokens...
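To make the first idea a bit more concrete, here is a rough sketch of the logit masking I have in mind. The silence-token ids are hypothetical placeholders (you would have to look up which codes your codec actually emits for silence), and the run-length bookkeeping would live inside the sampling loop.

```python
import torch

SILENCE_TOKENS = [63, 921]   # hypothetical ids; replace with the codes your codec emits for silence

def suppress_long_pauses(logits: torch.Tensor, recent_tokens: list, max_run: int = 3) -> torch.Tensor:
    """If the last `max_run` sampled tokens were all silence, push the silence
    logits to -inf so the next sample cannot extend the pause further."""
    if len(recent_tokens) >= max_run and all(t in SILENCE_TOKENS for t in recent_tokens[-max_run:]):
        logits = logits.clone()
        logits[SILENCE_TOKENS] = float("-inf")
    return logits
```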
Anyway, thanks again to @jasonppy for this work!