metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS
https://themetavoice.xyz/
Apache License 2.0

Finetuning 1B first-stage on non-English datasets: thoughts #157

Open Ar4ikov opened 1 month ago

Ar4ikov commented 1 month ago

As per my original Discord message:

Hello everyone! I am fine-tuning the model on a non-English language. The dataset consists of 200 hours of audio recordings. Following the guideline (https://github.com/metavoiceio/metavoice-src/issues/70#issuecomment-1957337895) and the latest updates, in particular the fam/llm/finetune.py script, I have set the following:

The model training is currently in progress, and I'm monitoring the training_loss, which barely drops below 2.000 and currently hovers around ~2.200.

Regarding the dataset: it's a mix of various data, including:

What am I doing wrong here? The results, in my opinion, will not be impressive. I currently don't have datasets with a mean duration of ~30s; will this significantly impact training, or is the problem in the training process itself rather than in the dataset selection stage?
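(For context on the duration question, a minimal sketch for checking mean clip duration and total hours before training, assuming a hypothetical two-column CSV manifest with `audio_path` and `duration_s` columns; the column names are illustrative and not taken from the repo:)

```python
import csv

# Hypothetical manifest format: audio_path,duration_s
# (illustrative only, not the exact format used by fam/llm/finetune.py).
MANIFEST = "train_manifest.csv"

durations = []
with open(MANIFEST, newline="") as f:
    for row in csv.DictReader(f):
        durations.append(float(row["duration_s"]))

total_hours = sum(durations) / 3600
mean_s = sum(durations) / len(durations)
print(f"{len(durations)} clips, {total_hours:.1f} h total, mean {mean_s:.1f} s/clip")
```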

Also, such a high loss during training does not seem right to me. I would like to hear your opinion on it, and what results you have obtained on your own data with the latest published model checkpoints. I will post the wandb report in the thread!

Ar4ikov commented 1 month ago

As promised, here are the training results: https://api.wandb.ai/links/socialcode_donstu/9l38gko0

Additionally:

The result is not very impressive, as I suspected. The model now clones voice features but produces gibberish for non-English phrases. Moreover, the output for English has degraded significantly, with the model struggling to generate words or producing highly noisy outputs.

If you have questions about the setup used for training:

- 1x RTX 3090 @ 24GB
- batch_size: 32
- total training size: 166,249 samples @ mean duration = 5s (~200 hours of speech total)
- 2 epochs with evaluation = ~19 hours
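(For reference, the step count implied by this setup can be sanity-checked with a few lines of plain arithmetic derived only from the numbers above, not from the training script:)

```python
samples = 166_249      # total training samples reported above
batch_size = 32
epochs = 2
mean_clip_s = 5

steps_per_epoch = samples // batch_size      # ~5,195 optimizer steps per epoch
total_steps = steps_per_epoch * epochs       # ~10,390 steps over 2 epochs
# ~231 h if every clip were exactly 5 s; the reported ~200 h implies a
# mean duration slightly under 5 s per clip.
total_hours = samples * mean_clip_s / 3600

print(steps_per_epoch, total_steps, round(total_hours, 1))
```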

njawh commented 2 weeks ago

Hello. I read your explanation with great interest.

I also want to fine-tune the model for a language other than English. Regarding the tokenizer part of your explanation, could you tell me how you wrote the code for the step "We trained a new BPE Tokenizer"?

I would really appreciate your response.

Ar4ikov commented 1 week ago

@njawh Hello!

https://gist.github.com/Ar4ikov/8b22ee3ef952140611510b17c2f3f000
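(For readers who don't want to open the gist: a generic BPE-tokenizer training sketch using the Hugging Face `tokenizers` library is shown below. It illustrates the general approach only and is not necessarily identical to the gist; the corpus file, vocab size, and special tokens are placeholders.)

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Train a byte-level BPE tokenizer on a plain-text corpus in the target
# language. File name, vocab size, and special tokens are illustrative.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=2048,
    special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("new_bpe_tokenizer.json")

# Quick check that the new tokenizer covers the target language.
print(tokenizer.encode("пример текста").tokens)
```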