metavoiceio / metavoice-src

Foundational model for human-like, expressive TTS
https://themetavoice.xyz/
Apache License 2.0

Rough fine-tuning guidance #70

Open RonanKMcGovern opened 6 months ago

RonanKMcGovern commented 6 months ago

I know the repo ReadMe says "soon", but would it be possible to give some very rough advice on how to fine-tune to improve on the voice's match with a custom speaker?

I guess the demo is just extracting embeddings from bria.mp3, but I'd like to go one step further to get a better voice match. Thanks.

vatsalaggarwal commented 6 months ago

Hey! Yes, when we released the repo, we thought it'd be the next thing we released, but we've reprioritised and are busy with other things, so we haven't actively started working on releasing finetuning support.

In terms of the voice cloning, have you tried to embed the voice you're trying to clone? What was the issue?
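
For reference, zero-shot cloning just means passing a reference clip as the speaker condition. A minimal sketch, assuming the fast-inference `TTS` helper works the way the README describes (check the README for the exact interface; the text and reference path below are placeholders):

```python
# Rough sketch of zero-shot cloning with a speaker reference clip.
# Assumes the TTS helper described in the repo README; check there for the exact API.
from fam.llm.fast_inference import TTS

tts = TTS()
wav_file = tts.synthesise(
    text="Hello, this is a quick voice cloning test.",
    spk_ref_path="assets/bria.mp3",  # swap in your own reference clip
)
print(wav_file)
```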

In general, we've found that the following works best:

In terms of what's required for finetuning, there are 4 models chained together: i) first stage (text -> 2 hierarchies of EnCodec), ii) second stage (2 hierarchies of EnCodec -> remaining 6 hierarchies of EnCodec), iii) MBD (8 hierarchies of EnCodec -> waveform), iv) DeepFilterNet (cleanup; waveform -> waveform). In our testing, we found that the second stage is fairly robust across speakers and accents; we haven't extensively tested it for non-English languages. So depending on what you're trying to do, I'd recommend focusing primarily on finetuning the first-stage 1B-param model.
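
Roughly, the chain looks like this in pseudocode (every name below is an illustrative placeholder, not an actual API from this repo):

```python
# Illustrative pseudocode of the four-stage pipeline described above.
# Every function name here is a hypothetical placeholder.

def synthesise(text, speaker_embedding):
    # Stage 1 (~1B params): text + speaker embedding -> first 2 EnCodec codebooks.
    coarse_tokens = first_stage(text, speaker_embedding)   # shape ~ [2, T]

    # Stage 2: predict the remaining 6 codebooks from the first 2.
    full_tokens = second_stage(coarse_tokens)               # shape ~ [8, T]

    # Stage 3: multi-band diffusion decodes all 8 codebooks to a waveform.
    waveform = multi_band_diffusion(full_tokens)

    # Stage 4: DeepFilterNet cleans up the result (waveform -> waveform).
    return deepfilternet(waveform)
```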

For that, we need:

I might be missing a few things here as I'm putting this down from memory, but happy to assist with things as they come up. Sorry about the delay on this from our end, but we equally welcome contributions, and would move to support that instead!

RonanKMcGovern commented 6 months ago

Many thanks @vatsalaggarwal .

Re embeddings, I used a 90-second recording of my voice. I used an mp3 file, so perhaps I could have done better there, but I'm Irish, so maybe that was the issue in getting a good match. I'll try an American person's embeddings instead and see.

Yeah, I think I've got it on training. I guess the first stage produces encodec tokens? What do I use to decode those?

BTW, what model are you using for diffusion in the third stage? I don't see it on the model card on HF, but maybe I glanced over it. Thanks

vatsalaggarwal commented 6 months ago

Yeah, and a 44.1/48 kHz WAV or 256 kbps MP3 works better… but it's unlikely the model would be able to produce an Irish accent in one-shot regardless; at least, we didn't focus on that for this release.
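
If it helps, a minimal sketch for getting a reference clip into a 44.1 kHz WAV before embedding, using librosa/soundfile (the file paths here are placeholders):

```python
import librosa
import soundfile as sf

# Load the reference clip and resample to 44.1 kHz mono.
audio, sr = librosa.load("my_voice.mp3", sr=44100, mono=True)

# Write it back out as an uncompressed WAV for speaker conditioning.
sf.write("my_voice_44k.wav", audio, sr)
```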

The first stage produces the first two codebooks of the encodec RVQ.

You can decode these using the second stage + MBD + DeepFilterNet.

We are using multi-band diffusion (MBD) from Audiocraft.
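
For anyone wiring this up themselves, here is a rough sketch of the token-to-waveform path using Audiocraft's MultiBandDiffusion plus DeepFilterNet. The token tensor is a random placeholder, and this assumes the standard public APIs of those two libraries; the repo's own decode path may differ:

```python
import torch
import torchaudio
from audiocraft.models import MultiBandDiffusion
from df.enhance import enhance, init_df

# Placeholder EnCodec codes of shape [batch, 8 codebooks, time],
# i.e. what you'd have after the second stage.
tokens = torch.randint(0, 1024, (1, 8, 750))

# Multi-band diffusion from Audiocraft: EnCodec tokens -> 24 kHz waveform.
mbd = MultiBandDiffusion.get_mbd_24khz(bw=6.0)
wav_24k = mbd.tokens_to_wav(tokens)  # [batch, channels, samples]

# DeepFilterNet cleanup; it runs at its own sample rate (48 kHz), so resample first.
model, df_state, _ = init_df()
wav_48k = torchaudio.functional.resample(wav_24k.squeeze(0), 24_000, df_state.sr())
clean = enhance(model, df_state, wav_48k)
```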

More details are available in https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#architecture and https://github.com/metavoiceio/metavoice-src/issues/70#issuecomment-1957337895

deeprobo-dev commented 6 months ago

Hi, if I want to fine-tune for a language other than English, which stage is most suitable for fine-tuning? Can you please share some insights on that? Also, I'm working on a robotics application and would love to see faster inference with as little VRAM as possible, as I have to run it on an NVIDIA AGX or Orin.

Thanks for the amazing project

hrachkovinovoto commented 6 months ago

Hi, if I want to fine-tune for a language other than English, which stage is most suitable for fine-tuning? Can you please share some insights on that? Also, I'm working on a robotics application and would love to see faster inference with as little VRAM as possible, as I have to run it on an NVIDIA AGX or Orin.

Thanks for the amazing project

+1, it'd be great if it could run on 8 GB of VRAM at the very least.

vatsalaggarwal commented 6 months ago

Hi, if I want to fine-tune for a language other than English, which stage is most suitable for fine-tuning? Can you please share some insights on that? Also, I'm working on a robotics application and would love to see faster inference with as little VRAM as possible, as I have to run it on an NVIDIA AGX or Orin.

I think you'd need to finetune the first stage for sure; I'm not sure if the second stage needs to be finetuned as well, but that's something you'd have to check.
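
As a minimal sketch of "finetune the first stage, leave the second stage frozen" in plain PyTorch (the two modules below are stand-ins; substitute the repo's actual stage models):

```python
import torch
import torch.nn as nn

# Stand-in modules; substitute the repo's actual first/second stage models here.
first_stage = nn.Linear(8, 8)
second_stage = nn.Linear(8, 8)

# Freeze the second stage so only the first-stage model gets updated.
for p in second_stage.parameters():
    p.requires_grad = False

optimizer = torch.optim.AdamW(
    (p for p in first_stage.parameters() if p.requires_grad), lr=3e-5
)
```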

Thanks for the request regarding faster inference and on-device support! How much VRAM and compute (TFLOPs) are available?

deeprobo-dev commented 6 months ago

I think you'd need to finetune the first stage for sure; I'm not sure if the second stage needs to be finetuned as well, but that's something you'd have to check.

Thanks for the request regarding faster inference and on-device support! How much VRAM and compute (TFLOPs) are available?

Thanks for your insights regarding fine-tuning, I will give it a try. The NVIDIA Jetson AGX Xavier offers 32 TFLOPS and the NVIDIA Jetson AGX Orin offers 275 TFLOPS.

maepopi commented 6 months ago

Hello! If I understood this thread correctly, it will soon be possible to finetune a model both as a full checkpoint and as a LoRA, with 12 GB of VRAM (and possibly 8 GB?)? How large should the audio dataset be for each?

thank you, this is very exciting!

RonanKMcGovern commented 6 months ago

Hey! Yes, when we released the repo, we thought it'd be the next thing we released, but we've reprioritised and are busy with other things, so we haven't actively started working on releasing finetuning support.

Just to add to this: I wouldn't underestimate how valuable it would be to release fine-tuning in a simple way. I'm not aware of frameworks that can accurately fine-tune for a specific voice. I'd be keen to make a video on this on youtube.com/@trelisresearch if this becomes possible; it would allow for things like making your own audiobooks.

maepopi commented 6 months ago

Just to add to this: I wouldn't underestimate how valuable it would be to release fine-tuning in a simple way. I'm not aware of frameworks that can accurately fine-tune for a specific voice. I'd be keen to make a video on this on youtube.com/@TrelisResearch if this becomes possible; it would allow for things like making your own audiobooks.

If I may, have you checked out this tool? It's based on Tortoise-TTS and it's really good. I've been playing around with it for months and have come up with pretty good models. I don't think it supports LoRAs though, and I'm starting to think that you may need rather large datasets for finetuning, which is why I'm very interested in the present repo. In addition, MetaVoice seems to provide a slightly better base model than Tortoise (but this would need to be tested further; it's just an impression for now).

danablend commented 6 months ago

Yeah, and a 44.1/48 kHz WAV or 256 kbps MP3 works better… but it's unlikely the model would be able to produce an Irish accent in one-shot regardless; at least, we didn't focus on that for this release.

The first stage produces the first two codebooks of the encodec RVQ.

You can decode these using the second stage + MBD + DeepFilterNet.

We are using multi-band diffusion (MBD) from Audiocraft.

More details are available in https://github.com/metavoiceio/metavoice-src?tab=readme-ov-file#architecture and #70 (comment)

Hey! Thanks very much for your insight.

I'm in the midst of attempting to implement fine-tuning, and I've gotten a very simple script to train, but I could only get it to work by iterating over each data entry in a batch sequentially.

Would you happen to have an idea of what adding batched support might look like, so it can process whole batches at a time during training / fine-tuning?
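
One generic way to go from per-sample to batched training is to pad the token sequences to a common length and mask the padding out of the loss. A sketch in plain PyTorch, not tied to this repo's dataloader (the function names are placeholders):

```python
import torch
import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence

PAD_ID = -100  # matches cross_entropy's default ignore_index

def collate(batch):
    """batch: list of (input_tokens, target_tokens) 1-D LongTensors of varying length."""
    inputs = pad_sequence([b[0] for b in batch], batch_first=True, padding_value=0)
    targets = pad_sequence([b[1] for b in batch], batch_first=True, padding_value=PAD_ID)
    return inputs, targets

def loss_fn(logits, targets):
    """logits: [B, T, vocab]; padded target positions don't contribute to the loss."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=PAD_ID,
    )
```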

Again, very much appreciate your work!

vatsalaggarwal commented 6 months ago

@danablend can you push your code to a PR? I can have a look.

danablend commented 6 months ago

@danablend can you push your code to a PR? I can have a look.

Hey @vatsalaggarwal, I realized that I had made a mistake and needed to build more code to make the training work.

I've spent a few hours working on it, but I get OOMs when attempting to train the model with gradients enabled on an A10G (16 GB VRAM), so I don't know yet whether it works well enough to push to the codebase. How much VRAM did you find you needed to train the model?

I'll play around with it more and see if I can prepare something useful and clean as a base for you to work off, if that would be helpful, and open that as a PR?

vatsalaggarwal commented 6 months ago

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16 GB GPU, but it depends on your config (batch size, optimiser choice, etc)...

If it's in a state where I'll be able to run it, that's ideal, but I'm also happy to look at it in its current state (even if it's super dirty) and give pointers to speed you up.
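
For what it's worth, the usual levers for fitting a ~1B-param finetune into 16 GB are a small micro-batch with gradient accumulation, bf16 autocast, and the optimiser choice (AdamW keeps roughly 8 GB of fp32 state for 1B params, so an 8-bit optimiser or SGD can help). A generic sketch of the loop side, where `model`, `loader`, and `compute_loss` are placeholders rather than this repo's names:

```python
import torch

accum_steps = 8  # effective batch size = micro_batch_size * accum_steps
# NB: consider an 8-bit optimiser if AdamW's state pushes you over the VRAM budget.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

for i, (inputs, targets) in enumerate(loader):  # loader yields small micro-batches
    # bf16 autocast shrinks activation memory; master weights stay fp32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = compute_loss(model, inputs, targets) / accum_steps
    loss.backward()

    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```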

danablend commented 6 months ago

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16 GB GPU, but it depends on your config (batch size, optimiser choice, etc)...

If it's in a state where I'll be able to run it, that's ideal, but I'm also happy to look at it in its current state (even if it's super dirty) and give pointers to speed you up.

I'll open a PR shortly

danablend commented 6 months ago

Really hard to know where the issue is without seeing the code! I think it should be possible to finetune on a 16 GB GPU, but it depends on your config (batch size, optimiser choice, etc)...

If it's in a state where I'll be able to run it, that's ideal, but I'm also happy to look at it in its current state (even if it's super dirty) and give pointers to speed you up.

Just added a PR draft (https://github.com/metavoiceio/metavoice-src/pull/82).

G-force78 commented 6 months ago

Thanks for all the work on the training script; it's way beyond my ability. I am testing it now and am not sure what --val should point to. Is it the dataset CSV file? I have also pointed --train at that. Also, is there a way to set the learning rate and number of steps? (Found it, it's in .fam/llm/config/finetune_params.py.) From experience using Tortoise, I found 2000-2500 steps was the range to aim for when training on 20 minutes of clean audio with no silences.

Edit: OK, so that was correct; you have to set the arguments --train dataset.csv --val valdataset.csv.
Now, how do I save every so often, e.g. 1/4 of the way, 1/2 way, then 3/4, through to completion? So far so good... using a T4 on Google Colab.

Training: loss 5.9145, time 37337.25ms: : 401it [06:31, 11.51s/it]
iter 400: loss 5.9145, time 37337.25ms
Training: loss 5.9145, time 37337.25ms: : 401it [06:31, 1.02it/s]
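
On the checkpoint-saving question above: if the script doesn't expose periodic saving yet, a generic approach is to dump a checkpoint every N iterations inside the training loop. A sketch with placeholder names (`iter_num`, `model`, `optimizer` are not necessarily the script's actual variables):

```python
import torch

SAVE_EVERY = 500  # iterations between checkpoints

# inside the training loop, after each optimizer step:
if iter_num % SAVE_EVERY == 0:
    torch.save(
        {
            "iter": iter_num,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        f"ckpt_{iter_num}.pt",
    )
```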

Where do I place the model? App.py is not working at the moment.

Traceback (most recent call last):
  File "/content/metavoice-src/app1.py", line 12, in <module>
    from fam.llm.sample import (
ModuleNotFoundError: No module named 'fam.llm.sample'