gitmylo / bark-voice-cloning-HuBERT-quantizer

The code for the bark voice cloning model. Training and inference.
MIT License

Fine-tune for a certain speaker #18

Closed renxiangnan closed 1 year ago

renxiangnan commented 1 year ago

Thanks for this great work. I am wondering, if I want to increase the quality of voice cloning for a certain speaker, is there a way to fine-tune the model? If yes, how should I do it? Thank you.

gitmylo commented 1 year ago

this project, as you might be able to see from the requirements.txt, does not use bark itself; it creates compatible speaker prompt files.

due to the way it is trained (which, as far as i know, is the only possible way to even do this step), training for a specific speaker alone is not worthwhile. technically it is possible, but your results could never be better than a model trained on multiple speakers, since the only way to train such a model is by training on the outputs of the previous model.

in short: no, you cannot fine-tune for a certain speaker. you are not training bark here; you are training a speech-recognition-style model to trick bark into continuing a voice that it did not generate.
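
The core step described above — mapping audio features to bark-compatible semantic tokens rather than touching bark itself — boils down to discrete quantization. Here is a minimal sketch of that step in pure NumPy, with made-up dimensions; the real project uses a trained HuBERT model and a learned tokenizer rather than a raw nearest-centroid lookup:

```python
import numpy as np

def quantize_features(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector to the ID of its nearest codebook entry.

    features: (T, D) array of per-frame audio features (e.g. HuBERT outputs).
    codebook: (K, D) array of learned centroids (K is 10000 for bark's
    semantic vocabulary, per the discussion in this thread).
    Returns a (T,) array of discrete token IDs bark can treat as semantic tokens.
    """
    # Squared Euclidean distance from every frame to every centroid.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# Toy example: 5 frames of 4-dim features, a 3-entry codebook.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(3, 4))
features = codebook[[2, 0, 0, 1, 2]] + 0.01 * rng.normal(size=(5, 4))
tokens = quantize_features(features, codebook)
print(tokens.tolist())  # recovers the centroid indices: [2, 0, 0, 1, 2]
```

The speaker prompt file then packages token sequences like this so bark continues the voice as if it had generated it.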

renxiangnan commented 1 year ago

I really appreciate your reply. I am very new to audio and speech processing, so thanks for your patience. I read the link you sent previously, which was very helpful, and I also tried to explore both Bark and AudioLM. Following up on your answer, I am trying to use their official model as a base model and fine-tune Bark on a certain speaker. It looks like the training scripts are not available on their official pages, and the Bark authors suggest following AudioLM for training on a custom dataset. I know this question is not related to this project, but it would be very helpful if you could provide me some guidance there. Thank you.

gitmylo commented 1 year ago

some people have managed to fine-tune bark, though they usually used the default HuBERT quantizer for it (only 500 clusters, as opposed to the 10000 from bark and this quantizer)
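
To make the 500-vs-10000 contrast concrete: a larger codebook lets the quantizer resolve finer acoustic detail at the cost of a harder prediction problem. A toy 1-D illustration (uniform codebooks rather than learned k-means centroids, which is a deliberate simplification):

```python
import numpy as np

def quantization_error(samples: np.ndarray, codebook_size: int) -> float:
    """Mean absolute error after snapping samples in [0, 1) to the nearest
    of `codebook_size` evenly spaced codebook values."""
    idx = np.clip(np.floor(samples * codebook_size), 0, codebook_size - 1)
    nearest = (idx + 0.5) / codebook_size
    return float(np.abs(samples - nearest).mean())

rng = np.random.default_rng(1)
samples = rng.uniform(size=10_000)
err_small = quantization_error(samples, 500)     # default HuBERT quantizer size
err_large = quantization_error(samples, 10_000)  # bark's semantic vocab size
print(err_small > err_large)  # True: the bigger codebook is far more precise
```

The expected error scales as 1/(4K), so the 10000-entry codebook is roughly 20x more precise here; real quantizers work in high-dimensional feature space, but the trade-off is the same.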

i'm not sure which projects they used for fine-tuning, but it may help to look at both AudioLM and NanoGPT. NanoGPT has training code available, and it is the language model used in bark for all three of the models involved in a generation (semantic, coarse, fine)
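
All three bark models share the same NanoGPT-style objective: predict token t+1 from tokens up to t under cross-entropy. A framework-free sketch of that loss (random logits stand in for the transformer's output; nothing here is bark's actual code):

```python
import numpy as np

def next_token_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Average cross-entropy of predicting tokens[1:] from logits[:-1].

    logits: (T, V) unnormalized scores, one row per input position.
    tokens: (T,) the token sequence (semantic, coarse, or fine IDs).
    """
    preds, targets = logits[:-1], tokens[1:]
    # Numerically stable log-softmax.
    shifted = preds - preds.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

rng = np.random.default_rng(0)
vocab = 16                      # toy size; bark's semantic vocab is 10000
tokens = rng.integers(0, vocab, size=8)
uniform = np.zeros((8, vocab))  # a model that knows nothing
print(round(next_token_loss(uniform, tokens), 4))  # log(16) ≈ 2.7726
```

A model with zero knowledge sits at log(vocab) loss; watching training loss drop below that baseline is the first sanity check in any NanoGPT-style run.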

renxiangnan commented 1 year ago

Thanks for your previous explanation and guidance. I have spent some effort exploring bark, nanoGPT, and AudioLM.

Based on my understanding (please correct me if I am wrong), Bark has a very similar architecture to AudioLM: instead of AudioLM's wav-to-semantic step, bark does text-to-semantic. I think the other two steps (i.e., semantic to coarse, and coarse to fine -> decode to wav) should be the same. However, could you please give me more guidance here? To train this nanoGPT transformer in bark for text-to-semantic generation, my thought is:

  1. the so-called "text to semantic transformer" in bark is actually a nanoGPT that takes input from a classic text-to-semantic model (for text to semantic) plus your HuBERT quantizer (for waveform to semantic),
  2. use an existing tool like encodec and this repo, with a given audio, to get the input embeddings, then follow the nanoGPT training steps,
  3. follow the training steps in AudioLM for coarse and fine generation. For all the training in bark, I can use the loss as described in AudioLM.
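
The text-to-semantic setup in step 1 can be pictured as a single packed token sequence: text tokens first, then the semantic tokens (from the HuBERT quantizer) that the GPT must learn to continue. A sketch of building such a training example — the vocabulary sizes, offset scheme, and separator token here are made up for illustration, not bark's actual values:

```python
TEXT_VOCAB = 100                    # hypothetical sizes; bark's are much larger
SEMANTIC_VOCAB = 50
SEP = TEXT_VOCAB + SEMANTIC_VOCAB   # hypothetical separator token

def pack_example(text_tokens, semantic_tokens):
    """Build (input, target) for next-token training.

    Semantic IDs are offset past the text vocabulary so both live in one
    combined vocabulary; in practice the loss would be masked so only the
    semantic region contributes.
    """
    semantic = [t + TEXT_VOCAB for t in semantic_tokens]
    seq = list(text_tokens) + [SEP] + semantic
    return seq[:-1], seq[1:]  # model predicts each next token

inp, tgt = pack_example([5, 9, 3], [0, 41, 7])
print(inp)  # [5, 9, 3, 150, 100, 141]
print(tgt)  # [9, 3, 150, 100, 141, 107]
```

At inference time, conditioning on a speaker prompt is the same trick: prepend the prompt's semantic tokens and let the model continue them.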

Thank you in advance; I appreciate your answer and help.

gitmylo commented 1 year ago

AudioLM itself is able to generate audio from text. It's not limited to semantic to audio. (just like Bark)

  1. correct
  2. encodec's codebooks are very different from semantic tokens: they are not flat, and they work very differently
  3. you can use the quantizer outputs for calculating loss during training
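
On point 2: encodec tokens come from residual vector quantization — several stacked codebooks where each stage quantizes whatever error the previous stage left behind — which is why they are not "flat" like a single semantic codebook. A toy RVQ sketch with random codebooks and tiny dimensions:

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list) -> list:
    """Residual vector quantization: each stage encodes the leftover error.

    x: (D,) vector. codebooks: list of (K, D) arrays, one per stage.
    Returns one token per stage; later tokens only make sense given the
    earlier ones, unlike independent lookups in a flat codebook.
    """
    residual, tokens = x.copy(), []
    for cb in codebooks:
        idx = int(((residual - cb) ** 2).sum(axis=1).argmin())
        tokens.append(idx)
        residual = residual - cb[idx]  # next stage sees only the error
    return tokens

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]  # 3 stages, 8 entries each
x = rng.normal(size=4)
tokens = rvq_encode(x, codebooks)
print(len(tokens))  # one token per stage: 3
```

This staged structure is what the "coarse" and "fine" bark models mirror: coarse predicts the first codebook levels, fine fills in the rest.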

https://github.com/serp-ai/bark-with-voice-clone already implements bark fine-tuning now though, i would look at that.

renxiangnan commented 1 year ago

Thank you so much, this is really helpful!

renxiangnan commented 1 year ago

Hi Mylo, I spent some effort testing and exploring this repo https://github.com/serp-ai/bark-with-voice-clone — thank you, it is really helpful. I tried to post my questions there but got no response from the authors. I am wondering if you are experienced enough to give me some guidance here; I appreciate your help.

I am able to fine-tune the speaker's voice; 5-10 minutes of recordings gives me a noticeable improvement in terms of prosody and intonation. However, I also tried to retrain the bark model from scratch myself: I trained the semantic GPT on LibriTTS for 400K steps, optimizing all parameters of the model and reducing the LoRA rank to 0.

However, with the retrained semantic GPT plus the pretrained coarse and fine GPTs, the model is unable to produce any intelligible speech, just pure noise. I also tried changing the hyperparameters a bit, but it was not helpful. Do you have any suggestions? Thanks again for your valuable time.
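
Not a diagnosis, but one cheap sanity check worth running before blaming training itself: confirm that the retrained semantic model emits token IDs in exactly the range the pretrained coarse model expects, since an offset or vocabulary mismatch between stages typically comes out as pure noise. A generic sketch — the vocabulary size follows the 10000 figure mentioned earlier in this thread, and the function name is hypothetical:

```python
import numpy as np

SEMANTIC_VOCAB_SIZE = 10_000  # per the quantizer discussion above

def check_semantic_tokens(tokens: np.ndarray) -> None:
    """Fail loudly if generated semantic tokens can't be what the
    pretrained coarse model was conditioned on."""
    assert tokens.min() >= 0 and tokens.max() < SEMANTIC_VOCAB_SIZE, (
        f"token range [{tokens.min()}, {tokens.max()}] outside the vocab"
    )
    # A healthy generation uses many distinct tokens; near-constant output
    # usually means the semantic model collapsed during training.
    assert len(np.unique(tokens)) > 10, "suspiciously few distinct tokens"

check_semantic_tokens(np.random.default_rng(0).integers(0, 10_000, size=256))
print("ok")
```

If the ranges check out, listening to the coarse model driven by ground-truth semantic tokens (from the quantizer, not the retrained GPT) would isolate which stage is at fault.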


gitmylo commented 1 year ago

Speaker fine-tuning is hit or miss with bark. As far as I know it's mostly up to luck, and you must have a really good dataset.

renxiangnan commented 1 year ago

Thank you Mylo for this reply, I understand better now.
