serp-ai / bark-with-voice-clone

🔊 Text-prompted Generative Audio Model - With the ability to clone voices
https://serp.ai/tools/bark-text-to-speech-ai-voice-clone-app

Inconsistent generation #10

Open sw4rm3r opened 1 year ago

sw4rm3r commented 1 year ago

Hello, I successfully cloned my voice, but the results are pretty inconsistent. I tried cloning with samples of 2 seconds, 3 seconds, 5, and even 7, but nothing seems to work. To explain better: after I cloned my voice, when I try to generate an audio file one of these things happens:

  1. The audio is understandable, but it's definitely not my voice
  2. I get random noises like high whistles, music, or buzzing sounds
  3. I hear something very close to my voice, but it just emits long "hmmmmm..."-like sounds, no matter what I write

What are your experiences with cloning voices? Are there parameters we can set, or specific phrases, that would help the voice cloning process work better?

I think we should create a space like a subreddit or a Discord to share our prompts and experiences in order to refine the voice cloning process.

Pathos14489 commented 1 year ago

I'm also having all of these problems, no matter what length of dataset I try. The voice with the fewest problems uses way more than what most people recommend, at about 2 minutes of audio and text, but it's still not at all close.

C0untFloyd commented 1 year ago

AFAIK the voice cloning is more or less guesswork so far and it's missing the secret sauce. I also have all the issues from above, and on my low-end GPU, voice cloning is faster than generating audio. I believe this can't be right, and the results are more or less gibberish, even for totally different audio clip lengths. I'm a beginner in the ML game, but I assume there need to be more training epochs or something similar. Perhaps a more experienced person could take a look through the VALL-E papers, as this software is at least inspired by it.

CarlKenner commented 1 year ago

AFAIK the voice cloning is more or less guesswork so far and it's missing the secret sauce.

I agree.

I'm a beginner in the ML game, but I assume there need to be more training epochs or something similar.

There are zero training epochs. The voices aren't neural networks. Voices are just "prompt history": they encode the previous sentence and how it was pronounced. The Bark neural networks work by taking the meaning and sounds of the previous sentence, along with the text of the current sentence, and predicting the sounds of that current sentence.

The voice file is a .npz file, which is a zip archive of three numpy arrays. The three arrays are the output of the "text" model (the largest model, which does most of the work) and the outputs of the "coarse" and "fine" models, which generate the actual sounds. It's trivial to convert a sound into coarse and fine outputs because Facebook has an EnCodec model to do exactly that.
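As a quick sanity check, you can open a speaker file and look at those arrays yourself. A minimal sketch, assuming the key names used by the official speaker prompts (a fork may name them differently):

    import numpy as np

    # Inspect a Bark voice file; keys assumed to follow the official
    # speaker prompts: semantic_prompt, coarse_prompt, fine_prompt.
    voice = np.load("my_voice.npz")
    for name in voice.files:
        print(name, voice[name].shape, voice[name].dtype)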

The problem is that a large part of the voice's characteristics come from the outputs of the "text" model, and we have no way of getting the correct "text" outputs to match the voice we are trying to clone. For example, notice how if you use a female voice but put "MAN: " in the text, it will switch to a man's voice. That tells you that a lot of the characteristics of the voice are coming from the "text" model.

Currently, the "cloning" code just generates the "text" model outputs by running the text through the model without specifying a voice. That means none of the voice's characteristics are included in the text model outputs.
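In code, that stage looks roughly like the sketch below, assuming Bark's internal generate_text_semantic helper (the actual cloning script may differ in the details):

    from bark.generation import generate_text_semantic

    # Sketch of the "text" stage of the current cloning code.
    # history_prompt=None is the crux: nothing about the target speaker
    # conditions these semantic tokens.
    semantic_tokens = generate_text_semantic(
        "Transcript of the clip being cloned.",
        history_prompt=None,
        temp=0.7,
    )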

Ideally, we need to run the model in reverse to generate the outputs of the "text" model based on our sound file.

Another option would be to start with a voice that sounds similar, or is at least the same gender, and use that to generate the "text" values.

Or we could add a lot of randomness to the "text" model, generate a hundred variations of the voice, and test which (if any) produce a voice that sounds similar to the target.
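A rough sketch of that search, scoring candidates with Resemblyzer speaker embeddings purely as an example (any speaker-verification embedding would do; the file name, test sentence, and temperatures are placeholders):

    import numpy as np
    from bark import SAMPLE_RATE, generate_audio
    from resemblyzer import VoiceEncoder, preprocess_wav

    encoder = VoiceEncoder()
    target = encoder.embed_utterance(preprocess_wav("target_voice.wav"))

    best_score, best_audio = -1.0, None
    for _ in range(100):
        # High temperatures supply the randomness; each call samples a new take.
        audio = generate_audio("A short test sentence.",
                               text_temp=1.0, waveform_temp=1.0)
        cand = encoder.embed_utterance(preprocess_wav(audio, source_sr=SAMPLE_RATE))
        score = float(np.dot(target, cand))  # embeddings are L2-normalized
        if score > best_score:
            best_score, best_audio = score, audio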

It's also possible to use any speech produced by Bark as a voice. So if you can get a voice that intermittently sounds like the target, you can take one of the successful generations and use it as the new voice (provided you saved the intermediate outputs when you generated it). That process could be repeated indefinitely.
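Saving those intermediate outputs could look like this sketch, assuming generate_audio supports an output_full=True flag that returns the prompt arrays alongside the waveform:

    import numpy as np
    from bark import generate_audio

    # Keep the intermediate prompts of a good take so it can serve as a voice.
    full_generation, audio_array = generate_audio(
        "Some text spoken in the candidate voice.",
        history_prompt="current_best_voice.npz",  # placeholder path
        output_full=True,
    )
    np.savez("new_voice.npz", **full_generation)  # reusable as history_prompt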

francislabountyjr commented 1 year ago

We are working on finetuning; agreed that the current model seems a bit unstable. Finetuning on a single voice/music/whatever should produce more consistent results. It should also be pretty easy to introduce LoRA into the training for faster results while using less memory.
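For illustration only, wiring LoRA adapters into one of Bark's GPT-style models with the peft library might look like this; base_model and the target_modules names are assumptions, not Bark's actual layer names:

    from peft import LoraConfig, get_peft_model

    # Hypothetical LoRA setup; rank/alpha and module names are guesses.
    lora_config = LoraConfig(
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["c_attn"],  # assumed attention projection name
    )
    model = get_peft_model(base_model, lora_config)
    model.print_trainable_parameters()  # only the adapter weights train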

C0untFloyd commented 1 year ago

@CarlKenner thanks for the good explanation

I'm seeing talk about LoRA here, which I previously only knew from Stable Diffusion. To get a better understanding: can the .pt models from Bark be compared to a checkpoint from SD, and would the .npz files be something like the textual inversion, LoRA, or hypernetwork files? That would mean the model from Suno would need to have similar-sounding voices to be successfully trained, wouldn't it?

Another (perhaps wrong) idea: what about the audio format? The resulting audio from Bark is mono, 32-bit float, at a 24000 Hz sample rate, while the classifier works at a 16000 Hz sample rate. If I feed it a standard Windows stereo PCM WAV at 44100 Hz, 16-bit, will this be converted to the correct format? I didn't see a discussion about this anywhere...

CarlKenner commented 1 year ago

To get a better understanding: can the .pt models from Bark be compared to a checkpoint from SD

Yes.

and would the .npz files be something like the textual inversion, LoRA, or hypernetwork files?

No. But of those, textual inversion would be the closest. The .npz files are more like the image that you submit to the img2img mode of Stable Diffusion.

But Bark is less like Stable Diffusion and more like ChatGPT (and even more like the older non-chat GPT-3 version that did text completion). The .npz file is just the previous sentence plus some codes the model used to generate the previous sentence.

That would mean the model from Suno would need to have similar-sounding voices to be successfully trained, wouldn't it?

If you train it with LoRA, maybe, or maybe not. If you train it with finetuning, no. If you're talking about the current voice cloning system, which has nothing to do with training the model, then yes.

If I feed it a standard Windows stereo PCM WAV at 44100 Hz, 16-bit, will this be converted to the correct format?

Yes, I think so.
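If you'd rather not rely on any automatic conversion, doing it explicitly is cheap; a sketch with torchaudio, using placeholder file names:

    import torchaudio

    # Convert a 44100 Hz stereo 16-bit PCM WAV to Bark's expected format:
    # mono, float32, 24000 Hz (torchaudio.load already yields float32).
    wav, sr = torchaudio.load("input_44100_stereo.wav")
    wav = wav.mean(dim=0, keepdim=True)                   # stereo -> mono
    wav = torchaudio.functional.resample(wav, sr, 24000)  # to 24 kHz
    torchaudio.save("input_24000_mono.wav", wav, 24000)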

gab-luz commented 1 year ago

I could generate voices, but none of them sound like the ones used in Suno's Bark Hugging Face Space, none of them. That's pretty disappointing, and I don't think this repo is to blame, because the original repo gives me the same results. I really think there must be something missing; the Bark Space is probably using different stuff. I've even checked the script, and it's not different. So why can't we have consistent generation? Well, probably only Suno's crew knows.

C0untFloyd commented 1 year ago

because the original repo also gives me the same results

There is no original repo with voice cloning, or is there?

Has anyone tried using longer audio clips for cloning, e.g. 90+ seconds? YourTTS seems to use that as a good timespan, and this HF space mixed it with Bark: https://huggingface.co/spaces/kevinwang676/Bark-with-Voice-Cloning

gab-luz commented 1 year ago

OK, I'll try that. I'm creating a voice assistant with some pre-made audio files, since Bark consumes too much VRAM, and even with a 16 GB GPU it takes too long to generate, sadly. I hope they can optimize it over time, because this is undoubtedly the best TTS so far for me.

faraday commented 1 year ago

@C0untFloyd but isn't that space https://huggingface.co/spaces/kevinwang676/Bark-with-Voice-Cloning failing with: 'NoneType' object is not subscriptable

gab-luz commented 1 year ago

@C0untFloyd but isn't that space https://huggingface.co/spaces/kevinwang676/Bark-with-Voice-Cloning failing with: 'NoneType' object is not subscriptable

Same for me.

C0untFloyd commented 1 year ago

That space isn't by me, nor do I know the guy running it. I just tried very briefly to clone audio with YourTTS on that page, which did work 🤷‍♂️ I do have a UI version for cloning locally though: https://github.com/C0untFloyd/bark-gui

wolfgangmeyers commented 1 year ago

Some of the audio I generated after cloning my voice sounded just like me, but it's inconsistent. Generating with en_speaker_2 gives pretty consistent results. If someone comes up with a way to get more consistent output from a voice clone I'd love to try it out.

Yusuf-YENICERI commented 1 year ago

I tried it for Turkish, and the result is: I can say the sound is kinda similar, but what it says is irrelevant (it doesn't make any sense at all). Maybe it's a language problem. I'm not sure.

Lebski commented 1 year ago

It's also possible to use any speech produced by Bark as a voice. So if you can get a voice that intermittently sounds like the target, you can take one of the successful generations and use it as the new voice @CarlKenner

How would that work?

    import numpy as np
    from bark import generate_audio

    audio_array = generate_audio(text_prompt, history_prompt=old_history_path, text_temp=0.7, waveform_temp=0.7)

    # save as the new history prompt
    np.savez(new_history_path, audio=audio_array)

Like this? And in the next iteration, I use the new_history_path instead of the old_history_path?

C0untFloyd commented 1 year ago

I tried it for Turkish, and the result is: I can say the sound is kinda similar, but what it says is irrelevant (it doesn't make any sense at all). Maybe it's a language problem. I'm not sure.

You need to use a Turkish tokenizer to create something other than gibberish; the one used here is probably English. You can train one yourself; start here.

Like this? And in the next iteration, I use the new_history_path instead of the old_history_path?

Yes, that's the old way of approximating the voice, from before the new cloning method was created.

Lebski commented 1 year ago

new cloning method

~With "new" do you mean the RVC?~

What do you mean by new cloning method?

C0untFloyd commented 1 year ago

Oh wow, I didn't follow the development of this fork that closely, so I missed that it's now possible to use RVC. What I meant by the new cloning method is the one used in the current Colab notebook; I linked the original repo above in my comment about training the Turkish language. Before that guy created the new method, cloning voices resulted in barely usable stuff.