serp-ai / bark-with-voice-clone

🔊 Text-prompted Generative Audio Model - With the ability to clone voices
https://serp.ai/tools/bark-text-to-speech-ai-voice-clone-app
3.11k stars 415 forks

Generation failed with clone_voice.ipynb #6

Open ryoppippi opened 1 year ago

ryoppippi commented 1 year ago

I cloned a voice like this:

import numpy as np
voice_name = 'ryoppippi' # whatever you want the name of the voice to be
output_path = 'bark/assets/prompts/' + voice_name + '.npz'
np.savez(output_path, fine_prompt=codes, coarse_prompt=codes[:2, :], semantic_prompt=semantic_tokens)
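For what it's worth, the conditions that the assert in `generate_coarse()` later enforces can be checked at save time. Below is a hypothetical helper (not part of the repo) with the relevant constants copied from `bark/generation.py` — double-check them against your checkout:

```python
import numpy as np

# Constants copied from bark/generation.py (verify against your version).
SEMANTIC_VOCAB_SIZE = 10_000
CODEBOOK_SIZE = 1024
N_COARSE_CODEBOOKS = 2
COARSE_RATE_HZ = 75
SEMANTIC_RATE_HZ = 49.9


def check_prompt(semantic_tokens, coarse_codes):
    """Mirror the checks in generate_coarse() so a bad .npz fails early."""
    assert semantic_tokens.ndim == 1 and len(semantic_tokens) > 0
    assert semantic_tokens.min() >= 0
    assert semantic_tokens.max() <= SEMANTIC_VOCAB_SIZE - 1
    assert coarse_codes.ndim == 2 and coarse_codes.shape[0] == N_COARSE_CODEBOOKS
    assert coarse_codes.min() >= 0 and coarse_codes.max() <= CODEBOOK_SIZE - 1
    # ~1.503 coarse frames per semantic token, per codebook
    ratio = COARSE_RATE_HZ / SEMANTIC_RATE_HZ
    assert round(coarse_codes.shape[-1] / len(semantic_tokens), 1) == round(ratio, 1)
```

Running this on `semantic_tokens` and `codes[:2, :]` before `np.savez` would surface the same failure earlier, with a clearer traceback.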

Then I tried to generate audio with my voice, following the notebook, like this:

from bark.api import generate_audio
from bark.generation import SAMPLE_RATE
text_prompt = "Hello, my name is Suno. And, uh — and I like pizza. [laughs]"
voice_name = "ryoppippi" # use your custom voice name here if you have one

# simple generation
audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

And it raises an error:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[18], line 2
      1 # simple generation
----> 2 audio_array = generate_audio(text_prompt, history_prompt=voice_name, text_temp=0.7, waveform_temp=0.7)

File /workspace/bark-with-voice-clone/bark/api.py:78, in generate_audio(text, history_prompt, text_temp, waveform_temp)
     66 """Generate audio array from input text.
     67 
     68 Args:
   (...)
     75     numpy audio array at sample frequency 24khz
     76 """
     77 x_semantic = text_to_semantic(text, history_prompt=history_prompt, temp=text_temp)
---> 78 audio_arr = semantic_to_waveform(x_semantic, history_prompt=history_prompt, temp=waveform_temp)
     79 return audio_arr

File /workspace/bark-with-voice-clone/bark/api.py:46, in semantic_to_waveform(semantic_tokens, history_prompt, temp)
     31 def semantic_to_waveform(
     32     semantic_tokens: np.ndarray,
     33     history_prompt: Optional[str] = None,
     34     temp: float = 0.7,
     35 ):
     36     """Generate audio array from semantic input.
     37 
     38     Args:
   (...)
     44         numpy audio array at sample frequency 24khz
     45     """
---> 46     x_coarse_gen = generate_coarse(
     47         semantic_tokens,
     48         history_prompt=history_prompt,
     49         temp=temp,
     50     )
     51     x_fine_gen = generate_fine(
     52         x_coarse_gen,
     53         history_prompt=history_prompt,
     54         temp=0.5,
     55     )
     56     audio_arr = codec_decode(x_fine_gen)

File /workspace/bark-with-voice-clone/bark/generation.py:477, in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, use_gpu, silent, max_coarse_history, sliding_window_len, model)
    475 x_semantic_history = x_history["semantic_prompt"]
    476 x_coarse_history = x_history["coarse_prompt"]
--> 477 assert (
    478     isinstance(x_semantic_history, np.ndarray)
    479     and len(x_semantic_history.shape) == 1
    480     and len(x_semantic_history) > 0
    481     and x_semantic_history.min() >= 0
    482     and x_semantic_history.max() <= SEMANTIC_VOCAB_SIZE - 1
    483     and isinstance(x_coarse_history, np.ndarray)
    484     and len(x_coarse_history.shape) == 2
    485     and x_coarse_history.shape[0] == N_COARSE_CODEBOOKS
    486     and x_coarse_history.shape[-1] >= 0
    487     and x_coarse_history.min() >= 0
    488     and x_coarse_history.max() <= CODEBOOK_SIZE - 1
    489     and (
    490         round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
    491         == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)
    492     )
    493 )
    494 x_coarse_history = _flatten_codebooks(x_coarse_history) + SEMANTIC_VOCAB_SIZE
    495 # trim histories correctly

AssertionError: 

I'm not sure what's happening.
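To see which clause of that assert is failing, it can help to dump the shapes and value ranges of the saved prompt. A small diagnostic sketch (the path in the example is hypothetical — point it at your own file):

```python
import numpy as np


def inspect_prompt(path):
    """Print and return (shape, dtype, min, max) for each array in a Bark prompt .npz."""
    data = np.load(path)
    info = {}
    for key in ("semantic_prompt", "coarse_prompt", "fine_prompt"):
        arr = data[key]
        info[key] = (arr.shape, str(arr.dtype), int(arr.min()), int(arr.max()))
        print(key, *info[key])
    return info


# Example: inspect_prompt("bark/assets/prompts/ryoppippi.npz")
```

The assert expects `semantic_prompt` to be 1-D and `coarse_prompt` to be 2-D with exactly 2 rows, with the coarse length about 1.5× the semantic length.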

ryoppippi commented 1 year ago

audio.webm

Here is my audio file (sorry for my bad pronunciation :( )

darkpanther99 commented 1 year ago

Same issue for me, but with using a basic voice name as well.

darkpanther99 commented 1 year ago

I have found the solution: for the voice name, you should specify the full path to the voice file, not only its name. For me: `voice_name = 'bark/bark/assets/prompts/' + 'karoly' + '.npz'`
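If you go this route, building the path with `os.path.join` is less error-prone than string concatenation (`'karoly'` here is just the example voice name above):

```python
import os

# Full path to the prompt file, instead of passing a bare voice name.
voice_name = os.path.join("bark", "bark", "assets", "prompts", "karoly" + ".npz")
```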

Jxspa commented 1 year ago

I have the same issue. The above changes nothing for me. Some files work while others throw the error, even though they're recorded the same way and are the same length.

When it does work, the results are not very similar to the target at all.

NgHanWei commented 1 year ago

The error likely stems from line 499 in generation.py: `round(x_coarse_history.shape[-1] / len(x_semantic_history), 1) == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)`
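That clause requires the coarse history to have roughly 1.5 frames per semantic token (75 Hz coarse vs. 49.9 Hz semantic, per the rate constants in `bark/generation.py`). A quick sketch of the length the assert expects, using those constants:

```python
# Rates copied from bark/generation.py; the per-codebook ratio is ~1.503.
COARSE_RATE_HZ = 75
SEMANTIC_RATE_HZ = 49.9


def expected_coarse_frames(n_semantic_tokens):
    """Coarse frames (per codebook) the assert expects for a given semantic length."""
    return int(round(n_semantic_tokens * COARSE_RATE_HZ / SEMANTIC_RATE_HZ))
```

If your saved `coarse_prompt` is much shorter or longer than this for your `semantic_prompt` length, the assertion fires.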

ryoppippi commented 1 year ago

@darkpanther99 My npz file is successfully stored in bark/assets/prompts/, so the path specification doesn't solve my issue. But thanks for the suggestion!

ryoppippi commented 1 year ago

I think this is Bark's problem, because this repo only adds an interface on top of Bark.

Fortyseven commented 1 year ago

As a temporary fix, you can just yank the assert block. Clearly that's not ideal, since that means something is wonky, but at least things start working again.

permissionBRICK commented 1 year ago

I commented out the last `and` clause in the assert, and it seems to be working for now.

NgHanWei commented 1 year ago

Adding 'MAN: ' or 'WOMAN: ' in the text transcript helped to fix the problem for me.

ryoppippi commented 1 year ago

Thanks. It works when commenting out the assert block.

ryoppippi commented 1 year ago

The result is hmm...

troed commented 1 year ago

I've run this without any changes (both from the notebook as well as from a standalone Python file) and it runs without errors. The voice file sounds nothing like me though ;) I've used the same source wav to train Tortoise-TTS, which sounds amazing.

I assume something's going wrong, but there's no error output and all steps complete as they should.

loboere commented 1 year ago

it works but the results are nothing like the target voice

ThereforeGames commented 1 year ago

Are you sure you're using the bark files included with this repo, not the standard files from the pip install command? You might need to replace the files by hand.

> it works but the results are nothing like the target voice

Agreed, it barely seems to be influenced by the input samples at all. Wonder if I'm missing something.

Jxspa commented 1 year ago

> Agreed, it barely seems to be influenced by the input samples at all. Wonder if I'm missing something.

I've seen ~20 people all saying the same thing. I hope we are missing something but I doubt it.

ThereforeGames commented 1 year ago

The default parameters for generate_text_semantic() might be a bit off. I tried raising the temperature from 0.7 to 0.95 which seemed to help. May need to experiment with top_k and top_p as well.
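For reference, here is a sketch of passing those sampler settings through. The values are just a starting point to experiment with, and `generate_fn` stands in for `bark.generation.generate_text_semantic` (which does accept `temp`, `top_k`, and `top_p`) so the snippet carries no Bark dependency:

```python
def generate_semantic_tuned(generate_fn, text, history_prompt=None,
                            temp=0.95, top_k=50, top_p=0.95):
    """Call a generate_text_semantic-style function with tweaked sampling params.

    temp raised from the default 0.7 to 0.95; top_k/top_p are guesses to tune.
    """
    return generate_fn(text, history_prompt=history_prompt,
                       temp=temp, top_k=top_k, top_p=top_p)
```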

pomela0516 commented 1 year ago

According to the README, it currently seems that this is not allowed unless the npz is provided by Suno. [screenshot of the README]

ThereforeGames commented 1 year ago

> According to the README, it currently seems that this is not allowed unless the npz is provided by Suno.

This repo patches in voice cloning unofficially.

[screenshot]

G-force78 commented 1 year ago

When I run it, the cell runs but nothing happens... I guess I'm missing something somewhere. Oh well, back to Tortoise.

ricardojuerge735 commented 1 year ago

Surprisingly I managed to get all of this up and running (have literally zero experience with all this). But the output audio of a voice clone has absolutely nothing to do with the input voice. Something is not quite right.

ryoppippi commented 1 year ago

@troed Cool. Could you tell us how you did it?

troed commented 1 year ago

@ryoppippi Getting it to run? I just opened the notebook in Jupyter and (1) ran it as-is, and (2) copied it out to its own .py and ran it from the CLI.

But just like everybody else, my results are... bad. The generated voice sounds nothing like me (my example). Now, granted, there's an update to the repo saying that the results are better with very short samples (like 2-3 s). I haven't tried that, but I also can't really understand how that would work.

ryoppippi commented 1 year ago

@troed Oh, I see. Then what you did is the same as what I did.

I am Japanese, and I tried to generate a voice using the Japanese model provided by Suno. The generated voice contained a lot of noise. We may not be able to expect results as good as we hope. Maybe there are some things we can contribute to the original implementation. We shall see.

Ph0rk0z commented 1 year ago

I tried cloning from a 20-second audio file. In Tortoise this works great. I was hit with the assertion error and had to comment it out. Adding "MAN:" or whatever to the prompt didn't work in any case. I also had to turn the notebooks into scripts, because I am not running this on Colab or some other service.

The 20-second voice file was 160k.

The 8-second one was 55k, more in line with the other voices.

The process of generating a wav took up about 5 GB of VRAM and 3 minutes.

I cloned a male voice and got a woman. I did it twice... the first sample sounded like a female version of the person. Weird.

binarycache commented 1 year ago

For me, the solution was to limit the recording to 4 seconds. However, as others have said, the cloning doesn't work at all. I tested it for both English and Hindi, and it fails miserably.

francislabountyjr commented 1 year ago

I think it's going to take finetuning to get a consistent voice clone out of the current models. We are working on that now!

giorgionetg commented 1 year ago

I don't have enough memory... it's not possible to train with only 4 GB of VRAM... is there a way?

Hit1ron commented 1 year ago

> I think it's going to take finetuning to get a consistent voice clone out of the current models. We are working on that now!

@francislabountyjr Hi! Is there any progress on this matter?

abhinavasr commented 1 year ago

You need an audio/text pair of less than 7 seconds.
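Several comments in this thread converge on short clips (4-7 seconds). A trivial way to clip a loaded waveform before extracting prompts — pure NumPy, with the 7-second cap taken from the comment above:

```python
import numpy as np


def trim_audio(wav, sample_rate, max_seconds=7.0):
    """Clip a mono waveform array to at most max_seconds."""
    max_samples = int(sample_rate * max_seconds)
    return wav[:max_samples]
```

For example, a clip loaded at Bark's 24 kHz output rate would be cut to at most 7 × 24000 samples; shorter clips pass through unchanged.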