suno-ai / bark

🔊 Text-Prompted Generative Audio Model
MIT License

How to load NPZ file of my voice ? #379

Open RageshAntony opened 1 year ago

RageshAntony commented 1 year ago

I created an NPZ file via this site: https://huggingface.co/spaces/fffiloni/clone-voice-for-bark

Then I put it in /assets/prompts/v2/ as ragesh.npz

Then I loaded it like this

audio_array = generate_audio(text_prompt, history_prompt="v2/ragesh")

But I get

ValueError: history prompt not found

Then I tried audio_array = generate_audio(text_prompt, history_prompt="/path/to/../v2/ragesh") and still got the same error

Then I tried like audio_array = generate_audio(text_prompt, history_prompt="/path/to/../v2/ragesh.npz")

Then I got

100%|██████████| 471/471 [00:06<00:00, 75.90it/s]
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-7-defec2c3955c> in <cell line: 13>()
     11      But I also have other interests such as playing tic tac toe.
     12 """
---> 13 audio_array = generate_audio(text_prompt, history_prompt="/content/bark/bark/assets/prompts/v2/ragesh.npz")
     14 
     15 # save audio to disk

2 frames
/usr/local/lib/python3.10/dist-packages/bark/generation.py in generate_coarse(x_semantic, history_prompt, temp, top_k, top_p, silent, max_coarse_history, sliding_window_len, use_kv_caching)
    569             and x_coarse_history.max() <= CODEBOOK_SIZE - 1
    570             and (
--> 571                 round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
    572                 == round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)
    573             )

AssertionError:

Please help me to load NPZ file of my voice

cybershrapnel commented 1 year ago

This is easy. Open your generation.py file. You will see it expects the file name to be a language prefix, an underscore, and then a number n. You need to alter the range of the loop so that it goes up to 11 instead of 10, then name your new npz file en_speaker_10.npz (or 11 if you have more, etc.) and adjust the range accordingly. Your issue is that you aren't telling your script where the npz file is.

Here is an example of my code change. The only change needed is where it says range(11), and then naming your file correctly. If you want to call it something else, code that here.

This starts at line 74 of generation.py, in my version at least:

ALLOWED_PROMPTS = {"announcer"}
for _, lang in SUPPORTED_LANGS:
    for prefix in ("", f"v2{os.path.sep}"):
        for n in range(11):
            ALLOWED_PROMPTS.add(f"{prefix}{lang}_speaker_{n}")

RahulBhalley commented 1 year ago

Hey @RageshAntony!

Sorry, I don't have an answer for you (I just started exploring this repo today). I wanted to know what code is behind https://huggingface.co/spaces/fffiloni/clone-voice-for-bark?

Best, Rahul

cybershrapnel commented 1 year ago

also, i tried that npz generator, I have not been able to produce a working npz with it. different errors every time... it generates the npz but they don't work...

cybershrapnel commented 1 year ago

A little update: I followed the API in that link to this endpoint, https://fffiloni-clone-voice-for-bark.hf.space/, and it did generate a working npz, and it took a lot longer. I think the other one isn't passing the audio file? But it still didn't really work. It was very garbled at the beginning, and then still sounded like voice 6 but deeper.

RageshAntony commented 1 year ago

and it did generate a working npz, and it took a lot longer. I think the other one isn't passing the audio file.??

How did you use it? I renamed as "en_speaker_10.npz" and loaded as "v2/en_speaker_10", but still get "history prompt not found" error

RahulBhalley commented 1 year ago

I renamed as "en_speaker_10.npz" and loaded as "v2/en_speaker_10", but still get "history prompt not found" error

Same issue with me.

RahulBhalley commented 1 year ago

Ahh, the issue is in lines 71 to 75 of generation.py. The ALLOWED_PROMPTS set restricts the speaker names, so ours is not included in it and ValueError("history prompt not found") is raised.
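
A minimal sketch of the name check described in this comment, under stated assumptions: the language list is a small illustrative subset, the default range is taken to be range(10) as the thread implies, and is_known_prompt is a hypothetical helper, not a Bark function. The set is built the same way as the snippet quoted in this thread.

```python
import os

# Assumption: a small subset of language codes, for illustration only.
SUPPORTED_LANG_CODES = ["en", "de", "it"]

# "announcer" plus "<lang>_speaker_<n>" for n in 0..9,
# each with and without a "v2" directory prefix.
ALLOWED_PROMPTS = {"announcer"}
for lang in SUPPORTED_LANG_CODES:
    for prefix in ("", f"v2{os.path.sep}"):
        for n in range(10):
            ALLOWED_PROMPTS.add(f"{prefix}{lang}_speaker_{n}")


def is_known_prompt(name: str) -> bool:
    """Hypothetical helper: True if `name` is one of the built-in preset names."""
    # Convert "/" to the platform separator so "v2/en_speaker_9" also matches on Windows.
    normalised = os.path.join(*name.split("/"))
    return normalised in ALLOWED_PROMPTS


print(is_known_prompt("v2/en_speaker_9"))  # a built-in name
print(is_known_prompt("v2/ragesh"))        # a custom name, absent from the set
```

Under a check like this, a custom name such as "v2/ragesh" can never match, no matter where the file sits on disk, unless the set itself is extended.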

RahulBhalley commented 1 year ago

The voice cloning is not working. :(

RageshAntony commented 1 year ago

@RahulBhalley I deleted that IF block. But now I get an assertion error.

Maybe there is some issue with the generated NPZ, or Bark does not support it.

cybershrapnel commented 1 year ago

u two didn't listen to a word I said. You need to edit the generation.py to allow the array to goto 11 ffs otherwise rename your npz file as number 9 or 8 if you don't know how to edit an array. But you shouldn't be editing this py file if you don't know how to read an array..

cybershrapnel commented 1 year ago

This was the best result I could get https://github.com/suno-ai/bark/assets/17352697/53e6844d-4931-48ae-94b4-5303a5d5327a

cybershrapnel commented 1 year ago

It played a lot of music and was not the voice it was supposed to be, even when prompted to be male, so... I dunno.

cybershrapnel commented 1 year ago

However, that does mean the npz generator is working. I suspect it is just very picky, i.e., you need to speak more clearly, remove background noise, and maybe say a specific phrase instead of a generic random one.

cybershrapnel commented 1 year ago

if you want to follow my progress on this I've been actively working on this for a bit now :P

https://www.xtdevelopment.net/audio/

cybershrapnel commented 1 year ago

OK, I got a better result this time, but it still makes a lot of music noises. It was my voice I used to make the npz file, and I prompted bark with [man] in the text, and it still sounds like voice 6 female... So I dunno... https://github.com/suno-ai/bark/assets/17352697/bf6ab70e-4f65-4bcf-a027-f468b3729514

RageshAntony commented 1 year ago

@cybershrapnel I already did that

added till 15

ALLOWED_PROMPTS = {"announcer"}
for _, lang in SUPPORTED_LANGS:
    for prefix in ("", f"v2{os.path.sep}"):
        for n in range(15):
            ALLOWED_PROMPTS.add(f"{prefix}{lang}_speaker_{n}")

print(ALLOWED_PROMPTS)

The output includes v2/en_speaker_10 as well.

Attachment: en_speaker_10.npz.zip

But I still got "history prompt not found".

Only then did I remove the IF block that raises the error.

Then I got the assertion error.

I attached the NPZ file for reference

cc: @RahulBhalley

cybershrapnel commented 1 year ago

thats wrong

cybershrapnel commented 1 year ago

You don't need to edit your generation.py file if you don't understand it. Set it back the way it was, rename your file to en_speaker_9.npz, do not change generation.py, and call it in your script like this:

# Set up sample rate
SAMPLE_RATE = 22050
HISTORY_PROMPT = "en_speaker_9"
SPEAKER = HISTORY_PROMPT

Or, if you really want it in the v2 folder:

# Set up sample rate
SAMPLE_RATE = 22050
HISTORY_PROMPT = "v2/en_speaker_9"
SPEAKER = HISTORY_PROMPT

The v2 part is not important; that's just the directory.

RahulBhalley commented 1 year ago

If you don’t care about using other speakers already present, simply use any of those names.

My issue was that voice didn’t sound like it should have.

cybershrapnel commented 1 year ago

same, the npz is not correct or the models don't support it, not sure which

RahulBhalley commented 1 year ago

I think the correct npz file is being loaded. And the model must support every voice if it's trained on a humongous dataset like VALL-E.

But I'm doubtful about the way the latent features are extracted from the voice. Maybe that part has some issue. It's not able to capture the timbre information from my voice.

Furthermore, sometimes the same speaker sounds different. The team should provide an argument to control the randomness of every inference, like the generator argument HuggingFace gives for Stable Diffusion. Bark won't be useful if the speaking style and voice vary across every inference (i.e. if it's unpredictable every time).
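
One possible workaround, sketched under an assumption this thread does not confirm: that Bark's sampling draws from the global PyTorch and NumPy random number generators, so seeding them before each call may make a run repeatable. seed_everything is a hypothetical helper, not part of Bark.

```python
import random


def seed_everything(seed: int) -> None:
    """Seed the common global RNGs; NumPy and PyTorch are seeded only if installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass


# Demonstrate repeatability with the standard-library generator.
seed_everything(1234)
first = random.random()
seed_everything(1234)
second = random.random()
print(first == second)
```

The intended use would be to call seed_everything(1234) immediately before generate_audio(...). Whether the audio then comes out identical may still depend on hardware and library versions.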

cybershrapnel commented 1 year ago

agreed, but Ive heard examples with bark using other voices. so... Does anyone have any example npz files we can play with?

RageshAntony commented 1 year ago

@cybershrapnel

Did the same. But still getting Assertion error

[image attachment]

RahulBhalley commented 1 year ago

but Ive heard examples with bark using other voices.

Could you please give me some links? I straightaway started generating speech instead of looking at other’s generated speeches.

cybershrapnel commented 1 year ago

Why do you keep using v2 in the path? I'm not trying to be a jerk, but you clearly don't understand very basic coding concepts. Stop trying to call it v2, or if you are going to call it v2, put the file in the v2 folder. You are having nothing but a path issue, which we can't help you with. Paths are a very basic coding principle. You need to learn your basics on paths before you go any further. You are only having an issue with the file path. That's it. Nothing else is going on except that you are pathing your file wrong.

cybershrapnel commented 1 year ago

@RahulBhalley https://www.reddit.com/r/singularity/comments/12udgzh/bark_text2speechbut_with_custom_voice_cloning/

RageshAntony commented 1 year ago

@cybershrapnel Well. I have 7 years of coding experience. Let me tell you in detail what I did

  1. I ran the sample code with the default speaker "v2/en_speaker_9". It executed and played successfully.
  2. Then I overwrote the original en_speaker_9.npz with my own "en_speaker_9.npz".
  3. Then I ran it again.
  4. This time I got the assertion error.

cybershrapnel commented 1 year ago

v2 npzs are different, I think, which would explain your error. That generator clearly makes v1 npz files.

RageshAntony commented 1 year ago

@cybershrapnel May be. Let me check it

RageshAntony commented 1 year ago

@cybershrapnel I replaced the "en_speaker_9" inside the prompts folder

I still get the assertion error

RageshAntony commented 1 year ago

let me check with the clone_voice.ipynb notebook

cybershrapnel commented 1 year ago

You need to reinstall. You clearly messed something up if that's not working, because it still sounds like you're having a path issue. I tried it under both the normal and v2 folders and it worked.

cybershrapnel commented 1 year ago

Also, keep in mind there have been serious changes to this repo lately, and I think they broke it. I rolled back to an older version because of the memory issues the new version introduces on cards with 8 GB or less. I think it's due to the increased inference speed, but I'm not sure. When I use the current version, I get a lot of garbled audio. The old version is almost perfect but very slow.

EricKong1985 commented 1 year ago

warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
100%|██████████| 92/92 [00:00<00:00, 112.01it/s]
Traceback (most recent call last):
  File "C:\Python310\lib\site-packages\bark\api.py", line 113, in generate_audio
    out = semantic_to_waveform(
  File "C:\Python310\lib\site-packages\bark\api.py", line 54, in semantic_to_waveform
    coarse_tokens = generate_coarse(
  File "C:\Python310\lib\site-packages\bark\generation.py", line 571, in generate_coarse
    round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
AssertionError

Process finished with exit code 1

I followed this thread to clone my voice and then hit this error. Does anyone know how to fix it?

jn-jairo commented 1 year ago

Regarding loading the npz you must pass the full path with the .npz extension like:

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

prompt = "Hello, my name is Suno. And, uh — and I like pizza. [laughs] But I also have other interests such as playing tic tac toe."

history_prompt = "/path/to/history_prompt.npz"

audio_array = generate_audio(prompt, history_prompt=history_prompt)

write_wav("/path/to/audio.wav", SAMPLE_RATE, audio_array)

About the other error, your npz is not correct and you should seek help where you created that npz.
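
To see why a given npz trips the assertion, the comparison shown in the traceback can be reproduced outside Bark. The sketch below is an approximation under assumptions: the constant values and the key names "semantic_prompt" and "coarse_prompt" are taken to match Bark's generation.py and should be checked against your installed version, and both function names are hypothetical.

```python
# Assumed constants; verify against your installed bark/generation.py.
SEMANTIC_RATE_HZ = 49.9
COARSE_RATE_HZ = 75
N_COARSE_CODEBOOKS = 2
CODEBOOK_SIZE = 1024


def coarse_ratio_ok(n_coarse_frames: int, n_semantic_tokens: int) -> bool:
    """Mirror the ratio comparison that appears in the traceback."""
    semantic_to_coarse_ratio = COARSE_RATE_HZ / SEMANTIC_RATE_HZ * N_COARSE_CODEBOOKS
    expected = round(semantic_to_coarse_ratio / N_COARSE_CODEBOOKS, 1)
    return round(n_coarse_frames / n_semantic_tokens, 1) == expected


def inspect_history_prompt(path: str) -> dict:
    """Load an .npz file and report the properties the assertion looks at."""
    import numpy as np  # imported here so the ratio check above runs without NumPy

    data = np.load(path)
    semantic = data["semantic_prompt"]  # assumed key name
    coarse = data["coarse_prompt"]      # assumed key name
    return {
        "keys": sorted(data.files),
        "semantic_shape": semantic.shape,
        "coarse_shape": coarse.shape,
        "coarse_max_ok": int(coarse.max()) <= CODEBOOK_SIZE - 1,
        "ratio_ok": coarse_ratio_ok(coarse.shape[-1], len(semantic)),
    }


print(coarse_ratio_ok(150, 100))  # True: 1.5 coarse frames per semantic token
print(coarse_ratio_ok(200, 100))  # False: ratio is 2.0
```

Calling inspect_history_prompt("/path/to/history_prompt.npz") on a file that fails should show which of the two conditions is not met.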

P.S.: Looking at the link you provided, they use a technique I tried before, but it doesn't work. As far as I know, there is no reliable method to really clone your voice. I tried the HuBERT-based method mentioned below and it works fine.

JonathanFly commented 1 year ago

You're probably using the old cloning tech which produced invalid .npz files very often. Use the new hubert based methods. Most popular Bark UIs have it built in (including mine) and the original repo is https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer

JonathanFly commented 1 year ago

OK, I got a better result this time, but it still makes a lot of music noises. It was my voice I used to make the npz file, and I prompted bark with [man] in the text, and it still sounds like voice 6 female... So I dunno... https://github.com/suno-ai/bark/assets/17352697/bf6ab70e-4f65-4bcf-a027-f468b3729514

Crafting a voice with text prompts is generally done with a random voice, then saving the Bark output as a new .npz file. If you're cloning, the text prompt isn't going to shape the voice much. It does shape it somewhat, so you can save the sample again and make a new version. For example, here's a variant of v2/en_speaker_3 I modified to speak faster, but that's a lot more fiddly: v2_en_speaker_3_double_expresso.zip

https://github.com/suno-ai/bark/assets/163408/da4f0571-0954-4481-b121-87db20f5fbd8

As an example of voice crafting try using a random voice (no history_prompt, no .npz file) with this prompt: Listen to my soothing, relaxing voice. Breathe calmly in, and out. Slowly close your eyes. Continue to breathe at this slow pace. Feel the air expand your lungs with each in breath.

You'll get a very high percentage of slow calm female voices.
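
The voice-crafting workflow described in this comment could look roughly like the sketch below. It assumes that generate_audio accepts output_full=True and that bark.api provides save_as_prompt(filepath, full_generation), which is my understanding of Bark's API; check both against your installed version. missing_prompt_keys and craft_voice are hypothetical helpers.

```python
REQUIRED_KEYS = ("semantic_prompt", "coarse_prompt", "fine_prompt")


def missing_prompt_keys(full_generation: dict) -> list:
    """Return the required keys absent from a generation dict (empty means complete)."""
    return [key for key in REQUIRED_KEYS if key not in full_generation]


def craft_voice(text_prompt: str, npz_path: str) -> None:
    """Generate with a random voice and save the result as a reusable .npz prompt."""
    # Imported lazily so the helper above can be used without Bark installed.
    from bark import generate_audio
    from bark.api import save_as_prompt

    # No history_prompt is passed, so the voice is shaped only by the text.
    full_generation, _audio = generate_audio(text_prompt, output_full=True)
    missing = missing_prompt_keys(full_generation)
    if missing:
        raise ValueError(f"generation is missing keys: {missing}")
    save_as_prompt(npz_path, full_generation)


print(missing_prompt_keys({"semantic_prompt": [], "coarse_prompt": []}))
```

The saved file could then be passed back as history_prompt="/path/to/crafted_voice.npz" on later calls, and re-saved after each generation to nudge the voice further.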

JeavanCode commented 1 year ago

Hi guys, I am using Bark with the Hugging Face transformers API, and when I tried to pass history_prompt I got the error "TypeError: string indices must be integers, not 'str'". It seems the transformers API expects a dictionary-like type whose values must be torch.tensor. Any clues to solve the problem?

jn-jairo commented 1 year ago

Hi guys, I am using Bark with the Hugging Face transformers API, and when I tried to pass history_prompt I got the error "TypeError: string indices must be integers, not 'str'". It seems the transformers API expects a dictionary-like type whose values must be torch.tensor. Any clues to solve the problem?

@JeavanCode just pass the path of the npz file, but it usually complains about custom npz files that work fine with the original bark. Looks like if the npz is not exactly in the format they need it does nothing to crop the data like the original bark does.

from scipy.io import wavfile
from transformers import AutoProcessor, BarkModel

processor = AutoProcessor.from_pretrained("suno/bark-small")
model = BarkModel.from_pretrained("suno/bark-small")

voice_preset = "/path/to/history_prompt.npz"

inputs = processor("Hello, my dog is cute, I need him in my life", voice_preset=voice_preset)

audio_array = model.generate(**inputs, semantic_max_new_tokens=100)
audio_array = audio_array.cpu().numpy().squeeze()

sample_rate = model.generation_config.sample_rate
wavfile.write("/path/to/audio.wav", sample_rate, audio_array)

JeavanCode commented 1 year ago

Thanks! I didn't know I needed to pass voice_preset to the processor instead of passing history_prompt to model.generate. BTW, how did you figure it out? Is there documentation or a handbook or something? I always get confused when calling APIs like BarkModel.from_pretrained("suno/bark-small"); I don't understand how to trace back code like this.

jn-jairo commented 1 year ago

Thanks! I didn't know I needed to pass voice_preset to the processor instead of passing history_prompt to model.generate. BTW, how did you figure it out? Is there documentation or a handbook or something? I always get confused when calling APIs like BarkModel.from_pretrained("suno/bark-small"); I don't understand how to trace back code like this.

Documentation https://huggingface.co/docs/transformers/model_doc/bark and source code https://github.com/huggingface/transformers

Maverick1983 commented 1 year ago

Hi, I created an npz file with a cloned Italian voice, but it's not good with the Italian language. Do I need to create a new HuBERT base model first and then train on the audio?

jn-jairo commented 1 year ago

Hi, I created an npz file with a cloned Italian voice, but it's not good with the Italian language. Do I need to create a new HuBERT base model first and then train on the audio?

Yes, to clone Italian you need a hubert model specific for Italian.

Maverick1983 commented 1 year ago

How can I get a guide to train a HuBERT base model?

jn-jairo commented 1 year ago

How can I get a guide to train a HuBERT base model?

https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer#how-do-i-train-it-myself

Maverick1983 commented 1 year ago

I already did that, but the base model is in English. I mean, how can I create a base HuBERT model in Italian for training other speakers?

jn-jairo commented 1 year ago

I already did that, but the base model is in English. I mean, how can I create a base HuBERT model in Italian for training other speakers?

Read the "How do I train it myself?" section; it explains how to create a new model in any language you want.

Maverick1983 commented 12 months ago

I repeat: I did that, but it's not good with Italian, because the base .pth is in English.

jn-jairo commented 12 months ago

@Maverick1983 Looks like you are having trouble finding it so I will copy and paste it here

https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer#how-do-i-train-it-myself


How do I train it myself?

Simply run the training commands.

A simple way to create semantic data and wavs for training, is with my script: bark-data-gen. But remember that the creation of the wavs will take around the same time if not longer than the creation of the semantics. This can take a while to generate because of that.

For example, if you have a dataset with zips containing audio files, one zip for semantics, and one for the wav files. Inside of a folder called "Literature"

You should run process.py --path Literature --mode prepare for extracting all the data to one directory

You should run process.py --path Literature --mode prepare2 for creating HuBERT semantic vectors, ready for training

You should run process.py --path Literature --mode train for training

And when your model has trained enough, you can run process.py --path Literature --mode test to test the latest model.


To create the dataset, use this repository as an example, but CHANGE THE BOOKS TO ITALIAN BOOKS so it works with ITALIAN:

https://github.com/gitmylo/bark-data-gen

After you do all these things, you will have a PTH file for ITALIAN.
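
The four process.py stages quoted in this comment can be chained from a small driver script. This is only a sketch: it assumes process.py sits in the current working directory and that the train stage exits on its own, which the quoted README does not state. build_commands and run_pipeline are hypothetical helper names.

```python
import subprocess
import sys

# Stage names in the order given in the quoted README.
MODES = ("prepare", "prepare2", "train", "test")


def build_commands(dataset_dir: str) -> list:
    """Build one argv list per stage for the given dataset folder."""
    return [
        [sys.executable, "process.py", "--path", dataset_dir, "--mode", mode]
        for mode in MODES
    ]


def run_pipeline(dataset_dir: str) -> None:
    """Run each stage in turn, stopping at the first failure."""
    for argv in build_commands(dataset_dir):
        subprocess.run(argv, check=True)


# Print the commands instead of running them.
for argv in build_commands("Literature"):
    print(" ".join(argv[1:]))
```

For an Italian model, the "Literature" folder would hold semantics and wavs generated from Italian books, as the comment describes.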

Maverick1983 commented 12 months ago

I already did that... but it does not speak good Italian.
