gitmylo / bark-voice-cloning-HuBERT-quantizer

The code for the bark-voicecloning model. Training and inference.
MIT License

"no description" when bark run #1

Closed NickAnastasoff closed 1 year ago

NickAnastasoff commented 1 year ago

I have tried to create an npz, although I think I have done something wrong. I have gotten bark running up until generate_coarse:

```
Exception has occurred: AssertionError
exception: no description
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/generation.py", line 573, in generate_coarse
    round(x_coarse_history.shape[-1] / len(x_semantic_history), 1)
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/api.py", line 54, in semantic_to_waveform
    coarse_tokens = generate_coarse(
  File "/Users/nickanastasoff/Desktop/bark test/bark/bark/api.py", line 113, in generate_audio
    out = semantic_to_waveform(
```

customHuburt.txt — this is what I used to make the npz. I'm pretty sure the issue is with fine_prompt = codes, but I'm not sure what else to do.

gitmylo commented 1 year ago

This doesn't seem to be an issue with my repository. This repository exclusively extracts semantics.

Also, I was not able to reproduce the issue; your code worked fine on my side.

NickAnastasoff commented 1 year ago

Thank you so much for your reply! Sadly it still didn't work for me. How did you generate the npz? This is what I wrote, so it's probably the issue:

```
from encodec import EncodecModel
from encodec.utils import convert_audio

import torchaudio
import torch

# Instantiate a pretrained EnCodec model.
model = EncodecModel.encodec_model_24khz()
# The number of codebooks used is determined by the bandwidth selected.
# E.g. for a bandwidth of 6 kbps, n_q = 8 codebooks are used.
# Supported bandwidths are 1.5 kbps (n_q = 2), 3 kbps (n_q = 4), 6 kbps (n_q = 8),
# 12 kbps (n_q = 16) and 24 kbps (n_q = 32).
# For the 48 kHz model, only 3, 6, 12 and 24 kbps are supported; the number of
# codebooks for each is half that of the 24 kHz model, as the frame rate is twice as high.
model.set_target_bandwidth(6.0)

# Load and pre-process the audio waveform
wav, sr = torchaudio.load("0520.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels)
wav = wav.unsqueeze(0)

# Extract discrete codes from EnCodec
with torch.no_grad():
    encoded_frames = model.encode(wav)
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]

fine_prompt = codes  # <- is this the issue?

coarse = fine_prompt[:2, :]

import numpy

numpy.savez(semantic_prompt=semantic_tokens, fine_prompt=fine_prompt, coarse_prompt=coarse, file="pleasework.npz")
```

gitmylo commented 1 year ago

You should probably wrap your code in code blocks (``` around your text) in the future.

I ran that code, and it created the file just fine. Can you send me the wav you're using? I think your input wav is a bit broken, and encodec can't load it.

Again, this issue is not really related to my repository here. But it's probably your wav file.

NickAnastasoff commented 1 year ago

Sorry for the wait!

I put my code into a Jupyter notebook, and I still got the same problem! I'll link that, and my audio.wav is in it.

Thanks so much for your time!

VoiceCloning Google Colab

gitmylo commented 1 year ago

You can't upload an audio file like that to Google Colab, since its storage is not persistent.

Check if you can clone the file in here

NickAnastasoff commented 1 year ago

I found the problem! You were right! I shortened my wav to under 10 seconds, and it's working, thank you so much! By the way, it might be helpful for others if you put the Google Colab I had above in the readme: https://colab.research.google.com/drive/1IA3c_R859nANerMARazCSrjc2UD3ws8A?usp=sharing

gitmylo commented 1 year ago

Oh, actually, I noticed this today.

```
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]
```

should be

```
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()  # [B, n_q, T]
```
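
For reference, a minimal sketch of how that fix slots into the npz-creation code from earlier in the thread. The assumption (not spelled out above) is that Bark's history prompt wants semantic_prompt as a 1-D array, coarse_prompt as [2, T] and fine_prompt as [8, T], which is why the batch dimension has to be squeezed out; model, wav and semantic_tokens are taken from the earlier snippet.

```
import numpy
import torch

# model, wav and semantic_tokens are assumed to be prepared as in the earlier snippet.
with torch.no_grad():
    encoded_frames = model.encode(wav)

# .squeeze() drops the batch dimension: [1, n_q, T] -> [n_q, T]
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()

fine_prompt = codes.cpu().numpy()   # all 8 codebooks, shape [8, T]
coarse_prompt = fine_prompt[:2, :]  # first 2 codebooks only, shape [2, T]

numpy.savez("pleasework.npz",
            semantic_prompt=semantic_tokens,
            fine_prompt=fine_prompt,
            coarse_prompt=coarse_prompt)
```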

NickAnastasoff commented 1 year ago

That makes sense! The code I write is usually the problem 🤣

Thanks so much!

gitmylo commented 1 year ago

> That makes sense! The code I write is usually the problem 🤣
>
> Thanks so much!

That was actually something I was missing in the old version, plus the encodec example doesn't have it. So that's on me.

NickAnastasoff commented 1 year ago

For anyone trying to find an answer:

```
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1)  # [B, n_q, T]
```

should be

```
codes = torch.cat([encoded[0] for encoded in encoded_frames], dim=-1).squeeze()  # [B, n_q, T]
```

And shorten the audio file to under 10 seconds.
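
As a usage note (a sketch, not from the thread): once the .npz is saved, it can be passed to Bark as the history prompt. This assumes the installed Bark version accepts a file path for history_prompt; the text prompt and output file name here are just examples.

```
from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

preload_models()

# Use the cloned voice by passing the saved .npz as the history prompt
audio_array = generate_audio("Hello, this is my cloned voice.",
                             history_prompt="pleasework.npz")

write_wav("cloned_output.wav", SAMPLE_RATE, audio_array)
```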

gitmylo commented 1 year ago

You don't need to shorten the audio, but it's recommended to shorten it to 15 or 20 seconds; going beyond 15 seconds will result in less audio for it to clone from.

Make sure you take the audio from the end, not the start.
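
A minimal sketch of that trimming step with torchaudio (file names are just examples), keeping the last 15 seconds of the clip rather than the first:

```
import torchaudio

# Load the full clip (file name is just an example)
wav, sr = torchaudio.load("audio.wav")

# Keep only the last 15 seconds, since the end of the clip is what gets used
max_samples = 15 * sr
wav = wav[:, -max_samples:]

torchaudio.save("audio_trimmed.wav", wav, sr)
```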