Try using a different wav2vec model, maybe one that's better for languages other than English. Also check your dataset: does it sound normal? If not, that's a Bark issue, not really anything I can do about that.
The audio files from the dataset are mostly OK (they are what you would expect from Bark, a mixed bag; there is indeed some bad audio in there, but most of it is good in my book, as in, if the cloning model could produce those outputs I would be satisfied; unfortunately, it can't).
Also, I was using a HuBERT model, not Wav2Vec... But I created custom code to load this model:
https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese
Does this repo already have code to load Wav2Vec models?
Also, would I need Wav2Vec only when training, or would I need it when cloning a voice from a .wav file too? From the repo example, as well as the webuis, it seems HuBERT is also used when extracting features from the input .wav voice that will be used for cloning.
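For context, the cloning path does run a feature extractor over the input .wav before quantizing. A hedged sketch of that flow, using the names that appear in the clonevoice.py excerpts further down in this thread (illustrative only, not necessarily the exact current code):

# The feature extractor (HuBERT by default) runs at cloning time as well:
semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
# The trained quantizer then maps those features to Bark semantic tokens:
semantic_tokens = tokenizer.get_token(semantic_vectors)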
@gitmylo I modified the preparation script to use Wav2Vec instead of HuBERT; however, the semantic feature files come out in a different format than the one accepted by the training script, which expects the HuBERT-generated files instead...
Can you give me a hand on this?
import os
import shutil

import numpy
import torch
import torchaudio
from transformers import Wav2Vec2Model, Wav2Vec2Processor


def prepare2(path, model):
    prepared = os.path.join(path, 'prepared')
    ready = os.path.join(path, 'ready')

    # Note: this deliberately ignores the `model` argument and loads the Portuguese Wav2Vec2 checkpoint instead.
    model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2Model.from_pretrained(model_name)

    if not os.path.isdir(ready):
        os.mkdir(ready)

    wav_string = '_wav.wav'
    sem_string = '_semantic.npy'
    for input_file in os.listdir(prepared):
        input_path = os.path.join(prepared, input_file)
        if input_file.endswith(wav_string):
            file_num = int(input_file[:-len(wav_string)])
            fname = f'{file_num}_semantic_features.npy'
            print('Processing', input_file)
            if os.path.isfile(os.path.join(ready, fname)):  # Skip files that were already processed
                continue
            wav, sr = torchaudio.load(input_path)
            if wav.shape[0] == 2:  # Stereo to mono if needed
                wav = wav.mean(dim=0)
            resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
            wav = resampler(wav)  # Resample to 16,000 Hz, the rate Wav2Vec2 expects
            inputs = processor(wav.squeeze().numpy(), return_tensors="pt", padding=True, sampling_rate=16000)
            with torch.no_grad():
                outputs = model(**inputs)
            out_array = outputs.last_hidden_state.cpu().numpy()
            numpy.save(os.path.join(ready, fname), out_array)
        elif input_file.endswith(sem_string):
            fname = os.path.join(ready, input_file)
            if os.path.isfile(fname):
                continue
            shutil.copy(input_path, fname)
    print('All set! We\'re ready to train!')
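A quick way to see whether the files produced above match what the training script expects is to compare their shapes and dtypes against a HuBERT-prepared run. A hedged sketch (the ready_hubert folder name is only an assumption for illustration):

import numpy

# Wav2Vec2-generated features from the prepare2 step above
wav2vec_feats = numpy.load('Literature/ready/0_semantic_features.npy')
# HuBERT-generated features from an earlier run of the stock prepare step (hypothetical folder)
hubert_feats = numpy.load('Literature/ready_hubert/0_semantic_features.npy')

print('Wav2Vec2:', wav2vec_feats.shape, wav2vec_feats.dtype)  # e.g. (1, 325, 1024), see the log below
print('HuBERT:  ', hubert_feats.shape, hubert_feats.dtype)    # HuBERT-base features are 768-dimensional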
Check the shape of the outputs of wav2vec, and when creating the HuBERT quantizer model, set the CustomTokenizer's input_size to that value.
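In other words, something like this hedged sketch: the first lines belong in the preparation step, the last one wherever the training script constructs the model (per the tracebacks below, that is customtokenizer.py's auto_train):

# In the preparation step, inspect the Wav2Vec2 output once:
out_array = outputs.last_hidden_state.cpu().numpy()
print('Shape of Wav2Vec model output:', out_array.shape)  # e.g. (1, frames, 1024)
hidden_size = out_array.shape[-1]

# When the quantizer is created for training, pass that value (a sketch, not the exact repo line):
model_training = CustomTokenizer(version=1, input_size=hidden_size).to('cuda')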
Here are the output shapes and the hidden size:
(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode prepare2
Processing 0_wav.wav
Shape of Wav2Vec model output: (1, 325, 1024)
Hidden size (last dimension of model output): 1024
Processing 1_wav.wav
Shape of Wav2Vec model output: (1, 642, 1024)
Hidden size (last dimension of model output): 1024
Processing 2_wav.wav
Shape of Wav2Vec model output: (1, 357, 1024)
Hidden size (last dimension of model output): 1024
Processing 3_wav.wav
Shape of Wav2Vec model output: (1, 689, 1024)
Hidden size (last dimension of model output): 1024
Processing 4_wav.wav
From what I understand, you asked me to do this:
class CustomTokenizer(nn.Module):
    def __init__(self, hidden_size=1024, input_size=1024, output_size=10000, version=0):
        super(CustomTokenizer, self).__init__()
        input_size = 1024
        next_size = input_size
        (...)
However, this change alone is not enough, as it results in errors:
(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 184, in auto_train
model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0) # Print loss every 50 steps
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 86, in train_step
loss = lossfunc(y_pred, y_train_hot)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float
Even after modifying the train_step method (removing the lines of code that created a one-hot vector from y_train and adding a line to convert y_train to a Long tensor using the long() method), I got an error:
(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 181, in auto_train
model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0) # Print loss every 50 steps
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 82, in train_step
loss = lossfunc(y_pred, y_train)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [1, 10000], got [1]
Any ideas?
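For reference on the two errors above, F.cross_entropy (plain PyTorch behavior, nothing repo-specific) expects float logits of shape (N, C) and, as targets, either Long class indices of shape (N,) or float per-class probabilities of shape (N, C); mixing those modes produces exactly these dtype/shape complaints. A minimal sketch:

import torch
import torch.nn.functional as F

logits = torch.randn(1, 10000)            # (N, C) float predictions
target_idx = torch.tensor([42])           # class indices: a Long tensor of shape (N,)
loss = F.cross_entropy(logits, target_idx)

# A float target is only valid when it has the same (N, C) shape as the logits,
# in which case it is treated as per-class probabilities (e.g. a one-hot row).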
You made a mistake by modifying the CustomTokenizer; it doesn't require any modifications. This is the line that requires a modification: you need to add input_size=1024 to this line.
Doing it the way you did above will cause incompatibility with other models, which would prevent you from loading it with the standard automatic loader function load_from_checkpoint.
And removing the one-hot is obviously not a great idea: it converts a tensor from indicating a position (a class index) into an explicit one-hot vector, changing the shape from (n,) to (n, 10000).
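A tiny illustration of that shape change (hypothetical tensor values, just to show the (n,) to (n, 10000) conversion described above):

import torch
import torch.nn.functional as F

y_train = torch.tensor([3, 9999, 42])                         # shape (3,): each entry is a token index
y_train_hot = F.one_hot(y_train, num_classes=10000).float()   # shape (3, 10000): one one-hot row per index
print(tuple(y_train.shape), '->', tuple(y_train_hot.shape))   # (3,) -> (3, 10000)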
I made that simple change; however, this problem occurs in the script:
Creating new model.
Traceback (most recent call last):
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 183, in auto_train
model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0) # Print loss every 50 steps
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 85, in train_step
loss = lossfunc(y_pred, y_train_hot)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
return F.cross_entropy(input, target, weight=self.weight,
File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float
The only modification in the script is:
model_training = CustomTokenizer(version=1,input_size=1024).to('cuda')
Make sure the data types are correct, and are correctly converted if needed.
Ok, so I just added .long() to this line:
loss = lossfunc(y_pred, y_train_hot.long())
And it started training... with a suspiciously low loss:
Creating new model.
Loss 5.78387975692749
Loss 0.004742730874568224
Loss 0.003009543986991048
Loss 0.001717338222078979
Loss 0.0006962093175388873
Loss 0.0011351052671670914
Loss 0.0004761719610542059
Loss 0.001148720970377326
Loss 0.0010141769889742136
Loss 0.0012041080044582486
Loss 0.000879546336364001
(...)
So I went to test the model...
And then this happened:
Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
result = await self.call_function(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 48, in clone_voice
semantic_tokens = tokenizer.get_token(semantic_vectors)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 55, in get_token
return torch.argmax(self(x), dim=1)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 40, in forward
x, _ = self.lstm(x)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 810, in forward
self.check_forward_args(input, hx, batch_sizes)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
self.check_input(input, batch_sizes)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 218, in check_input
raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 768
So I changed everything related to input_size to 768 and this happened:
Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
result = await self.call_function(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
prediction = await anyio.to_thread.run_sync(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
result = context.run(func, *args)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
response = f(*args, **kwargs)
File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 34, in clone_voice
tokenizer = CustomTokenizer.load_from_checkpoint(f'./models/hubert/{tokenizer_lang}_tokenizer.pth').to(device) # Automatically uses the right layers
File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 122, in load_from_checkpoint
model.load_state_dict(torch.load(path))
File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomTokenizer:
size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 768]).
In clonevoice.py, I even changed the code to do this (padding the vectors with zeros up to 1024), and of course that turned out to be a terrible idea, as the resulting cloned .npz came out corrupted:
def pad_to_size(vec, size):
    zeros = torch.zeros((vec.shape[0], size - vec.shape[1]))
    vec = torch.cat([vec, zeros], dim=1)
    return vec

# Before you call get_token:
semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
semantic_vectors = pad_to_size(semantic_vectors, 1024)
semantic_tokens = tokenizer.get_token(semantic_vectors)
What do you suggest? Am I missing something?
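For anyone else hitting this: the 768 comes from the default HuBERT feature extractor loaded by clonevoice.py, while the quantizer above was trained on 1024-dimensional features from wav2vec2-large-xlsr, so padding cannot fix it; the dimensions only match if the cloning-time features come from the same Wav2Vec2 checkpoint used during preparation. A hedged sketch of that swap, reusing names from the snippets above (device handling and the exact clonevoice.py integration omitted):

import torch
from transformers import Wav2Vec2Model, Wav2Vec2Processor

model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
processor = Wav2Vec2Processor.from_pretrained(model_name)
wav2vec = Wav2Vec2Model.from_pretrained(model_name)

# wav: mono 16 kHz waveform tensor, prepared the same way as in prepare2 above
inputs = processor(wav.squeeze().numpy(), return_tensors="pt", padding=True, sampling_rate=16000)
with torch.no_grad():
    semantic_vectors = wav2vec(**inputs).last_hidden_state.squeeze(0)  # (frames, 1024)

semantic_tokens = tokenizer.get_token(semantic_vectors)  # now matches the quantizer's input_size of 1024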
This issue can be closed. I finished training a Portuguese quantizer: 24 epochs, 0.0005 LR, over 4000 utterances.
Model weights:
https://huggingface.co/MadVoyager/bark-voice-cloning-portuguese-HuBERT-quantizer
Dataset:
https://huggingface.co/datasets/MadVoyager/bark-portuguese-semantic-wav-training/
Retraining with a lower learning rate and more utterances greatly improved the model.
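If anyone wants to try the published weights, loading follows the same pattern the webui uses earlier in this thread; the local path and filename below are assumptions (download the .pth from the HuggingFace link above first), and, as discussed above, the cloning-time semantic vectors must come from the same 1024-dimensional Portuguese Wav2Vec2 checkpoint rather than the default HuBERT model:

from hubert.customtokenizer import CustomTokenizer  # module path as in this repo's tracebacks

# Load the Portuguese quantizer (path/filename assumed; see the HuggingFace repo linked above)
tokenizer = CustomTokenizer.load_from_checkpoint('./models/hubert/pt_tokenizer.pth').to('cuda')

# Extract 1024-dim features with Edresson/wav2vec2-large-xlsr-coraa-portuguese,
# then call tokenizer.get_token(semantic_vectors) as sketched earlier in the thread.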
@Subarasheese Can you help me train an Italian model? Can you write down the steps you performed?
Thanks
@Subarasheese How's it going? Man, I'm using your weights, but I noticed a strong paulista (São Paulo) accent. Is that due to the dataset?
Greetings,
I've followed all the steps in the guide to train a Portuguese dataset, but unfortunately, across epochs, either the model did not really clone the voice, or it produced voices that did resemble the target voice but gave bad speech outputs like screeching, speaking too slowly, or "sounding drunk". I could not get a single model that consistently produced good speech with voices closely resembling the target voice, despite training on a dataset of a little over 3200 samples for up to 30 epochs (I tested every single epoch). For the dataset, I am using some public domain classic literature books and the Bible.
What am I doing wrong, and what can I do to improve the training and get better models?