Problems training a Portuguese quantizer model

Subarasheese commented 1 year ago

Greetings,

I've followed all the steps on the guide to train a Portuguese dataset, but unfortunately, across epochs, either the model did not really clone the voice, or provided voices that did resemble the target voice but produced bad speech outputs like screeching, saying things too slowly or "sounding drunk". I could not get a single model that managed to consistently produce good speeches with voices closely resembling the target voice despite training on a dataset of a little over 3200 samples up to 30 epochs (I tested every single epoch). For the dataset, I am using some public domain classic literature books and the Bible.

What I am doing wrong, and what can I do to improve the training and get better models?

gitmylo commented 1 year ago

Try using a different wav2vec model, maybe one that's better for languages aside from english. Also check your dataset, does it sound normal? If not, that's a bark issue, not really anything i can do about that.

Subarasheese commented 1 year ago

Try using a different wav2vec model, maybe one that's better for languages aside from english. Also check your dataset, does it sound normal? If not, that's a bark issue, not really anything i can do about that.

The audio from the dataset are mostly OK (it is what you would expect from Bark, a mixed bag; there is indeed bad audio in there but most of them are good in my book, as in, if the cloning model could produce those outputs I would be satisfied, unfortunately it can't)

Also, I was using a Hubert model, not Wav2Vec... But I created custom code to load this model:

https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese

Does this repo already has code to load Wav2Vec models?

Also, would I need Wave2Vec only when training, or would I need when cloning a voice from a .wav file too? Because from the repo example as well as the webuis it seem Hubert is also used when extracting features from the input .wav voice that will be used in cloning

Subarasheese commented 1 year ago

@gitmylo I modified the preparation script to use Wav2Rec instead of Hubert, however, the semantic features files are in a different format that is accepted by the training script that reads the Hubert-generated files instead...

Can you give me a hand on this?


def prepare2(path,model):
    prepared = os.path.join(path, 'prepared')
    ready = os.path.join(path, 'ready')
    model_name = "Edresson/wav2vec2-large-xlsr-coraa-portuguese"
    processor = Wav2Vec2Processor.from_pretrained(model_name)
    model = Wav2Vec2Model.from_pretrained(model_name)
    if not os.path.isdir(ready):
        os.mkdir(ready)

    wav_string = '_wav.wav'
    sem_string = '_semantic.npy'
    for input_file in os.listdir(prepared):
        input_path = os.path.join(prepared, input_file)
        if input_file.endswith(wav_string):
            file_num = int(input_file[:-len(wav_string)])
            fname = f'{file_num}_semantic_features.npy'
            print('Processing', input_file)
            if os.path.isfile(fname):
                continue
            wav, sr = torchaudio.load(input_path)

            if wav.shape[0] == 2:  # Stereo to mono if needed
                wav = wav.mean(dim=0)
            resampler = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)  # Move this line here
            wav = resampler(wav)  # Resample to 16,000 Hz
            inputs = processor(wav.squeeze().numpy(), return_tensors="pt", padding=True, sampling_rate=16000)
            with torch.no_grad():
                outputs = model(**inputs)
            out_array = outputs.last_hidden_state.cpu().numpy()
            numpy.save(os.path.join(ready, fname), out_array)
        elif input_file.endswith(sem_string):
            fname = os.path.join(ready, input_file)
            if os.path.isfile(fname):
                continue
            shutil.copy(input_path, fname)
    print('All set! We\'re ready to train!')

gitmylo commented 1 year ago

check the shape of the outputs of wav2vec, and when creating the HuBERT quantizer model, set the CustomTokenizer's input_size to that value.

Subarasheese commented 1 year ago

check the shape of the outputs of wav2vec, and when creating the HuBERT quantizer model, set the CustomTokenizer's input_size to that value.

Here are the outputs of shape and hidden size:

(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode prepare2
Processing 0_wav.wav
Shape of Wav2Vec model output: (1, 325, 1024)
Hidden size (last dimension of model output): 1024
Processing 1_wav.wav
Shape of Wav2Vec model output: (1, 642, 1024)
Hidden size (last dimension of model output): 1024
Processing 2_wav.wav
Shape of Wav2Vec model output: (1, 357, 1024)
Hidden size (last dimension of model output): 1024
Processing 3_wav.wav
Shape of Wav2Vec model output: (1, 689, 1024)
Hidden size (last dimension of model output): 1024
Processing 4_wav.wav

From what I understand, you asked me to do this:

class CustomTokenizer(nn.Module):
    def __init__(self, hidden_size=1024, input_size=1024, output_size=10000, version=0):
        super(CustomTokenizer, self).__init__()
        input_size = 1024
        next_size = input_size
(...)

However, this change alone is not enough, as it results in errors:

(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
    auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 184, in auto_train
    model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0)  # Print loss every 50 steps
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 86, in train_step
    loss = lossfunc(y_pred, y_train_hot)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float

Even after modifying the train_step method, removing the lines of code that were creating a one-hot vector from y_train and adding a line to convert y_train to a Long tensor using the long() method resulted in an error:

(venv) [user@user bark-voice-cloning-HuBERT-quantizer]$ python process.py --path Literature --mode train
Creating new model.
Traceback (most recent call last):
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
    auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 181, in auto_train
    model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0)  # Print loss every 50 steps
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 82, in train_step
    loss = lossfunc(y_pred, y_train)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: Expected target size [1, 10000], got [1]

Any ideas?

gitmylo commented 1 year ago

You made a mistake by modifying the customtokenizer, it doesn't require any modifications. this is the line that requires modifications, you need to add input_size=1024 to this line.

Doing it in the way you did above, will cause incompatibility with other models, which would prevent you from loading it with the standard automatic loader function load_from_checkpoint.

And removing the one-hot is obviously not a great idea. it converts a tensor from indicating a position, to an absolute position. Changing the shape from (n,) to (n, 10000).

Subarasheese commented 1 year ago

You made a mistake by modifying the customtokenizer, it doesn't require any modifications. this is the line that requires modifications, you need to add input_size=1024 to this line.

Doing it in the way you did above, will cause incompatibility with other models, which would prevent you from loading it with the standard automatic loader function load_from_checkpoint.

And removing the one-hot is obviously not a great idea. it converts a tensor from indicating a position, to an absolute position. Changing the shape from (n,) to (n, 10000).

I did that simple change, however, this problem occurs in the script:


Creating new model.
Traceback (most recent call last):
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/process.py", line 19, in <module>
    auto_train(path, load_model=os.path.join(path, 'model.pth'), save_epochs=args.train_save_epochs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 183, in auto_train
    model_training.train_step(torch.tensor(x).to('cuda'), torch.tensor(y).to('cuda'), j % 50 == 0)  # Print loss every 50 steps
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/hubert/customtokenizer.py", line 85, in train_step
    loss = lossfunc(y_pred, y_train_hot)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/modules/loss.py", line 1174, in forward
    return F.cross_entropy(input, target, weight=self.weight,
  File "/run/media/myssd/bark-data-gen/bark-voice-cloning-HuBERT-quantizer/venv/lib/python3.10/site-packages/torch/nn/functional.py", line 3029, in cross_entropy
    return torch._C._nn.cross_entropy_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index, label_smoothing)
RuntimeError: expected scalar type Long but found Float

The only modification in the script is:

    model_training = CustomTokenizer(version=1,input_size=1024).to('cuda')

gitmylo commented 1 year ago

Make sure the data types are correct, and are correctly converted if needed.

Subarasheese commented 1 year ago

Make sure the data types are correct, and are correctly converted if needed.

Ok, so I just added .long() to this line:

    loss = lossfunc(y_pred, y_train_hot.long())

And it started training... With suspiciously really low loss:

Creating new model. Loss 5.78387975692749 Loss 0.004742730874568224 Loss 0.003009543986991048 Loss 0.001717338222078979 Loss 0.0006962093175388873 Loss 0.0011351052671670914 Loss 0.0004761719610542059 Loss 0.001148720970377326 Loss 0.0010141769889742136 Loss 0.0012041080044582486 Loss 0.000879546336364001 (...)

So I went to test the model...

And then this happened:

Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
    result = await self.call_function(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 48, in clone_voice
    semantic_tokens = tokenizer.get_token(semantic_vectors)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 55, in get_token
    return torch.argmax(self(x), dim=1)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 40, in forward
    x, _ = self.lstm(x)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 810, in forward
    self.check_forward_args(input, hx, batch_sizes)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 730, in check_forward_args
    self.check_input(input, batch_sizes)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/rnn.py", line 218, in check_input
    raise RuntimeError(
RuntimeError: input.size(-1) must be equal to input_size. Expected 1024, got 768

So I changed everything related to input_size to 768 and this happened:

Loading Hubert ./models/hubert/hubert.pt
Loading Custom Hubert Tokenizer ./models/hubert/pt_tokenizer.pth
Traceback (most recent call last):
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/routes.py", line 439, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1384, in process_api
    result = await self.call_function(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/blocks.py", line 1089, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/gradio/utils.py", line 700, in wrapper
    response = f(*args, **kwargs)
  File "/home/user/Coding/bark-gui/cloning/clonevoice.py", line 34, in clone_voice
    tokenizer = CustomTokenizer.load_from_checkpoint(f'./models/hubert/{tokenizer_lang}_tokenizer.pth').to(device)  # Automatically uses the right layers
  File "/home/user/Coding/bark-gui/bark/hubert/customtokenizer.py", line 122, in load_from_checkpoint
    model.load_state_dict(torch.load(path))
  File "/home/user/Coding/bark-gui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2041, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for CustomTokenizer:
        size mismatch for lstm.weight_ih_l0: copying a param with shape torch.Size([4096, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 768]).

On clonevoice.py, I even changed the code to do this (padding the vectors with zeros up to 1024 ) and of course that turned out to be a terrible idea as the resulting cloned npz came out corrupted:


    def pad_to_size(vec, size):
        zeros = torch.zeros((vec.shape[0], size - vec.shape[1]))
        vec = torch.cat([vec, zeros], dim=1)
        return vec

    # Before you call get_token:
    semantic_vectors = hubert_model.forward(wav, input_sample_hz=model.sample_rate)
    semantic_vectors = pad_to_size(semantic_vectors, 1024)
    semantic_tokens = tokenizer.get_token(semantic_vectors)

What do you suggest? Am I missing something?

Subarasheese commented 1 year ago

This issue can be closed. I finished training a portuguese quantizer. 24 epochs, 0.0005 LR, over 4000 utterances.

Model weights:

https://huggingface.co/MadVoyager/bark-voice-cloning-portuguese-HuBERT-quantizer

Dataset:

https://huggingface.co/datasets/MadVoyager/bark-portuguese-semantic-wav-training/

Retraining using a lower Learning Rate and more utterances greatly improved the model.