uberduck-ai / uberduck-ml-dev

ML models for Uberduck
Apache License 2.0

Question about zero-shot TTS #165

Open toymaker222 opened 11 months ago

toymaker222 commented 11 months ago

Thank you for your open-source work, but I can't seem to find a complete implementation of zero-shot TTS.

  1. The default dataset for radtts in the tutorials does not include the file coqui_resnet_512_emb.pt. Where can I find this file, or is there code to generate it?
  2. The zero-shot feature in radtts seems incomplete.
  3. The zero-shot feature in vits requires loading a pretrained model, but one doesn't seem to be provided. The training code for the corresponding encoder is not available either.
  4. Are there any examples or demos related to zero-shot TTS?
  5. How do I run inference with a model trained with radtts?
sjkoelle commented 11 months ago
  1. The model is here https://github.com/coqui-ai/TTS/releases/tag/speaker_encoder_model

  2. It does work, it's just poorly documented. Where are you getting stuck?

  3. We haven't used the zero-shot path in VITS recently. I wouldn't use it.

  4. Examples are the AI rap section at app.uberduck.ai.

  5. Where exactly are you getting stuck? It is a pretty standard forward pass but I can provide examples as well.

toymaker222 commented 11 months ago
  1. I want to use radtts to clone a new character's voice for TTS. Should I fine-tune a model trained on LJ? Does zero-shot mean cloning a new character's voice without fine-tuning?
  2. I trained with the default parameters in tutorials/radtts/demo_config.json and wrote inference code, but an error occurred. Do I need to enable the dur_pred_layer module during training? Can you take a look at how I'm calling it to see if it's correct?
# Imports added for completeness; the uberduck_ml_dev import paths for
# get_vocoder and Data depend on the repo version, so they are omitted here.
import argparse
import json
import sys
from collections import OrderedDict
from datetime import datetime

import numpy as np
import torch
from scipy.io.wavfile import write

from uberduck_ml_dev.models.radtts import RADTTS


def warmstart(checkpoint_path, model, strict=False):
    pretrained_dict = torch.load(checkpoint_path, map_location="cpu")
    pretrained_dict = pretrained_dict["state_dict"]

    is_module = False
    if list(pretrained_dict.keys())[0].startswith("module."):
        is_module = True
    if is_module:
        new_state_dict = OrderedDict()
        for k, v in pretrained_dict.items():
            name = k[7:]  # remove `module.`
            new_state_dict[name] = v
        pretrained_dict = new_state_dict

    model_dict = model.state_dict()
    model_dict.update(pretrained_dict)
    model.load_state_dict(model_dict, strict=strict)
    print(f"Warm started from {checkpoint_path} is module {is_module}")
    model.eval()

    return model

def parse_args(args):
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", help="Path to JSON config")
    parser.add_argument('-o', "--output_dir", default="results")
    parser.add_argument("--text", default="hello world")
    args = parser.parse_args(args)
    return args

if __name__ == "__main__":
    args = parse_args(sys.argv[1:])
    if args.config:
        with open(args.config) as f:
            config = json.load(f)
    else:
        print(f"!!!no config")
        exit(-1)

    model_config = config["model_config"]
    model = RADTTS(**model_config)

    pred_config = config["pred_config"]

    model = warmstart(pred_config["warmstart_checkpoint_path"] , model)
    #vocoder
    vocoder = get_vocoder(
        hifi_gan_config_path = pred_config["vocoder_config_path"],
        hifi_gan_checkpoint_path = pred_config["vocoder_checkpoint_path"],
    )

    ignore_keys = ["training_files", "validation_files"]
    print("initializing training dataloader")
    data_config = config["data_config"]
    dataset = Data(
        data_config["training_files"],
        **dict((k, v) for k, v in data_config.items() if k not in ignore_keys),
    )

    text = dataset.get_text(args.text).unsqueeze(0)
    print(f"type(text)={type(text)} text.shape={text.shape}")

    speaker_id = torch.LongTensor([0])
    model_output = model.infer(speaker_id, text, sigma=0.8)

    mels = model_output["mel"]
    # Run the vocoder directly; the previous hasattr() guard could leave `audio` unbound.
    audio = vocoder(mels.cpu()).float()[0]
    audio = audio[0].detach().cpu().numpy()
    audio = audio / np.abs(audio).max()

    now = datetime.now()
    suffix_path = now.strftime("%H_%M_%S")

    write("{}/{}.wav".format(args.output_dir, suffix_path))
Traceback (most recent call last):
  File "./uberduck-ml-dev/uberduck_ml_dev/exec/inference.py", line 98, in <module>
    model_output = model.infer(speaker_id, text, sigma=0.8)
  File "./uberduck-ml-dev/uberduck_ml_dev/models/radtts.py", line 752, in infer
    dur = self.dur_pred_layer.infer(
  File "./.conda/envs/test-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'RADTTS' object has no attribute 'dur_pred_layer'
sjkoelle commented 11 months ago
  1. Zero-shot means cloning without fine-tuning. In this situation, you don't need to train; just load the zero-shot model https://huggingface.co/Uberduck/ZeroShotRADTTS and compute the embedding for an audio clip using the speaker encoder model.

However, results will probably be better if you fine-tune. You can fine-tune the zero-shot model using is_zero_shot = True, or fine-tune a standard multispeaker model trained from scratch on LJ; fine-tuning the zero-shot model is probably the better option. You can also train a two-speaker model from scratch with data from LJ and the voice you are specifically creating.
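
Roughly, the flow looks like the sketch below. It assumes the coqui-TTS SpeakerManager API (encoder_model_path, encoder_config_path, compute_embedding_from_clip) and placeholder file names, and it reuses model_config and the encoded text tensor from your script above, so treat it as a sketch rather than a drop-in script.

# Sketch of zero-shot inference; assumed API and placeholder file names.
import torch
from TTS.tts.utils.speakers import SpeakerManager  # coqui-TTS speaker encoder wrapper

from uberduck_ml_dev.models.radtts import RADTTS

# 1. Compute a speaker embedding from a reference clip with the coqui encoder.
speaker_encoder = SpeakerManager(
    encoder_model_path="model_se.pth.tar",  # files from the speaker_encoder_model release
    encoder_config_path="config_se.json",
)
emb = speaker_encoder.compute_embedding_from_clip("reference_clip.wav")
audio_embedding = torch.FloatTensor(emb).unsqueeze(0)  # shape (1, 512)

# 2. Load the zero-shot RADTTS checkpoint (handles wrapped or bare weights).
model = RADTTS(**model_config)  # model_config from your JSON config, with is_zero_shot enabled
ckpt = torch.load("zero_shot_radtts.pt", map_location="cpu")  # placeholder path
model.load_state_dict(ckpt.get("state_dict", ckpt), strict=False)
model.eval()

# 3. Run inference, passing the embedding alongside the usual arguments.
with torch.no_grad():
    model_output = model.infer(
        torch.LongTensor([0]),  # speaker_id
        text,                   # encoded text tensor, e.g. from Data.get_text
        sigma=0.8,
        audio_embedding=audio_embedding,
    )
mels = model_output["mel"]

From there the vocoder step is the same as in your script.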

  2. Yes, you need to enable this part of the network. The reason is that we generally train the decoder first, and then train the duration, pitch, and energy attribute predictors. When training from scratch, we change several settings in the config file for the second phase of training:

    "include_modules": "decatndpmvpredapm",
    "binarization_start_iter": 0,
    "kl_loss_start_iter": 0,
    "learning_rate": 0.0005,
    "log_attribute_samples": true,
    "unfreeze_modules": "durf0energyvpred",
    "output_directory":  NEW_OUTPUT_DIR
    "warm_start_name":  PATH_TO_PREVIOUS_CHECKPOINT

If you load the model with "include_modules": "decatndpmvpredapm" you should be able to perform inference, but you'll probably need to train first (unless you load a pretrained set of attribute predictors).
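
For concreteness, here is a minimal sketch of producing a second-phase config from the demo config. The demo_config.json path and the output file name are assumptions, the two placeholders must be filled in, and each override is written into whichever section of the JSON already defines it, so double-check it against your config layout.

# Sketch: derive a phase-two (attribute predictor) config from the demo config.
import json

PHASE2_OVERRIDES = {
    "include_modules": "decatndpmvpredapm",
    "binarization_start_iter": 0,
    "kl_loss_start_iter": 0,
    "learning_rate": 0.0005,
    "log_attribute_samples": True,
    "unfreeze_modules": "durf0energyvpred",
    "output_directory": "NEW_OUTPUT_DIR",              # placeholder
    "warm_start_name": "PATH_TO_PREVIOUS_CHECKPOINT",  # placeholder
}


def set_key(section, key, value):
    # Overwrite `key` in the first (possibly nested) section that already defines it.
    if key in section:
        section[key] = value
        return True
    return any(
        isinstance(v, dict) and set_key(v, key, value) for v in section.values()
    )


with open("tutorials/radtts/demo_config.json") as f:
    config = json.load(f)

for key, value in PHASE2_OVERRIDES.items():
    if not set_key(config, key, value):
        config[key] = value  # key not present in the demo config yet

with open("tutorials/radtts/demo_config_phase2.json", "w") as f:
    json.dump(config, f, indent=4)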

toymaker222 commented 11 months ago
  1. When loading the zero-shot model (https://huggingface.co/Uberduck/ZeroShotRADTTS), a KeyError: 'state_dict' occurred. Does this mean the model doesn't match the latest code? I used RADTTS to load the zero-shot model.
  2. How was the zero-shot model trained? Was it trained directly using tutorials/radtts/demo_config.json and LJ data?
  3. I trained with the default tutorials/radtts/demo_config.json and LJ data on a TITAN Xp GPU, and training took about 1 day. Is that a reasonable amount of time? Training stopped automatically after only 130,000 steps.
  4. Regarding inference parameters, what value should be used for sigma? Should the rest be left at their defaults? Is audio_embedding the same embedding used for zero-shot, which can be computed with https://github.com/coqui-ai/TTS/releases/tag/speaker_encoder_model?
#RADTTS
def infer(
        self,
        speaker_id,
        text,
        sigma,
        sigma_dur=0.8,
        sigma_f0=0.8,
        sigma_energy=0.8,
        token_dur_scaling=1.0,
        token_duration_max=100,
        speaker_id_text=None,
        speaker_id_attributes=None,
        dur=None,
        f0=None,
        energy_avg=None,
        voiced_mask=None,
        f0_mean=0.0,
        f0_std=0.0,
        energy_mean=0.0,
        energy_std=0.0,
        audio_embedding=None,
        text_lengths=None,
    )
sjkoelle commented 11 months ago
  1. If you run torch.load(path_to_zero_shot_model).keys() you should see whether it has the state_dict key containing the model weights, or just the model weights themselves. If it's just the weights, you can convert them using OrderedDict(state_dict = torch.load(path_to_zero_shot_model)) (see the sketch after this list).

  2. Hmmm, that model was trained slightly differently. It was trained on mostly publicly available data from about 1500 speakers. But overall the training process was similar (is_zero_shot was true, and there were two training stages as described in my previous comment: one for the decoder, one for the attribute predictors, as determined by the include_modules parameter).

  3. That is a reasonable amount of time. We run for 200k to 400k steps, depending on batch size. You should definitely hear recognizable results after a few minutes and reasonable quality results after a few hours.

  4. Not really sure what the optimum sigma is. We just left it as default.
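
For point 1, here is a quick sketch of the check-and-wrap step (file names are placeholders):

# Check whether the downloaded checkpoint has a "state_dict" key and, if not,
# wrap the bare weights so the existing warmstart() code path works unchanged.
from collections import OrderedDict

import torch

ckpt = torch.load("zero_shot_radtts.pt", map_location="cpu")
print(list(ckpt.keys())[:5])  # either ["state_dict", ...] or raw parameter names

if "state_dict" not in ckpt:
    torch.save(OrderedDict(state_dict=ckpt), "zero_shot_radtts_wrapped.pt")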

toymaker222 commented 11 months ago
  1. I was able to load the zero-shot model using the method you provided, but the program still throws an error. Upon investigation, it seems there may be a problem with the dur results; I have included the erroneous values as comments at the end of the code. My input text is "hello world". Using an LJ model that I trained myself, inference works normally and generates wav files.

    #in models/radtts.py
    #class RADTTS 
    def infer(...):
      ...
      if dur is None:
            # TODO (Sam): replace non-controllable is_available with controllable global setting. This is useful for debugging.
            if torch.cuda.is_available():
                z_dur = torch.cuda.FloatTensor(batch_size, 1, n_tokens)
            else:
                z_dur = torch.FloatTensor(batch_size, 1, n_tokens)
            z_dur = z_dur.normal_() * sigma_dur
    
            dur = self.dur_pred_layer.infer(
                z_dur, txt_enc, spk_vec_text, lens=text_lengths
            )
            if dur.shape[-1] < txt_enc.shape[-1]:
                to_pad = txt_enc.shape[-1] - dur.shape[2]
                pad_fn = nn.ReplicationPad1d((0, to_pad))
                dur = pad_fn(dur)
            dur = dur[:, 0]   
            #dur=tensor([[[-0.2046, -0.1092,  0.0814, -0.1135,  0.2197,  0.0398,  0.4338, 0.3603,  0.3822,  0.2547, -0.0898, -0.0038]]], device='cuda:0')
    
            dur = dur.clamp(0, token_duration_max)
            #dur=tensor([[0.0000, 0.0000, 0.0814, 0.0000, 0.2197, 0.0398, 0.4338, 0.3603, 0.3822,0.2547, 0.0000, 0.0000]], device='cuda:0', grad_fn=<ClampBackward1>)
    
            dur = dur * token_dur_scaling if token_dur_scaling > 0 else dur
            #dur=tensor([[0.0000, 0.0000, 0.0814, 0.0000, 0.2197, 0.0398, 0.4338, 0.3603, 0.3822,0.2547, 0.0000, 0.0000]], device='cuda:0', grad_fn=<ClampBackward1>)
           #token_dur_scaling=1
    
            dur = (dur + 0.5).floor().int() 
            #dur=tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
    
        out_lens = dur.sum(1).long().cpu() if dur.shape[0] != 1 else [dur.sum(1)] #out_lens=tensor([0], device='cuda:0')
    
        max_n_frames = max(out_lens) #max_n_frames=tensor([0], device='cuda:0')
    
        out_lens = torch.LongTensor(out_lens).to(txt_enc.device) #tensor([0], device='cuda:0')
    
        txt_enc_time_expanded = self.length_regulator(
            txt_enc.transpose(1, 2), dur
        ).transpose(1, 2)
        #txt_enc_time_expanded torch.Size([1, 512, 0])
        #txt_enc torch.Size([1, 512, 12])
        #dur=tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], device='cuda:0', dtype=torch.int32)
    
      ...
Traceback (most recent call last):
  File "./uberduck-ml-dev/uberduck_ml_dev/exec/inference.py", line 159, in <module>
    model_output = model.infer(speaker_id, text, sigma=0.8)
  File "./uberduck-ml-dev/uberduck_ml_dev/models/radtts.py", line 778, in infer
    voiced_mask = self.v_pred_module.infer(
  File "./uberduck-ml-dev/uberduck_ml_dev/models/components/attribute_prediction_model.py", line 137, in infer
    x_hat = self.forward(txt_enc, spk_emb, x=None, lens=lens)["x_hat"]
  File "./uberduck-ml-dev/uberduck_ml_dev/models/components/attribute_prediction_model.py", line 127, in forward
    txt_enc = self.bottleneck_layer(txt_enc)
  File ".conda/envs/test-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "./uberduck-ml-dev/uberduck_ml_dev/models/components/attribute_prediction_model.py", line 99, in forward
    x = self.projection_fn(x)
  File ".conda/envs/test-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "./uberduck-ml-dev/uberduck_ml_dev/models/common.py", line 1521, in forward
    conv_signal = self.conv(signal)
  File ".conda/envs/test-env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1208, in _call_impl
    result = forward_call(*input, **kwargs)
  File ".conda/envs/test-env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 313, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File ".conda/envs/test-env/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 309, in _conv_forward
    return F.conv1d(input, weight, bias, self.stride,
RuntimeError: Calculated padded input size per channel: (2). Kernel size: (3). Kernel size can't be greater than actual input size
  1. Do I need to perform fine-tuning for both training stages, or is it enough to fine-tune only one of them? Should I use the same training configuration and code for fine-tuning, but with training data consisting of the voice of the character I want to clone? How much data is suitable for this purpose?

  2. If I interrupted the training while training the duration, pitch, and energy attribute predictors, and want to resume training from the last checkpoint, do I need to fill in the warm_start_name with the name of the last checkpoint? Is the warmstart_checkpoint_path key used to resume training for the decoder? What is the difference between warmstart_checkpoint_path and warm_start_name?