Unofficial implementation of the paper P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting by NVIDIA.

Audio samples are in the samples folder. LJSpeech pretrained ckpt - GDrive Link.
While recent large-scale neural codec language models have shown significant improvement in zero-shot TTS by training on thousands of hours of data, they suffer from drawbacks such as a lack of robustness, slow sampling speed similar to previous autoregressive TTS methods, and reliance on pre-trained neural codec representations. Our work proposes P-Flow, a fast and data-efficient zero-shot TTS model that uses speech prompts for speaker adaptation. P-Flow comprises a speech-prompted text encoder for speaker adaptation and a flow matching generative decoder for high-quality and fast speech synthesis. Our speech-prompted text encoder uses speech prompts and text input to generate a speaker-conditional text representation. The flow matching generative decoder uses the speaker-conditional output to synthesize high-quality personalized speech significantly faster than real-time. Unlike the neural codec language models, we specifically train P-Flow on the LibriTTS dataset using a continuous mel-representation. Through our training method using continuous speech prompts, P-Flow matches the speaker similarity performance of the large-scale zero-shot TTS models with two orders of magnitude less training data and has more than 20× faster sampling speed. Our results show that P-Flow has better pronunciation and is preferred over its recent state-of-the-art counterparts in human likeness and speaker similarity, thus defining P-Flow as an attractive and desirable alternative.
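For reference, the flow matching decoder is trained with a conditional flow matching objective; the sketch below uses the standard OT-CFM form (notation simplified; the exact conditioning details are as in the paper):

$$
\phi_t(x_0) = \big(1 - (1 - \sigma_{\min})\, t\big)\, x_0 + t\, x_1, \qquad
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t,\, x_0,\, x_1}\,
\big\| v_\theta\big(\phi_t(x_0), t \mid c\big) - \big(x_1 - (1 - \sigma_{\min})\, x_0\big) \big\|^2,
$$

where $x_1$ is the target mel spectrogram, $x_0 \sim \mathcal{N}(0, I)$, $c$ is the speaker-conditional text encoder output, and $\sigma_{\min}$ corresponds to the sigma_min value in the CFM config below.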
For a quick dry run and architecture test of the model (see the notebooks folder):

cd pflowtts_pytorch/notebooks
import sys
sys.path.append('..')
from pflow.models.pflow_tts import pflowTTS
import torch
from dataclasses import dataclass
@dataclass
class DurationPredictorParams:
    filter_channels_dp: int
    kernel_size: int
    p_dropout: float

@dataclass
class EncoderParams:
    n_feats: int
    n_channels: int
    filter_channels: int
    filter_channels_dp: int
    n_heads: int
    n_layers: int
    kernel_size: int
    p_dropout: float
    spk_emb_dim: int
    n_spks: int
    prenet: bool

@dataclass
class CFMParams:
    name: str
    solver: str
    sigma_min: float
# Example usage
duration_predictor_params = DurationPredictorParams(
    filter_channels_dp=256,
    kernel_size=3,
    p_dropout=0.1
)

encoder_params = EncoderParams(
    n_feats=80,
    n_channels=192,
    filter_channels=768,
    filter_channels_dp=256,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    spk_emb_dim=64,
    n_spks=1,
    prenet=True
)

cfm_params = CFMParams(
    name='CFM',
    solver='euler',
    sigma_min=1e-4
)
@dataclass
class EncoderOverallParams:
    encoder_type: str
    encoder_params: EncoderParams
    duration_predictor_params: DurationPredictorParams

encoder_overall_params = EncoderOverallParams(
    encoder_type='RoPE Encoder',
    encoder_params=encoder_params,
    duration_predictor_params=duration_predictor_params
)
@dataclass
class DecoderParams:
    channels: tuple
    dropout: float
    attention_head_dim: int
    n_blocks: int
    num_mid_blocks: int
    num_heads: int
    act_fn: str

decoder_params = DecoderParams(
    channels=(256, 256),
    dropout=0.05,
    attention_head_dim=64,
    n_blocks=1,
    num_mid_blocks=2,
    num_heads=2,
    act_fn='snakebeta',
)
model = pflowTTS(
    n_vocab=100,
    n_feats=80,
    encoder=encoder_overall_params,
    decoder=decoder_params.__dict__,
    cfm=cfm_params,
    data_statistics=None,
)
# dummy batch: token ids (batch 4, length 20), token lengths, mel targets (80 bins x 500 frames), mel lengths
x = torch.randint(0, 100, (4, 20))
x_lengths = torch.randint(10, 20, (4,))
y = torch.randn(4, 80, 500)
y_lengths = torch.randint(300, 500, (4,))

dur_loss, prior_loss, diff_loss, attn = model(x, x_lengths, y, y_lengths)
# backpropagate the loss
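# (sketch, not part of the original snippet) a manual optimisation step could look
# roughly like this; in the repo, training is handled by PyTorch Lightning and the
# unweighted sum of the three losses below is an assumption:
loss = dur_loss + prior_loss + diff_loss
loss.backward()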
# now synthesise
x = torch.randint(0, 100, (1, 20))
x_lengths = torch.randint(10, 20, (1,))
y_slice = torch.randn(1, 80, 264)  # 264-frame mel prompt (~3 s at 22.05 kHz with hop 256)
model.synthesise(x, x_lengths, y_slice, n_timesteps=10)
conda create -n pflowtts python=3.10 -y
conda activate pflowtts
Stay in the root directory (of course clone the repo first!)
cd pflowtts_pytorch
pip install -r requirements.txt
# Cython-version Monotonic Alignment Search
python setup.py build_ext --inplace
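Optionally, confirm that the extension built by importing it (the module path pflow.utils.monotonic_align is an assumption based on the usual layout of such repos; adjust it if your checkout differs):

python -c "from pflow.utils import monotonic_align"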
Let's assume we are training with LJ Speech, extracted to data/LJSpeech-1.1. Prepare the file lists to point to the extracted data, as in item 5 of the setup of the NVIDIA Tacotron 2 repo.

3a. Go to configs/data/ljspeech.yaml and change

train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt

to the paths of your train and validation filelists.
3b. Helper commands for the lazy
!mkdir -p /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists
!wget -O /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/ljs_audio_text_test_filelist.txt https://raw.githubusercontent.com/NVIDIA/tacotron2/master/filelists/ljs_audio_text_test_filelist.txt
!wget -O /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/ljs_audio_text_train_filelist.txt https://raw.githubusercontent.com/NVIDIA/tacotron2/master/filelists/ljs_audio_text_train_filelist.txt
!wget -O /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/ljs_audio_text_val_filelist.txt https://raw.githubusercontent.com/NVIDIA/tacotron2/master/filelists/ljs_audio_text_val_filelist.txt
!sed -i -- 's,DUMMY,/home/ubuntu/LJSpeech/LJSpeech-1.1/wavs,g' /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/*.txt
!sed -i -- 's,train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt,train_filelist_path: /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/ljs_audio_text_train_filelist.txt,g' /home/ubuntu/LJSpeech/pflowtts_pytorch/configs/data/ljspeech.yaml
!sed -i -- 's,valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt,valid_filelist_path: /home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/ljs_audio_text_val_filelist.txt,g' /home/ubuntu/LJSpeech/pflowtts_pytorch/configs/data/ljspeech.yaml
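Optionally, a quick sanity check that the rewritten filelists point at real wav files (lines follow the Tacotron 2 wav_path|transcript format; the filelist path below is just the one used in the commands above):

from pathlib import Path

filelist = Path("/home/ubuntu/LJSpeech/LJSpeech-1.1/filelists/ljs_audio_text_train_filelist.txt")
missing = 0
for line in filelist.read_text(encoding="utf-8").splitlines():
    wav_path = line.split("|")[0]  # each line is "wav_path|transcript"
    if not Path(wav_path).is_file():
        missing += 1
print(f"{missing} missing wav files in {filelist.name}")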
Next, generate the data statistics used for mel normalisation:

cd pflowtts_pytorch/pflow/utils
python generate_data_statistics.py -i ljspeech.yaml
# Output:
#{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}
Update these values in configs/data/ljspeech.yaml under the data_statistics key:

data_statistics: # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101
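These statistics are used to z-normalise the mel targets during training and to de-normalise the model output at synthesis time; conceptually (a minimal sketch, not the repo's exact code):

import torch

mel_mean, mel_std = -5.536622, 2.116101  # values computed above for LJSpeech

def normalise(mel: torch.Tensor) -> torch.Tensor:
    # mel: (n_mels, T) log-mel spectrogram
    return (mel - mel_mean) / mel_std

def denormalise(mel_norm: torch.Tensor) -> torch.Tensor:
    return mel_norm * mel_std + mel_mean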
Train the model:

python pflow/train.py experiment=ljspeech

For multi-GPU training:

python pflow/train.py experiment=ljspeech trainer.devices=[0,1]
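Since train.yaml exposes ckpt_path and transfer_ckpt_path (see the notes below), they should be overridable from the command line like any other Hydra option; the checkpoint location here is only a placeholder:

python pflow/train.py experiment=ljspeech ckpt_path=/path/to/last.ckpt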
Additional notes:

- Check the notebooks folder for a quick dry run and architecture test of the model.
- Generated samples are in the samples folder. (LJSpeech trained for 300k steps; the paper recommends 800k steps.)
- min_sample_size (default 4s) ensures that the prompt is at least 3s and the prediction is at least 1s.
- prompt_size in the configs (models/pflow.yaml) controls the size of the prompt, i.e. the number of mel frames taken from the wav sample as the prompt during training. (default is ~3s; 3*22050//256 = 258, rounded to 264 in the configs)
- dur_p_use_log in the configs (models/pflow.yaml) controls whether the log of the duration prediction is used for the loss calculation (default is now False). My hypothesis is that an MSE loss on log durations does not work well for longer pauses etc. (due to the nature of the log function), so we exponentiate the log durations before computing the loss. An alternative would be to use relu instead of log.
- transfer_ckpt_path in the configs (train.yaml) sets the checkpoint used for transfer learning (default is None). If None, the model is trained from scratch. If not None, the model is loaded from that checkpoint and trained from step 0. If ckpt_path is also not None, the model is loaded from ckpt_path and trained from the step it was saved at. transfer_ckpt_path can handle layer size mismatches between the checkpoint and the model; a sketch of such a shape-filtered load is shown after this list.
- Use export.py and inference.py to export and test the model (the arguments are self-explanatory).
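The shape-filtered transfer load is already implemented in the repo behind transfer_ckpt_path; the snippet below is only an illustrative sketch of the idea (the function name load_compatible_weights and the checkpoint layout are assumptions, not the repo's actual API):

import torch

def load_compatible_weights(model: torch.nn.Module, ckpt_path: str) -> None:
    # Illustrative sketch: keep only checkpoint tensors whose names and shapes
    # match the current model, so layer-size mismatches are silently skipped.
    ckpt = torch.load(ckpt_path, map_location="cpu")
    src_state = ckpt.get("state_dict", ckpt)  # Lightning checkpoints nest weights under "state_dict"
    dst_state = model.state_dict()
    kept = {k: v for k, v in src_state.items()
            if k in dst_state and dst_state[k].shape == v.shape}
    dst_state.update(kept)
    model.load_state_dict(dst_state)
    print(f"transferred {len(kept)}/{len(dst_state)} tensors from {ckpt_path}")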