An implementation of a TTS model based on the paper "P-Flow: A Fast and Data-Efficient Zero-Shot TTS through Speech Prompting". The main differences between this implementation and the paper are listed in the Differences section.
I have two goals for this project. It seems to work, but it performs poorly on Japanese and on numbers.
Generated samples from a model trained on LibriTTS-R and on the Korean and Japanese corpora of the AIHub 131 dataset. All samples are decoded with the MultiBand-Diffusion model from AudioCraft. The pretrained checkpoint used here is available on Hugging Face.
You can check how to use it in the sample notebook.
Currently, the speaker embedding of the multilingual model seems to be highly entangled with language information, which results in worse zero-shot capability. I plan to train a new model with a language ID to reduce language bias in the speaker embedding.
Code-switch Text: There's famous japanese sentence, つきがきれいですね, which means 나는 당신을 사랑합니다.
English Prompt Generation
https://github.com/seastar105/pflow-encodec/assets/30820469/57a0450b-e1b2-48b6-b0ec-9433806edb10
Japanese Prompt Generation
https://github.com/seastar105/pflow-encodec/assets/30820469/bf5e4c29-2545-411a-adbc-b461a5c2cefa
Korean Prompt Generation
https://github.com/seastar105/pflow-encodec/assets/30820469/74f2ff7a-554d-4797-9841-a8b7b74d9fbf
English Text: P-Flow encodec is Text-to-Speech model trained on Encodec latent space, using Flow Matching.
Prompt Audio (from LibriTTS-R)
https://github.com/seastar105/pflow-encodec/assets/30820469/a3c1b3d8-ea94-4cb7-bd21-7226e3fd54b1
Generated Audio
https://github.com/seastar105/pflow-encodec/assets/30820469/1de00f81-4c87-402e-a4bc-66deb29c194d
Japanese Text: こんにちは、初めまして。あなたの名前はなんですか?これは音声合成モデルから作られた音声です。
Prompt Audio (from JSUT)
https://github.com/seastar105/pflow-encodec/assets/30820469/fb4f1a10-fb8b-413e-8bec-d1d0f58d8423
Generated Audio
https://github.com/seastar105/pflow-encodec/assets/30820469/137d4e34-f674-4681-a652-93c4a44f4554
Korean Text: 백남준은 미디어 아트의 개척자로서 다양한 테크놀로지를 이용하여 실험적이고 창의적으로 작업했다.
Prompt Audio (from KSS)
https://github.com/seastar105/pflow-encodec/assets/30820469/db3435d0-8e8f-45ef-b3b3-a164ad316d71
Generated Audio
https://github.com/seastar105/pflow-encodec/assets/30820469/8dff38ec-a2d7-49a6-80fb-de6012b33a1b
I developed this project on WSL under Windows 11. I have not tested it on other platforms or with other torch versions. I recommend using a conda environment.
conda create -n pflow-encodec -y python=3.10
conda activate pflow-encodec
conda install -y pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y -c conda-forge libsndfile==1.0.31
pip install -r requirements.txt
pip install -r infer-requirements.txt
First of all, you need to prepare a tsv file that contains three columns: `audio_path`, `text`, and `duration`. The columns are separated by tabs.
`audio_path` is the path to the audio file, `text` is the transcript of the audio file, and `duration` is the duration of the audio file in seconds.
audio_path text duration
/path/to/audio1.wav Hello, World! 1.5
/path/to/audio2.wav 안녕하세요, 세계! 2.0
/path/to/audio3.wav こんにちは、世界! 2.5
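A minimal sketch of how such a metadata file could be generated, assuming the audio files are PCM WAV files readable by Python's standard `wave` module. The helper names `wav_duration_seconds` and `write_metadata_tsv` are hypothetical and not part of this repository. For other audio formats you would need a different way to measure the duration.

```python
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_metadata_tsv(items, tsv_path):
    """Write a tab-separated metadata file with a header row.

    `items` is an iterable of (audio_path, transcript) pairs.
    The duration column is measured from each WAV file.
    """
    with open(tsv_path, "w", encoding="utf-8") as f:
        f.write("audio_path\ttext\tduration\n")
        for audio_path, text in items:
            # A tab or newline inside the transcript would break the column layout.
            if "\t" in text or "\n" in text:
                raise ValueError(f"transcript for {audio_path} contains a tab or newline")
            duration = wav_duration_seconds(audio_path)
            f.write(f"{audio_path}\t{text}\t{duration:.3f}\n")


# Usage (paths are placeholders):
# write_metadata_tsv([("/path/to/audio1.wav", "Hello, World!")], "train.tsv")
```

Rejecting transcripts that contain tabs keeps every row at exactly three columns, which is what the format described above expects.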
Here, the Encodec latent is used as the model output, and the per-token duration is used as the target of the duration predictor.
You can dump the Encodec latents and the sentencepiece token durations with the following commands.
python scripts/dump_durations.py --input_tsv <meta_tsv_file>
python scripts/dump_latents.py --input_tsv <meta_tsv_file>
These commands require a GPU, and `scripts/dump_durations.py` may require more than 8GB of GPU memory.
`scripts/dump_durations.py` takes about 6 hours for 1000 hours of audio, and `scripts/dump_latents.py` takes about 4 hours for 1000 hours of audio. Both times were measured on an RTX 4090.
The scripts write two files per audio file: `<audio_path stem>.latent.npy` and `<audio_path stem>.duration.npy`.
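Before training, it can be useful to confirm that both dumps exist for every row of the metadata file. The sketch below assumes that the dumped files are written next to each audio file and share its stem; that location is an assumption, and `expected_dump_paths` and `find_missing_dumps` are hypothetical helper names, not part of this repository.

```python
from pathlib import Path


def expected_dump_paths(audio_path):
    """Return the (latent, duration) dump paths expected for one audio file.

    Assumption: dumps live next to the audio file and share its stem,
    e.g. /data/a.wav -> /data/a.latent.npy and /data/a.duration.npy.
    """
    stem = Path(audio_path).with_suffix("")
    return (Path(f"{stem}.latent.npy"), Path(f"{stem}.duration.npy"))


def find_missing_dumps(tsv_path):
    """List the expected dump files that do not exist for a metadata TSV."""
    missing = []
    with open(tsv_path, encoding="utf-8") as f:
        next(f, None)  # skip the header row
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            audio_path = line.split("\t")[0]
            missing.extend(p for p in expected_dump_paths(audio_path) if not p.exists())
    return missing


# Usage (path is a placeholder):
# for path in find_missing_dumps("train.tsv"):
#     print("missing:", path)
```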
NOTE: `scripts/dump_latents.py` prints the global mean and std of the dataset's latents. Keep these values, since they are used when training the model.
Now, you can start training.
This repository's code is based on lightning-hydra-template.
After preparing the dataset, set up a dataset config and an experiment config, and then start training. Let your dataset name be `new_dataset`. First, set the dataset config in `configs/data/new_dataset.yaml`.
_target_: pflow_encodec.data.datamodule.TextLatentLightningDataModule
train_tsv_path: <train_tsv_path>
val_tsv_path: <val_tsv_path>
add_trailing_silence: True
batch_durations: 50.0 # mini-batch duration in seconds
min_duration: 3.5 # minimum duration of files, this value MUST be bigger than 3.0
max_duration: 15.0
boundaries: [3.0, 5.0, 7.0, 10.0, 15.0]
num_workers: 8
return_upsampled: False
max_frame: 1500 # 20s
text2latent_rate: 1.5 # 50Hz:75Hz
mean: <mean>
std: <std>
Fill `<train_tsv_path>`, `<val_tsv_path>`, `<mean>`, and `<std>` with your dataset's meta paths and mean/std values.
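The frame-related values in the dataset config can be sanity-checked with a little arithmetic. The sketch below assumes a latent frame rate of 75 Hz and a text-side rate of 50 Hz, both read off the `# 50Hz:75Hz` comment in the config; the helper names are hypothetical, and the check that `max_duration` fits into `max_frame` is an assumption rather than a documented rule.

```python
LATENT_FRAME_RATE_HZ = 75.0  # assumption, from the "50Hz:75Hz" config comment
TEXT_FRAME_RATE_HZ = 50.0    # assumption, from the same comment


def frames_to_seconds(num_frames, frame_rate_hz=LATENT_FRAME_RATE_HZ):
    """Convert a number of latent frames to seconds."""
    return num_frames / frame_rate_hz


def seconds_to_frames(seconds, frame_rate_hz=LATENT_FRAME_RATE_HZ):
    """Convert seconds to a (rounded) number of latent frames."""
    return round(seconds * frame_rate_hz)


def check_duration_limits(min_duration, max_duration, max_frame):
    """Raise ValueError if the duration settings look inconsistent."""
    # Stated in the config comment: min_duration MUST be bigger than 3.0.
    if min_duration <= 3.0:
        raise ValueError("min_duration must be bigger than 3.0 seconds")
    # Assumption: the longest allowed file should fit into max_frame latent frames.
    if seconds_to_frames(max_duration) > max_frame:
        raise ValueError("max_duration does not fit into max_frame latent frames")


print(frames_to_seconds(1500))                    # 20.0, matches "max_frame: 1500 # 20s"
print(LATENT_FRAME_RATE_HZ / TEXT_FRAME_RATE_HZ)  # 1.5, matches text2latent_rate
print(seconds_to_frames(15.0))                    # 1125 frames for max_duration: 15.0
check_duration_limits(min_duration=3.5, max_duration=15.0, max_frame=1500)
```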
Then, create a config in `configs/experiment/new_dataset.yaml` based on `configs/experiment/default.yaml`.
# @package _global_
defaults:
  - override /data: new_dataset.yaml # your dataset config name here!!!
  - override /model: pflow_base.yaml
  - override /callbacks: default.yaml
  - override /trainer: gpu.yaml
  - override /logger: tensorboard.yaml
task_name: pflow
tags: ["pflow"]
seed: 998244353
test: False
callbacks:
  val_checkpoint:
    filename: "val_latent_loss_{val/latent_loss:.4f}-{step:06d}"
    monitor: val/latent_loss
    mode: "min"
model:
  scheduler:
    total_steps: ${trainer.max_steps}
    pct_start: 0.02
  sample_freq: 5000
  sample_idx: [] # sample indices used for sampling while train. idx will be used to choose samples from validation dataset. so this value should not be greater than len(val_dataset)
  mean: ${data.mean}
  std: ${data.std}
trainer:
  max_steps: 500000
  max_epochs: 10000 # arbitrary large number
  precision: bf16-mixed # you should check if your GPU supports bf16
  accumulate_grad_batches: 4 # effective batch size
  gradient_clip_val: 0.2
  num_nodes: 1
  devices: 1
hydra:
  run:
    dir: <fill experiment result path>
Now you can run training with the following command.
python pflow_encodec/train.py experiment=new_dataset
NOTE: If you want to train the model with multiple GPUs, adjust `trainer.num_nodes` and `trainer.devices` in the experiment config. You should also set `trainer.use_distributed_sampler` to False. For more detailed information, check the PyTorch Lightning documentation.
Example for a single node with 4 GPUs:
trainer:
  num_nodes: 1
  devices: 4
  use_distributed_sampler: False
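When changing the number of GPUs, it may help to estimate how much audio contributes to one optimizer step. The sketch below assumes that every device draws its own mini-batch of at most `batch_durations` seconds and that gradients are accumulated over `accumulate_grad_batches` mini-batches; this is an assumption about how the sampler behaves, not something this README confirms, and the function name is hypothetical.

```python
def max_audio_seconds_per_step(batch_durations, accumulate_grad_batches, devices=1, num_nodes=1):
    """Upper bound on the seconds of audio used for one optimizer step.

    Assumption: each device processes its own mini-batch of up to
    `batch_durations` seconds, so the bound scales with the device count.
    """
    return batch_durations * accumulate_grad_batches * devices * num_nodes


# Values from the configs above: batch_durations=50.0, accumulate_grad_batches=4.
print(max_audio_seconds_per_step(50.0, 4, devices=1))  # 200.0
print(max_audio_seconds_per_step(50.0, 4, devices=4))  # 800.0
```

Under this assumption, moving from 1 to 4 GPUs quadruples the audio per step, so you may want to lower `accumulate_grad_batches` to keep the effective batch size comparable.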
Language | Weights | Model Card |
---|---|---|
MultiLingual(EJK) | 🤗 Hub | Link |
English | 🤗 Hub | |
Japanese | 🤗 Hub | |
Korean | 🤗 Hub | |
I did not conduct ablation studies for each change due to a lack of resources.