We introduce LibriTTS-P, a new corpus based on LibriTTS-R that includes utterance-level descriptions (i.e., prompts) of speaking style and speaker-level prompts of speaker characteristics. We employ a hybrid approach to construct prompt annotations: (1) manual annotations that capture human perceptions of speaker characteristics and (2) synthetic annotations on speaking style. Compared to existing English prompt datasets, our corpus provides more diverse prompt annotations for all speakers of LibriTTS-R. Experimental results for prompt-based controllable TTS demonstrate that the TTS model trained with LibriTTS-P achieves higher naturalness than the model using the conventional dataset. Furthermore, the results for style captioning tasks show that the model utilizing LibriTTS-P generates 2.5 times more accurate words than the model using a conventional dataset.
You can check the paper and demo page.
Files related to LibriTTS-P are under the data directory.
The details of each file are as follows:
df1_en.csv, df2_en.csv, df3_en.csv
excluded_spk_list.txt
unannotated_spk_list.txt
style_prompt_candidates_v230922.csv
M: male
p-low: pitch is low
s-slow: speaking speed is slow
e-low: loudness is low
metadata_w_style_prompt_tags_v230922.csv
The details of the columns in this CSV file are as follows:

| Name | Description |
|---|---|
| item_name | Name of the audio file |
| spk_id | Speaker ID |
| gender | Gender of the speaker |
| pitch | Pitch level of the audio |
| speaking_speed | Speaking speed level |
| energy | Energy level of the audio |
| content_prompt | Content prompt corresponding to the audio |
| style_prompt_key | Key for style_prompt_candidates_v230922.csv, indicating the style prompt associated with the audio |
| raw_f0_mean | Average F0 of the voiced parts of the audio |
| raw_f0_scale | Standard deviation of the F0 |
| raw_lf0_mean | Average of the log-F0 for the voiced parts |
| raw_lf0_scale | Standard deviation of the log-F0 |
| raw_speaking_rate | Number of syllables per second |
| raw_loudness_lufs | Loudness in LUFS (loudness units relative to full scale) |
| raw_loudness_mean | Average of the per-frame loudness values of the audio file |
| raw_loudness_scale | Standard deviation of the per-frame loudness values, indicating how much loudness varies across frames |
| invalid | Flag indicating whether the utterance has been marked as invalid due to missing F0, an invalid speaking rate (e.g., speaking_rate < 0), or other processing errors. 1 means invalid, and 0 means valid. |

(For detailed calculation methods of each item, please refer to the LibriTTS-P paper.)
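
As a rough illustration of how the columns above might be consumed, the following sketch reads the metadata with Python's standard `csv` module, drops rows flagged as `invalid`, and converts the `raw_*` columns to floats. The helper names `load_valid_rows` and `decode_style_key` are hypothetical and not part of this repository. Two things are assumptions rather than documented facts: that the `invalid` flag is stored as a numeric `0`/`1`, and that a style prompt key joins tags such as `M`, `p-low`, `s-slow`, and `e-low` with underscores.

```python
import csv

# Columns described as numeric measurements in the column table.
NUMERIC_COLUMNS = [
    "raw_f0_mean", "raw_f0_scale", "raw_lf0_mean", "raw_lf0_scale",
    "raw_speaking_rate", "raw_loudness_lufs",
    "raw_loudness_mean", "raw_loudness_scale",
]

# Tag prefixes from the legend (p-low, s-slow, e-low).
TAG_PREFIXES = {"p": "pitch", "s": "speaking_speed", "e": "energy"}


def load_valid_rows(fileobj):
    """Return metadata rows whose `invalid` flag is 0, with numeric columns parsed."""
    rows = []
    for row in csv.DictReader(fileobj):
        # Assumption: the flag is written as a number (0 = valid, 1 = invalid).
        if int(float(row["invalid"])) != 0:
            continue
        for name in NUMERIC_COLUMNS:
            value = row.get(name)
            # Missing or empty cells become None instead of raising.
            row[name] = float(value) if value else None
        rows.append(row)
    return rows


def decode_style_key(key, sep="_"):
    """Split a key such as 'M_p-low_s-slow_e-low' into labelled attributes.

    Assumption: tags are joined by `sep`; the real key format may differ.
    """
    decoded = {}
    for tag in key.split(sep):
        prefix, dash, level = tag.partition("-")
        if not dash:
            decoded["gender"] = tag  # e.g. "M" for male
        elif prefix in TAG_PREFIXES:
            decoded[TAG_PREFIXES[prefix]] = level
        else:
            decoded.setdefault("unrecognized", []).append(tag)
    return decoded
```

To try it on the real file, open `data/metadata_w_style_prompt_tags_v230922.csv` with `open(path, newline="", encoding="utf-8")` and pass the file object to `load_valid_rows`. Filtering on `invalid` first avoids feeding utterances with missing F0 or bad speaking rates into training.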
You can use audio from LibriTTS-R.
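
One possible way to pair a metadata row with its LibriTTS-R audio is sketched below. It assumes that `item_name` follows the LibriTTS utterance ID convention (`<speaker>_<chapter>_...`) and that the audio is laid out as `<root>/<subset>/<speaker>/<chapter>/<item_name>.wav`; neither is stated in this README, so check both against your local copy. The function names and the `root` argument are hypothetical.

```python
from pathlib import Path

# Subset directory names used by LibriTTS-style releases.
SUBSETS = (
    "train-clean-100", "train-clean-360", "train-other-500",
    "dev-clean", "dev-other", "test-clean", "test-other",
)


def candidate_audio_paths(root, item_name, subsets=SUBSETS):
    """List the paths where the audio for `item_name` could live, one per subset."""
    parts = item_name.split("_")
    if len(parts) < 2:
        raise ValueError(f"unexpected item_name format: {item_name!r}")
    speaker, chapter = parts[0], parts[1]
    return [
        Path(root) / subset / speaker / chapter / f"{item_name}.wav"
        for subset in subsets
    ]


def find_audio(root, item_name, subsets=SUBSETS):
    """Return the first candidate path that exists on disk, or None."""
    for path in candidate_audio_paths(root, item_name, subsets):
        if path.exists():
            return path
    return None
```

Because the metadata columns listed above do not include the subset, the sketch simply probes each subset directory in turn.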
@inproceedings{librittsp,
author={Masaya Kawamura and Ryuichi Yamamoto and Yuma Shirahata and Takuya Hasumi and Kentaro Tachibana},
title={LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning},
booktitle={Proc. Interspeech 2024},
month=sep,
year=2024
}