LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

Reproducing the training setup for the TTS system in the LibriTTS-P paper #2

Open ajd12342 opened 1 month ago

ajd12342 commented 1 month ago

Hello! Thank you for the LibriTTS-P dataset release and paper. I am interested in reproducing your prompt-based controllable TTS system and had a few questions about how you select the appropriate speaker prompts for each utterance in the dataset.

  1. The paper explains that three different annotators labelled each speaker with perception and impression words, so there are three annotations per speaker. How did you select which annotation to use for each utterance of a given speaker?
  2. Could you release the list of templates you used, such as 'The speaker’s identity can be described as...' and 'Descriptions of the speaker’s vocal style are...'?

Thanks in advance!

r9y9 commented 1 week ago

Hi, sorry for the late reply. To answer your questions:

  1. AFAIK, during each training iteration, one of the three annotators is randomly sampled with equal probability for each utterance; a sketch of this sampling step is shown after this list.
  2. See https://github.com/line/promptttspp/blob/3e6bd0eaa7d0bfadb5f33a530726dd78efc748dd/promptttspp/datasets/all_with_spk_prompt_norm.py#L141-L159
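
For illustration, here is a minimal sketch of that sampling step, assuming the three annotations per speaker are stored as lists of prompt strings keyed by speaker ID. The names `annotations_by_speaker` and `sample_annotation` (and the example strings) are hypothetical, not taken from the promptttspp codebase:

```python
import random

# Hypothetical storage: speaker ID -> the three annotators' descriptions.
annotations_by_speaker: dict[str, list[str]] = {
    "spk_0001": [
        "A calm, low-pitched male voice.",
        "He speaks slowly with a deep, relaxed tone.",
        "A soft and gentle masculine voice.",
    ],
}

def sample_annotation(speaker_id: str) -> str:
    """Pick one of the three annotators' descriptions uniformly at random.

    Called per utterance per training iteration, so the model sees all
    three annotations for each speaker over the course of training.
    """
    return random.choice(annotations_by_speaker[speaker_id])

# Each call may return a different annotator's description.
print(sample_annotation("spk_0001"))
```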

For the full details of training the PromptTTS++ baseline system, see https://github.com/line/promptttspp
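
For reference, here is a simplified sketch of how such templates might be filled with the sampled annotation words to form a full speaker prompt. The two templates below are the ones quoted in the question; the actual template list and combination logic live in the linked `all_with_spk_prompt_norm.py`, and `build_speaker_prompt` is a hypothetical name:

```python
import random

# Illustrative subset: the two templates quoted above. The full set is
# defined in promptttspp/datasets/all_with_spk_prompt_norm.py.
TEMPLATES = [
    "The speaker's identity can be described as {}.",
    "Descriptions of the speaker's vocal style are {}.",
]

def build_speaker_prompt(style_words: list[str]) -> str:
    """Fill a randomly chosen template with comma-joined style words."""
    template = random.choice(TEMPLATES)
    return template.format(", ".join(style_words))

# Example usage with hypothetical perception/impression words.
print(build_speaker_prompt(["calm", "low-pitched", "masculine"]))
```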