Confusion on the prompt diversity

jishengpeng / TextrolSpeech

TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models (2024 ICASSP)

MIT License


Confusion on the prompt diversity #2

Open KeiKinn opened 4 months ago

KeiKinn commented 4 months ago

Hi, thank you for your great work.

I read the paper and downloaded the dataset, but I still do not fully understand '500 distinct natural text description'. It seems to be a very important statement in your paper. Where does it come from? How do you define 'diversity' for each style? Do audio clips that share the same 'gender', 'pitch', ... have different style prompts? Could you please explain it more clearly?

jishengpeng commented 4 months ago

Thank you for your interest. Prompt diversity means that each style (for example, high speech rate, high pitch, or low-energy speech) is paired with 500 distinct natural text prompts. This is fundamental to the design of TextrolSpeech. You can see this by downloading the files in the Val directory. In contrast, the style descriptions in PromptTTS contain only a few text prompts per style, which is insufficient for effective model training.
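
For readers who want to check this on the downloaded files, here is a minimal sketch of how one might count the distinct prompts per style. It is not taken from the repository: the column names (`gender`, `pitch`, `speed`, `energy`, `prompt`) and the CSV layout are assumptions, and the sample rows are made up, so adapt them to whatever the files in the Val directory actually contain.

```python
import csv
import io
from collections import defaultdict

# Hypothetical column names; the real metadata files may be laid out differently.
STYLE_COLUMNS = ("gender", "pitch", "speed", "energy")
PROMPT_COLUMN = "prompt"


def distinct_prompts_per_style(rows):
    """Map each style combination to the number of distinct prompts seen for it."""
    prompts = defaultdict(set)
    for row in rows:
        style = tuple(row[col] for col in STYLE_COLUMNS)
        # Normalise case and whitespace so trivial variants are not counted twice.
        prompts[style].add(" ".join(row[PROMPT_COLUMN].lower().split()))
    return {style: len(texts) for style, texts in prompts.items()}


# Tiny made-up sample standing in for a metadata file.
SAMPLE_CSV = """gender,pitch,speed,energy,prompt
F,high,fast,low,A woman speaks quickly in a high and quiet voice.
F,high,fast,low,Her high-pitched voice is soft and hurried.
F,high,fast,low,her high-pitched voice is soft and  hurried.
M,low,slow,high,A man talks slowly and loudly in a deep voice.
"""

sample_rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = distinct_prompts_per_style(sample_rows)
for style, n in sorted(counts.items()):
    print(style, n)
```

To run it on real data, replace `io.StringIO(SAMPLE_CSV)` with an opened metadata file. Clips that share the same style labels but carry different prompt texts then show up as a count greater than one for that style.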