huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0
4.23k stars, 418 forks

Audio Length Limitation and FlashAttention Warning in Parler TTS #126

Open suman819 opened 3 weeks ago

suman819 commented 3 weeks ago

I have been working with Parler TTS and encountered an issue where I am unable to generate audio longer than 20 seconds. Despite trying various methods, such as streaming and splitting the text into chunks, the audio output is still truncated to around 15-20 seconds.

Additionally, I received a warning stating that FlashAttention is not installed. Could this be the cause of the issue? I would appreciate any guidance or suggestions on how to handle longer input text effectively.

dhaivat1729 commented 2 weeks ago

I have the same issue. Audio length is truncated.

kunci115 commented 2 weeks ago

The default training configuration in parler-tts caps audio at 30 seconds and text length at 600: https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training

Either fine-tune it with longer data, or send the input in pieces: if your text would produce more than 30 seconds of audio or is longer than 600 characters, split it at punctuation (.,).
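For illustration, the splitting step could look roughly like the sketch below. It is not part of parler-tts: the `split_text` helper and the `MAX_CHARS` constant are hypothetical names, and the 600 limit is assumed to count characters. It breaks only at a period or comma that is followed by whitespace, so something like "3.14" stays in one piece.

```python
import re

MAX_CHARS = 600  # limit quoted in this thread; assumed to be a character count


def split_text(text, max_chars=MAX_CHARS):
    """Greedily pack punctuation-delimited pieces into chunks of at most max_chars."""
    # Break after '.' or ',' only when whitespace follows.
    pieces = re.split(r"(?<=[.,])\s+", text.strip())
    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single piece longer than max_chars is kept whole, not cut mid-word.
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be passed to the model as a separate generation request.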

suman819 commented 2 weeks ago

I have already applied the suggested method of splitting the text if it exceeds 30 seconds or 600 characters by using punctuation (.,). However, when I combine the audio segments, there is an inconsistency in the voice tone, even when a specific voice prompt is set.

b-feldmann commented 1 week ago

I could get it to work with this PR: https://github.com/huggingface/parler-tts/pull/110

The main idea is to generate once with a short prompt such as "This is my prefix prompt." and store the encoded result. Afterward, generate each sentence with that encoded result passed as decoder_input_ids.

You then need to remove the encoded prefix audio from each output to get consistent results without the prefix prompt.
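As a rough sketch of that last step only: assuming each output begins with the same prefix and that the prefix length is known, in whatever unit the outputs use (audio samples or encoded frames), the prefix can be cut off before joining the pieces. The `strip_prefix_and_join` helper is hypothetical and works on plain Python sequences. How the PR actually represents and removes the prefix may differ.

```python
def strip_prefix_and_join(segments, prefix_len):
    """Drop the first prefix_len items of every segment and concatenate the rest.

    Assumption: every segment starts with exactly prefix_len items that belong
    to the shared prefix prompt.
    """
    joined = []
    for index, segment in enumerate(segments):
        if len(segment) < prefix_len:
            raise ValueError(f"segment {index} is shorter than the prefix")
        joined.extend(segment[prefix_len:])
    return joined
```

For example, `strip_prefix_and_join([prefix + a, prefix + b], len(prefix))` returns the items of `a` followed by those of `b`.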

cesinsingapore commented 1 week ago

Is anyone else finding that Parler is not fluent (in a strange way) when the input mixes numbers and letters? For example: "my id card is 5o613123jkl"