huggingface / parler-tts

Inference and training library for high-quality TTS models.
Apache License 2.0

Audio Length Limitation and FlashAttention Warning in Parler TTS #126

Open suman819 opened 2 months ago

suman819 commented 2 months ago

I have been working with Parler TTS and encountered an issue where I am unable to generate audio longer than 20 seconds. Despite trying various methods, such as streaming and splitting the text into chunks, the audio output is still truncated to around 15-20 seconds.

Additionally, I received a warning stating that FlashAttention is not installed. Could this be the cause of the issue? I would appreciate any guidance or suggestions on how to handle longer input text effectively.

dhaivat1729 commented 2 months ago

I have the same issue. Audio length is truncated.

kunci115 commented 2 months ago

The default training configuration in parler-tts caps audio at 30 sec and text length at 600: https://github.com/huggingface/parler-tts/blob/main/training/README.md#3-training

Either you fine-tune it on longer data, or you send the text in pieces: if your text would produce more than 30 sec of audio or its length is > 600, just split it on punctuation (.,)
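
A minimal sketch of the splitting approach suggested above, assuming the 600 limit is counted in characters (the training config may count tokens instead). The function name `split_text` and its `max_chars` parameter are hypothetical and not part of parler-tts:

```python
import re


def split_text(text, max_chars=600):
    """Split text at punctuation into chunks of at most max_chars characters.

    A single piece that is longer than max_chars on its own is kept whole.
    """
    # Break after '.', ',', '!' or '?' when followed by whitespace,
    # keeping the punctuation attached to the preceding piece.
    pieces = [p.strip() for p in re.split(r"(?<=[.,!?])\s+", text) if p.strip()]

    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars or not current:
            # Still fits (or the chunk is empty): keep growing the chunk.
            current = candidate
        else:
            # Adding the piece would overflow: close the chunk, start a new one.
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


chunks = split_text("Hello there. This is a test, ok. Bye.", max_chars=20)
```

Each chunk can then be passed to the model separately. Packing several short clauses into one chunk keeps the number of generation calls, and therefore the number of audible joins, lower than splitting on every comma.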

suman819 commented 2 months ago

I have already applied the suggested method of splitting the text if it exceeds 30 seconds or 600 characters by using punctuation (.,). However, when I combine the audio segments, there is an inconsistency in the voice tone, even when a specific voice prompt is set.

b-feldmann commented 2 months ago

I could get it to work with this PR: https://github.com/huggingface/parler-tts/pull/110

The main idea is to generate once with a small prompt like "This is my prefix prompt." and store the encoded result. Afterward, generate the remaining sentences one by one with the encoded result passed as decoder_input_ids.

You then need to remove the encoded prefix audio from each output to get consistent results without the prefix prompt.
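
The prefix-removal step could look roughly like the sketch below. It only illustrates the slicing and joining on plain Python lists and does not call parler-tts. It assumes every generated output starts with exactly `prefix_len` samples (or frames) belonging to the prefix prompt, which may not match how the PR handles it. The helper names `strip_prefix` and `join_segments` are hypothetical:

```python
def strip_prefix(samples, prefix_len):
    """Drop the leading prefix_len items that belong to the prefix prompt."""
    return samples[prefix_len:]


def join_segments(segments, prefix_len):
    """Strip the shared prefix from every segment and concatenate the rest."""
    joined = []
    for segment in segments:
        joined.extend(strip_prefix(segment, prefix_len))
    return joined


# Toy data: the two zeros stand in for the prefix audio in each output.
outputs = [[0, 0, 1, 2], [0, 0, 3]]
audio = join_segments(outputs, prefix_len=2)
```

The same slicing works on NumPy arrays or tensors along the time axis. How `prefix_len` is obtained depends on the length of the stored prefix encoding.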

cesinsingapore commented 2 months ago

Do you guys also get output that is not fluent (in a strange way) when Parler runs inference on numbers and letters? For example: "my id card is 5o613123jkl"

Guppy16 commented 2 months ago

Perhaps you can also experiment with the min_new_tokens parameter. I believe in ParlerTTS a single audio token represents ~12 ms of audio (about 86 tokens per second), so if you want to generate 20 secs, that would be 1720 tokens.

model.generate(min_new_tokens=1720, **generation_kwargs)
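
The 1720 figure corresponds to 20 seconds at about 86 audio tokens per second. Here is a small sketch of that arithmetic for other durations, assuming the same rate; the helper name `seconds_to_tokens` and the default of 86 are assumptions inferred from the numbers in this comment, not values read from the library:

```python
import math


def seconds_to_tokens(seconds, tokens_per_second=86):
    """Convert a target duration to a token count, rounding up."""
    return math.ceil(seconds * tokens_per_second)


min_tokens = seconds_to_tokens(20)
```

The result can then be passed as `min_new_tokens` in the call above. Note that forcing a minimum length does not lift the 30 sec limit from training mentioned earlier in the thread, so this is likely only useful for targets below that limit.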