oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

Generic TTS audio streaming feature #4706

Closed Sascha353 closed 8 months ago

Sascha353 commented 9 months ago

In short: this is about TTS generation and audio playback happening during text generation, instead of waiting until the whole response has been generated by the LLM.

"In longer": I'd love to see a feature which makes it possible to support audio streaming in text-generation-webui, as it would increase response time especially for longer answers and if it would be a generic solution, it could potentially solve the problem with tts engines not being able to handle long text inputs. However I think there needs to be some essential changes to the way streaming works in text-generatio-webui, as well as how the returned audio is handled. I think the changes required could potentially be utilized by all tts engines. The following things are important in my opinion and from a very high level view (as I'm not a dev):

1. Streaming mode: text-generation-webui's streaming mode has nothing to do with the TTS engine being able to stream the response audio. In fact, streaming mode is incompatible with all TTS engines as far as I can tell, which is why it currently has to be disabled when using TTS. The reason is that you can't feed a TTS engine individual words as they are produced by the LLM, so we need to wait at least until the first sentence is finished. For very short sentences like "Ok!" it is most likely better to accumulate further sentences until the input text reaches a predefined length (see the chunking sketch after this list).

2. Calling the TTS engine / handling the returned audio: there are at least two ways to hand the input text over to the TTS engine, depending on whether the engine supports streaming or not. If it does not (which would be the generic approach), text-generation-webui simply calls the TTS engine once per chunk of sentences. This yields multiple audio files that need to be played in succession, AND the playback control needs to support that, since the user should still be able to pause/play/download the wav file (I don't think it's a good idea to add a separate playback control for each returned audio). There also needs to be some queuing before the TTS engine is called and after the audio is returned, so that sentence chunks and audio are processed in the correct order. Also, if every audio response is to be concatenated in text-generation-webui, it must be made sure to write a WAV header only once and append the subsequent audio chunks without further headers, so there is one continuous audio response (a concatenation sketch follows the example link below). If the TTS engine does support streaming, text-generation-webui needs to support byte streaming, which has its own challenges, for example buffering to make sure playback doesn't get ahead of the inference stream, playback controls for an audio stream in the UI, etc. (Gradio must support that: https://github.com/gradio-app/gradio/issues/5160)
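A minimal sketch of the sentence-accumulation idea from point 1, assuming a Python implementation; the regex splitter, the `token_stream` argument and the 60-character threshold are illustrative choices, not anything from the webui code, and a real implementation would need smarter sentence detection (abbreviations, numbers, etc.):

```python
import re

# Split on whitespace that follows sentence-ending punctuation.
SENTENCE_END = re.compile(r'(?<=[.!?])\s+')

def sentence_chunks(token_stream, min_chars=60):
    """Accumulate streamed LLM tokens and yield chunks of complete sentences
    that are long enough to be worth sending to the TTS engine."""
    buffer = ""
    for token in token_stream:
        buffer += token
        parts = SENTENCE_END.split(buffer)
        if len(parts) > 1:
            complete, buffer = " ".join(parts[:-1]), parts[-1]
            if len(complete) >= min_chars:
                yield complete
            else:
                # Too short on its own (e.g. just "Ok!"): keep accumulating.
                buffer = complete + " " + buffer
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains when generation ends

# Hypothetical usage, with made-up names for the LLM and TTS objects:
# for chunk in sentence_chunks(llm.stream_tokens(prompt)):
#     tts_engine.synthesize(chunk)   # one TTS call per sentence chunk
```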

As an example, the coqui space on huggingface does it more or less like that: https://huggingface.co/spaces/coqui/voice-chat-with-mistral
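For the non-streaming path in point 2, the "write one WAV header, then append only raw frames" idea could look roughly like this with Python's standard-library wave module; the per-chunk file names are hypothetical, and all chunks are assumed to share the same sample rate, sample width and channel count:

```python
import wave

def concat_wav_chunks(chunk_paths, out_path):
    """Merge per-sentence WAV files into one continuous WAV file.

    The output gets a single header; only the raw audio frames of each
    chunk are appended, so the result plays back as one uninterrupted clip.
    """
    with wave.open(out_path, "wb") as out:
        params_set = False
        for path in chunk_paths:
            with wave.open(path, "rb") as chunk:
                if not params_set:
                    out.setparams(chunk.getparams())  # header written once
                    params_set = True
                out.writeframes(chunk.readframes(chunk.getnframes()))

# Hypothetical usage with files produced by the TTS engine per sentence chunk:
# concat_wav_chunks(["chunk_000.wav", "chunk_001.wav"], "response.wav")
```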

erew123 commented 9 months ago

Sorry, I misread some of your text before and went off on a tangent!

One thing I would like to suggest, if you can process ahead, is that in low-VRAM situations voice generation is slow. It would be great to have the option to use CPU + system RAM for generation.

In some very low VRAM scenarios, I've found CPU generation is about on a par with GPU generation, sometimes even a little faster. (Obviously people's mileage will vary depending on their CPU.)

By low VRAM I mean: loading a 13B model on a 12GB card uses between 11.4GB and 11.7GB before you even start thinking about doing TTS. It does still work, but GPU TTS generation in that kind of scenario goes from 10-20 seconds up to 60-120 seconds. TTS on CPU (8 cores / 16 threads) seems to come in around the 50-120 second mark, so if it were processed alongside the text generation, it would shorten the wait before you hear something.

erew123 commented 9 months ago

I have a few other thoughts on this.

1) This will not be good for anyone in a low-VRAM scenario (after loading an AI model), as my suspicion is there would be too much shuffling of the TTS engine and the model layers for any performance benefit. It should be perfectly fine for people who have a couple of GB free after loading their model's layers.

I have, however, possibly hit on a decent performance gain for people in low-VRAM situations: https://github.com/oobabooga/text-generation-webui/issues/4712

2) I tested a 240-token / 1005-character paragraph to see how the processing time for the entire paragraph compares with processing individual sentences. I performed multiple generations and averaged across them. My findings were:

Average processing time, whole paragraph: 42.5 seconds (before you hear anything spoken).
Average processing time, individual sentences (totalled up): 53.7 seconds (but you could start hearing things after maybe 5-10 seconds).
Average difference: 23.3% slower than creating the whole paragraph in one go.

3) Native built-in streaming appears to be coming in a later version of XTTS: https://github.com/coqui-ai/TTS/discussions/3197#discussioncomment-7586607 (not sure how soon).

Sascha353 commented 9 months ago

But as long as the real-time factor (RTF) is under 1, it doesn't matter by how many percent the "chunked inference" is slower than the "normal" inference, since playback takes longer than synthesis anyway. With a GPU and DeepSpeed, the TTS RTF was 0.34 in my test with Coqui XTTS (Silero or Piper are even faster), so there is plenty of free processing time available. I do understand that it is not the same for everyone, but that's true of everything regarding AI: some have the hardware to take advantage of all the options and bigger models, and some don't. As with almost all features, it should definitely be optional, so it won't affect users who can't use it, while everyone else gets a much better experience.
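To make the RTF argument concrete, here's a back-of-the-envelope check; the per-sentence audio durations are made-up numbers, and 0.34 is the RTF from the XTTS test mentioned above:

```python
RTF = 0.34                               # seconds of compute per second of audio
chunk_durations = [4.0, 6.0, 5.0, 7.0]   # hypothetical per-sentence audio lengths (s)

synth_done = 0.0   # wall-clock time when the current chunk finishes synthesizing
play_done = 0.0    # wall-clock time when playback of the previous chunks ends
for i, dur in enumerate(chunk_durations):
    synth_done += dur * RTF              # chunks are synthesized back to back
    start = max(synth_done, play_done)   # playback waits only if audio isn't ready
    play_done = start + dur
    print(f"chunk {i}: ready {synth_done:.2f}s, plays {start:.2f}-{play_done:.2f}s")

# For these numbers only the first chunk causes a wait (~1.4 s); every later
# chunk is ready before the previous one has finished playing.
```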

erew123 commented 9 months ago

100% completely agree! I just thought it was worth adding a few thoughts around it so that if anything gets developed, everyone's needs can be covered and we can choose what flavour works for us personally.

github-actions[bot] commented 8 months ago

This issue has been closed due to inactivity for 6 weeks. If you believe it is still relevant, please leave a comment below. You can tag a developer in your comment.