One way of doing this could be to batch the generation, but only return the first response and cache the others. If the same request is made again, just return the next cached response instead of generating again. That would make it work with any front-end automatically, by just regenerating normally.
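A minimal sketch of what that could look like, assuming a hypothetical `generate_batch()` call into whatever batched backend is available (all names here are illustrative, not the WebUI's actual API):

```python
# Hypothetical sketch of the "batch, cache, and replay" idea: generate a batch
# per unique request, hand back one response at a time, and let an ordinary
# "regenerate" from any frontend drain the cache before generating again.
import hashlib
import json
from collections import defaultdict

BATCH_SIZE = 4
_cache = defaultdict(list)  # request fingerprint -> responses not yet served


def _request_key(prompt: str, params: dict) -> str:
    # Fingerprint the request so an identical regenerate maps to the same cache slot.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def generate_batch(prompt: str, params: dict, n: int) -> list[str]:
    # Placeholder for the real batched backend call (vLLM, llama.cpp, exllamav2, ...).
    return [f"response {i} for: {prompt}" for i in range(n)]


def generate(prompt: str, params: dict) -> str:
    key = _request_key(prompt, params)
    if not _cache[key]:
        _cache[key] = generate_batch(prompt, params, n=BATCH_SIZE)
    return _cache[key].pop(0)
```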
Losing streaming would be rough, is that a fundamental limitation?
> One way of doing this could be to batch the generation, but only return the first response and cache the others. If the same request is made again, just return the next cached response instead of generating again. That would make it work with any front-end automatically, by just regenerating normally.
I disagree. You should have all options at once, else you'll just waste compute 90% of the time. Perhaps something like SillyTavern's "swipes" would be better than a KAI multi-choice?
> Losing streaming would be rough, is that a fundamental limitation?
Every implementation I've used hasn't allowed text streaming during batched text generation. If it's not outright impossible, it's probably very difficult to implement in a sane way. The WebUI already has severe performance issues when updating large buffers during text streaming; I don't think it's worth exacerbating that with 4 or 8 simultaneous streams.
The UI wouldn't need to show every stream in parallel, just the currently-selected response.
> The UI wouldn't need to show every stream in parallel, just the currently-selected response.
I actually just set up vLLM w/ SillyTavern to test this and it can in fact stream the active swipe while the others generate in the background. BF16 Llama3 8B runs @ 96 T/S with a batch of 4 on a 7900 XTX.
Bumping the batch up to 16, I can hit almost 380 T/S concurrent, or about 24 T/S per response.
Offering it through the API would also help when the WebUI is used as the server for Arrows: https://github.com/p-e-w/arrows
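For reference, a rough sketch of how that looks against a vLLM OpenAI-compatible endpoint: request `n=4` completions in one batched request, stream only the active choice, and collect the rest for later swipes. The base URL, model name, and prompt below are placeholders, and I haven't checked that every vLLM version streams `n>1` identically:

```python
# Rough sketch: ask a vLLM OpenAI-compatible server for 4 completions in one
# batched request, but only print the stream for the "active swipe" (index 0);
# the remaining choices are accumulated silently for later swiping.
from collections import defaultdict

from openai import OpenAI

# Placeholder local endpoint and model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

swipes = defaultdict(str)  # choice index -> accumulated text
ACTIVE = 0

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Write a one-line greeting.",
    n=4,              # batch of 4 candidate replies in a single request
    max_tokens=64,
    stream=True,
)

for chunk in stream:
    for choice in chunk.choices:
        swipes[choice.index] += choice.text or ""
        if choice.index == ACTIVE:
            print(choice.text or "", end="", flush=True)

# swipes[1..3] now hold the alternate replies, ready to show on "swipe".
```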
Exllama2 and llama.cpp have both been working on batched throughput lately. Using llama.cpp's `batched` binary with `n_parallel=8`, I can generate 8 unique replies at 450 T/S compared to 70 T/S with `n_parallel=1`. This effectively means you can create 8 responses for about 1.25x the latency of a single one.

I feel like this would be a good feature to utilize in the WebUI's API and Gradio frontend. Way back in the day, the OG KoboldAI had a feature to generate multiple replies at once and present them as a choice. Having something like that, but accelerated to the point of being almost the speed of a single reply, would be outstanding.
The only real caveat, as far as I know, is that you can't stream tokens this way.
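For anyone checking the arithmetic behind the 1.25x figure, it falls out of the two measured aggregate throughputs (a quick sketch, not additional measurements):

```python
# The 70 and 450 T/S figures are measured; the per-reply and latency numbers are derived.
single_tps = 70      # tokens/s with n_parallel=1
batched_tps = 450    # aggregate tokens/s with n_parallel=8
n_parallel = 8

per_reply_tps = batched_tps / n_parallel       # ~56 T/S per individual reply
relative_latency = single_tps / per_reply_tps  # ~1.24x the time of a single reply

print(f"{per_reply_tps:.1f} T/S per reply, {relative_latency:.2f}x single-reply latency")
```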