One way of doing this could be to batch the generation, but only return the first response and cache the others. If the same request is made again, just return the next cached response instead of generating again. That would make it work with any front-end automatically, by just regenerating normally.
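A minimal sketch of what that could look like, assuming a hypothetical `generate_batch()` call into whatever batched backend is available (all names here are illustrative, not the WebUI's actual API):

```python
# Hypothetical sketch of the "batch, cache, and replay" idea: generate a batch
# per unique request, hand back one response at a time, and let an ordinary
# "regenerate" from any frontend drain the cache before generating again.
import hashlib
import json
from collections import defaultdict

BATCH_SIZE = 4
_cache = defaultdict(list)  # request fingerprint -> responses not yet served


def _request_key(prompt: str, params: dict) -> str:
    # Fingerprint the request so an identical regenerate maps to the same cache slot.
    payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def generate_batch(prompt: str, params: dict, n: int) -> list[str]:
    # Placeholder for the real batched backend call (vLLM, llama.cpp, exllamav2, ...).
    return [f"response {i} for: {prompt}" for i in range(n)]


def generate(prompt: str, params: dict) -> str:
    key = _request_key(prompt, params)
    if not _cache[key]:
        _cache[key] = generate_batch(prompt, params, n=BATCH_SIZE)
    return _cache[key].pop(0)
```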
Losing streaming would be rough, is that a fundamental limitation?
> One way of doing this could be to batch the generation, but only return the first response and cache the others. If the same request is made again, just return the next cached response instead of generating again. That would make it work with any front-end automatically, by just regenerating normally.
I disagree. You should have all options at once, else you'll just waste compute 90% of the time. Perhaps something like SillyTavern's "swipes" would be better than a KAI multi-choice?
> Losing streaming would be rough, is that a fundamental limitation?
Every implementation I've used hasn't allowed text streaming during batched text generation. If it's not outright impossible, it's probably very difficult to implement in a sane way. The WebUI already has severe performance issues when updating large buffers during text streaming; I don't think it's worth exacerbating that with 4 or 8 simultaneous streams.
The UI wouldn't need to show every stream in parallel, just the currently-selected response.
> The UI wouldn't need to show every stream in parallel, just the currently-selected response.
I actually just set up vLLM w/ SillyTavern to test this and it can in fact stream the active swipe while the others generate in the background. BF16 Llama3 8B runs @ 96 T/S with a batch of 4 on a 7900 XTX.
Bumping the batch up to 16, I can hit almost 380 T/S concurrent, or about 24 T/S per response.
Offering it through the API would also help when the WebUI is used as the server for Arrows: https://github.com/p-e-w/arrows
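For reference, a rough sketch of how that looks against a vLLM OpenAI-compatible endpoint: request `n=4` completions in one batched request, stream only the active choice, and collect the rest for later swipes. The base URL, model name, and prompt below are placeholders, and I haven't checked that every vLLM version streams `n>1` identically:

```python
# Rough sketch: ask a vLLM OpenAI-compatible server for 4 completions in one
# batched request, but only print the stream for the "active swipe" (index 0);
# the remaining choices are accumulated silently for later swiping.
from collections import defaultdict

from openai import OpenAI

# Placeholder local endpoint and model name.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

swipes = defaultdict(str)  # choice index -> accumulated text
ACTIVE = 0

stream = client.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    prompt="Write a one-line greeting.",
    n=4,              # batch of 4 candidate replies in a single request
    max_tokens=64,
    stream=True,
)

for chunk in stream:
    for choice in chunk.choices:
        swipes[choice.index] += choice.text or ""
        if choice.index == ACTIVE:
            print(choice.text or "", end="", flush=True)

# swipes[1..3] now hold the alternate replies, ready to show on "swipe".
```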
Exllama2 and llama.cpp have both been working on batched throughput lately. Using llama.cpp's `batched` binary with `n_parallel=8`, I can generate 8 unique replies at 450 T/S compared to 70 T/S with `n_parallel=1`. This effectively means you can create 8 responses for about 1.25x the latency of a single one.

I feel like this would be a good feature to utilize in the WebUI's API and Gradio frontend. Way back in the day, the OG KoboldAI had a feature to generate multiple replies at once and present them as a choice. Having something like that, but accelerated to the point of being almost the speed of a single reply, would be outstanding.
The only real caveat, as far as I know, is that you can't stream tokens this way.
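For anyone checking the arithmetic behind the 1.25x figure, it falls out of the two measured aggregate throughputs (a quick sketch, not additional measurements):

```python
# The 70 and 450 T/S figures are measured; the per-reply and latency numbers are derived.
single_tps = 70      # tokens/s with n_parallel=1
batched_tps = 450    # aggregate tokens/s with n_parallel=8
n_parallel = 8

per_reply_tps = batched_tps / n_parallel       # ~56 T/S per individual reply
relative_latency = single_tps / per_reply_tps  # ~1.24x the time of a single reply

print(f"{per_reply_tps:.1f} T/S per reply, {relative_latency:.2f}x single-reply latency")
```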