hyperonym / basaran

Basaran is an open-source alternative to the OpenAI text completion API. It provides a compatible streaming API for your Hugging Face Transformers-based text generation models.
MIT License

Inference should stop if connection is aborted/closed #192

Closed josephrocca closed 1 year ago

josephrocca commented 1 year ago

For chat use cases on consumer hardware this is basically a show-stopper. Consumer on-device inference is quite slow, so the user needs to be able to stop a response if they don't like where a generation is headed. They can stop it in the chat UI (which aborts the HTTP request), but they still have to wait for the response to finish behind the scenes before their processor is free to write another one. I'm also not sure how they'd find out when it has actually finished, other than by watching their CPU usage.

peakji commented 1 year ago

Apologies for the late reply.

Basaran's event-stream implementation is based on Python generators. When a user terminates the request, the server does not keep generating the rest of the completion. It only finishes computing the token that is currently in progress, and then the computation stops.

You can verify this behavior by using the 'stop' button in the Basaran playground.
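
As a rough illustration of the generator behavior described above, here is a minimal sketch. It is not Basaran's actual code: `generate_tokens` and `stream_until_disconnect` are hypothetical names, the "model" is just a loop, and the client disconnect is simulated by breaking out of the loop and calling `close()` on the generator (which is what a WSGI server is expected to do with a response iterable when a request ends early).

```python
def generate_tokens(max_tokens, log):
    """Hypothetical stand-in for a token-by-token decoding loop."""
    try:
        for i in range(max_tokens):
            # Stands in for one forward pass of the model.
            log.append(f"computed token {i}")
            yield f"token-{i}"
    finally:
        # Runs when the generator is exhausted or closed early.
        log.append("generation stopped")


def stream_until_disconnect(max_tokens, disconnect_after):
    """Consume tokens, then simulate the client aborting the request."""
    log = []
    stream = generate_tokens(max_tokens, log)
    received = []
    for token in stream:
        received.append(token)
        if len(received) == disconnect_after:
            break  # assumption: the client went away at this point
    # Closing the generator raises GeneratorExit at the paused `yield`,
    # so no further tokens are computed.
    stream.close()
    return received, log


if __name__ == "__main__":
    received, log = stream_until_disconnect(max_tokens=100, disconnect_after=3)
    print(received)
    print(log)
```

In this sketch, only the tokens produced before the simulated disconnect are ever computed; the remaining iterations of the loop never run. In a real server the disconnect is typically noticed while a token is still being computed, which is why that one token finishes first.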

josephrocca commented 1 year ago

I just tried to reproduce the problem (all CPU cores staying at max for tens of seconds after stopping generation) and couldn't, so I must have accidentally had a request running in the background somewhere when I reported this. Sorry!!