triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

Can I stop execution? (w/ `decoupled mode`) #162

Open Yeom opened 1 year ago

Yeom commented 1 year ago

Description

Docker: nvcr.io/nvidia/tritonserver:23.04-py3
Gpu: A100

How can I stop bi-directional streaming (decoupled mode)?
- I want to stop model inference (the streaming response) when the user disconnects, or based on certain conditions, but I don't know how to do that at the moment.

Reference
- https://github.com/triton-inference-server/server/issues/4344
- https://github.com/triton-inference-server/server/issues/5833#issuecomment-1561318646

Reproduced Steps

-
shanekong commented 1 year ago

I've hit a similar problem: if the FT server encounters a stop token during generation, but the number of tokens generated so far is shorter than max_new_tokens, the server keeps replying with the same result instead of closing the stream.

client.stop_stream() is called, but it blocks until the result's length equals max_new_tokens.

Is there any way to get out of this?
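For context, a minimal sketch of the client-side pattern being discussed, assuming the `tritonclient.grpc` streaming API (`start_stream(callback=...)`, `async_stream_infer(...)`, `stop_stream()`). The stop condition, model name, and inputs are placeholders; note that, as this thread reports, `stop_stream()` only tears down the client half of the stream and does not by itself interrupt generation on the FasterTransformer side.

```python
import queue

def stream_until(client, model_name, inputs, should_stop, timeout_s=1.0):
    """Consume streaming responses, closing the stream as soon as
    should_stop(collected_so_far) returns True.

    `client` is assumed to expose start_stream(callback=...),
    async_stream_infer(...), and stop_stream(), as
    tritonclient.grpc.InferenceServerClient does.
    """
    results = queue.Queue()

    def on_response(result, error):
        # The gRPC stream callback delivers either a result or an error.
        results.put(error if error is not None else result)

    client.start_stream(callback=on_response)
    client.async_stream_infer(model_name=model_name, inputs=inputs)

    collected = []
    try:
        while not should_stop(collected):
            try:
                item = results.get(timeout=timeout_s)
            except queue.Empty:
                continue  # re-check the stop condition periodically
            if isinstance(item, Exception):
                raise item
            collected.append(item)
    finally:
        # Closes the client side of the stream. Per this issue, the FT
        # backend may keep generating until max_new_tokens regardless.
        client.stop_stream()
    return collected
```

This only solves the client-side half of the question; stopping generation server-side (the actual ask in this issue) is not covered by this sketch.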