triton-inference-server / fastertransformer_backend

BSD 3-Clause "New" or "Revised" License

How to terminate a grpc streaming request immediately during tritonserver inference with a FasterTransformer backend? #139

Open songkq opened 1 year ago

songkq commented 1 year ago

In a production environment like ChatGPT, terminating a conversation early on a user-client command is a major requirement. I'm wondering whether a gRPC streaming request can be terminated immediately during Triton Server inference with a FasterTransformer backend. Could you please give some advice?

with grpcclient.InferenceServerClient(self.model_url) as client:
    client.start_stream(callback=partial(stream_callback, result_queue))
    client.async_stream_infer(self.model_name, request_data)
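On the client side, the `tritonclient.grpc` API does expose `stop_stream()`, which closes the gRPC stream so no further responses are delivered. Whether the FasterTransformer backend actually stops generating on the GPU once the stream closes is a separate question (that server-side cancellation is what this issue is really asking about). Below is a minimal sketch of the client-side pattern: a callback drains results into a queue, and the consumer calls `stop_stream()` to cancel early. `FakeStreamingClient` is a hypothetical stand-in for `grpcclient.InferenceServerClient` so the sketch runs without a server; it is not part of the real library.

```python
import threading
import time
from functools import partial
from queue import Queue

# Hypothetical stand-in for tritonclient.grpc.InferenceServerClient so this
# sketch is self-contained; the real client exposes the same three methods
# (start_stream, async_stream_infer, stop_stream) used here.
class FakeStreamingClient:
    def __init__(self):
        self._callback = None
        self._stop = threading.Event()
        self._worker = None

    def start_stream(self, callback):
        self._callback = callback

    def async_stream_infer(self, model_name, request_data):
        def produce():
            # A streaming LLM backend sends one response per generated token.
            for token in ["A", "B", "C", "D", "E"]:
                if self._stop.is_set():
                    return
                self._callback(result=token, error=None)
                time.sleep(0.05)
        self._worker = threading.Thread(target=produce)
        self._worker.start()

    def stop_stream(self):
        # Closes the stream: no more responses reach the callback.
        self._stop.set()
        if self._worker:
            self._worker.join()


def stream_callback(result_queue, result, error):
    # Forward each streamed result (or error) to the consumer.
    result_queue.put(error if error else result)


results = Queue()
client = FakeStreamingClient()
client.start_stream(callback=partial(stream_callback, results))
client.async_stream_infer("fastertransformer", {"prompt": "hi"})

first = results.get()   # user reads the first token ...
client.stop_stream()    # ... then cancels the conversation early

received = [first]
while not results.empty():
    received.append(results.get())
print(received[0], len(received) < 5)
```

Caveat: with the real FasterTransformer backend, closing the stream only stops delivery to the client; the in-flight generation may still run to completion on the server unless the backend itself supports per-request cancellation.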
bigmover commented 1 year ago

Maybe `async_stream_infer` needs a `package_input`?