triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

GRPC Streaming, Stream on Client and One response from Server #6301

Open AniForU opened 1 year ago

AniForU commented 1 year ago

Is your feature request related to a problem? Please describe.
I aim to deploy my ASR model on a server that will receive audio packet bytes with each request. The server will then transcribe the incoming data incrementally and provide a response when it determines there are no more data packets.

Describe the solution you'd like
I intend to create a server-side script that will receive data in incoming requests and withhold the response until it detects that there are no more pending data packets to process.

Describe alternatives you've considered
I attempted to implement sequence batching but faced challenges with that approach. As an alternative, I'm considering launching my own gRPC server within the model application.

Additional context
I am looking to make my gRPC server method look like the following:

rpc RecordRoute(stream Point) returns (RouteSummary) {}

oandreeva-nv commented 1 year ago

Hi @AniForU, could you please clarify what issues you are having with Triton? Which version are you using?

AniForU commented 1 year ago

Hi @oandreeva-nv, I need to develop a backend model that can continually accumulate incoming audio chunks, transcribe them on the server side in real time, and only send the complete transcript back to the client when the model decides it's necessary.

To simulate this scenario, I've created a Python backend model. It accepts numbers as input, stores them in a dictionary keyed by sequence ID so they can be shared across requests, and computes the sum of the numbers received within a single sequence. The model returns the sum when it detects that the sequence END flag has been set to True.
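Roughly, the model looks like the sketch below (simplified; the tensor names, dtypes, and the mapping of the sequence batcher's control inputs to "CORRID", "START", and "END" tensors are placeholders rather than my exact configuration):

```python
# model.py (simplified sketch; tensor and control-input names are illustrative)
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # Running sums keyed by correlation (sequence) ID.
        self.accumulators = {}

    def execute(self, requests):
        responses = []
        for request in requests:
            # Sequence batcher control inputs, assumed to be mapped to
            # "CORRID", "START", and "END" input tensors in config.pbtxt.
            corr_id = int(pb_utils.get_input_tensor_by_name(request, "CORRID").as_numpy().flatten()[0])
            start = bool(pb_utils.get_input_tensor_by_name(request, "START").as_numpy().flatten()[0])
            end = bool(pb_utils.get_input_tensor_by_name(request, "END").as_numpy().flatten()[0])
            value = int(pb_utils.get_input_tensor_by_name(request, "INPUT").as_numpy().flatten()[0])

            if start:
                self.accumulators[corr_id] = 0
            self.accumulators[corr_id] = self.accumulators.get(corr_id, 0) + value

            if end:
                # Last request of the sequence: return the accumulated sum.
                total = self.accumulators.pop(corr_id)
                out = pb_utils.Tensor("OUTPUT", np.array([total], dtype=np.int32))
            else:
                # Intermediate requests carry nothing useful yet, but execute()
                # still has to return one response per request, so send a dummy.
                out = pb_utils.Tensor("OUTPUT", np.array([0], dtype=np.int32))

            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```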

I've also implemented sequence batching in Triton to ensure that all requests with the same sequence ID are routed to the same model instance. However, I've run into an issue: the execute method expects one response per request, so I have to return empty responses for the intermediate requests. Similarly, on the client side, I have to iterate through all of those responses.
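On the client side it currently looks roughly like this (simplified; the model name, tensor names, and dtypes are placeholders):

```python
# client.py (simplified sketch; model name, tensor names, and dtypes are placeholders)
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")

sequence_id = 1001
values = [1, 2, 3, 4]

for i, v in enumerate(values):
    inp = grpcclient.InferInput("INPUT", [1], "INT32")
    inp.set_data_from_numpy(np.array([v], dtype=np.int32))

    result = client.infer(
        model_name="accumulate",
        inputs=[inp],
        sequence_id=sequence_id,
        sequence_start=(i == 0),
        sequence_end=(i == len(values) - 1),
    )
    # Every request gets a response back, even though only the last one
    # (sequence_end=True) carries the sum I actually care about.
    print(result.as_numpy("OUTPUT"))
```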

Could you suggest how I can make it not mandatory to send a response back for every request?