gesanqiu opened this issue 3 months ago
Is it solved? I ran into the same issue.
You can try the gRPC client; it works for me.
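For reference, a minimal sketch of the gRPC route (the URL localhost:8001, the model name swinir, and the tensor names INPUT/OUTPUT are assumptions; take the real names and shapes from your model's config.pbtxt):

```python
import numpy as np
import tritonclient.grpc as grpcclient

# Connect to Triton's gRPC endpoint (8001 by default).
client = grpcclient.InferenceServerClient(url="localhost:8001")

# Placeholder shape and tensor names -- replace with your model's.
data = np.random.rand(1, 3, 64, 64).astype(np.float32)
inputs = [grpcclient.InferInput("INPUT", list(data.shape), "FP32")]
inputs[0].set_data_from_numpy(data)
outputs = [grpcclient.InferRequestedOutput("OUTPUT")]

result = client.infer(model_name="swinir", inputs=inputs, outputs=outputs)
print(result.as_numpy("OUTPUT").shape)
```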
gRPC is not well supported on Azure ML today, so addressing the HTTP latency issue would be very desirable.
You can set the header {"Accept-Encoding": ""} when using the HTTP asynchronous interface; this can effectively reduce the interface latency.
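For example, with the async HTTP client (a sketch, not exact repro code; the model and tensor names are placeholders). Sending an empty Accept-Encoding header stops the client from advertising compression support, so the server returns the response body uncompressed:

```python
import asyncio
import numpy as np
import tritonclient.http.aio as aioclient
from tritonclient.http import InferInput

async def main():
    client = aioclient.InferenceServerClient(url="localhost:8000")

    data = np.random.rand(1, 3, 64, 64).astype(np.float32)  # placeholder shape
    inputs = [InferInput("INPUT", list(data.shape), "FP32")]  # placeholder name
    inputs[0].set_data_from_numpy(data)

    # Empty Accept-Encoding: don't advertise gzip/deflate support.
    result = await client.infer(
        model_name="swinir",  # placeholder model name
        inputs=inputs,
        headers={"Accept-Encoding": ""},
    )
    print(result.as_numpy("OUTPUT").shape)  # placeholder output name
    await client.close()

asyncio.run(main())
```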
Description
When inferring with response = await client.infer(), it takes a long time for the Triton server to release the output. To be more precise, the server holds the request's output buffer for a long time before setting the request state from EXECUTING to RELEASED, and only then responds: only 2 seconds for a sync infer, but 5 seconds for an awaited async infer.

=========sync infer log: cost 1 second for http release============
============async infer log: cost 4 seconds for http release==================
Triton Information
I'm using the Triton container, Docker image nvcr.io/nvidia/tritonserver:24.03-py3.
To Reproduce
Start a Triton server, then perform a sync and an async infer request separately. I'm using the SwinIR model; you can download it from here.

Expected behavior
The async client should respond as quickly as the sync client.
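A minimal sketch of that comparison (assuming the HTTP endpoint on localhost:8000; the model name swinir and tensor name INPUT are placeholders for whatever SwinIR's config.pbtxt declares):

```python
import asyncio
import time
import numpy as np
import tritonclient.http as httpclient
import tritonclient.http.aio as aioclient
from tritonclient.http import InferInput

MODEL = "swinir"  # placeholder model name
data = np.random.rand(1, 3, 64, 64).astype(np.float32)  # placeholder shape

def make_inputs():
    inp = InferInput("INPUT", list(data.shape), "FP32")  # placeholder tensor name
    inp.set_data_from_numpy(data)
    return [inp]

# Sync infer, timed end to end.
sync_client = httpclient.InferenceServerClient(url="localhost:8000")
t0 = time.perf_counter()
sync_client.infer(model_name=MODEL, inputs=make_inputs())
print(f"sync infer:  {time.perf_counter() - t0:.2f}s")

# Async infer, timed end to end.
async def async_infer():
    client = aioclient.InferenceServerClient(url="localhost:8000")
    t0 = time.perf_counter()
    await client.infer(model_name=MODEL, inputs=make_inputs())
    print(f"async infer: {time.perf_counter() - t0:.2f}s")
    await client.close()

asyncio.run(async_infer())
```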