triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Large latency when using `tritonclient.http.aio.infer` #7343

Open gesanqiu opened 3 months ago

gesanqiu commented 3 months ago

Description
When inferring with `response = await client.infer()`, the Triton server takes a long time to release the output. More precisely, the server holds the request's output buffer for a long time before setting the state from EXECUTING to RELEASED and then responding: a sync infer takes only 2 seconds, but an awaited async infer takes 5 seconds.
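For reference, the call pattern in question looks roughly like this (a minimal sketch, not my exact script; the model name, tensor shapes, and default HTTP port are taken from the logs below, everything else is assumed):

```python
import asyncio

import numpy as np
import tritonclient.http.aio as aioclient
from tritonclient.http import InferInput, InferRequestedOutput

async def main():
    client = aioclient.InferenceServerClient(url="localhost:8000")
    # Input matching the model: FP32, shape [1, 3, 512, 512]
    inp = InferInput("input_0", [1, 3, 512, 512], "FP32")
    inp.set_data_from_numpy(np.zeros((1, 3, 512, 512), dtype=np.float32))
    # This await is where the extra seconds show up compared to the sync client
    response = await client.infer(
        "SwinIR_realSR_s64w8_4x_512x512",
        inputs=[inp],
        outputs=[InferRequestedOutput("output_0")],
    )
    print(response.as_numpy("output_0").shape)  # (1, 3, 2048, 2048)
    await client.close()

asyncio.run(main())
```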

========= sync infer log: ~1 second until HTTP release =========

I0612 01:24:07.399564 767 http_server.cc:4522] HTTP request: 2 /v2/models/SwinIR_realSR_s64w8_4x_512x512/infer
I0612 01:24:07.399802 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0612 01:24:07.399826 767 infer_request.cc:900] [request id: <id_unknown>] prepared: [0x0x7f9fd4002ad0] request id: , model: SwinIR_realSR_s64w8_4x_512x512, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f9fd4004978] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
override inputs:
inputs:
[0x0x7f9fd4004978] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
original requested outputs:
requested outputs:
output_0

I0612 01:24:07.399865 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0612 01:24:07.400051 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0612 01:24:07.400112 767 tensorrt.cc:390] model SwinIR_realSR_s64w8_4x_512x512, instance SwinIR_realSR_s64w8_4x_512x512_0, executing 1 requests
I0612 01:24:07.400129 767 instance_state.cc:361] TRITONBACKEND_ModelExecute: Issuing SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:24:07.400143 767 instance_state.cc:410] TRITONBACKEND_ModelExecute: Running SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:24:07.400270 767 instance_state.cc:1450] Optimization profile default [0] is selected for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:24:07.400324 767 pinned_memory_manager.cc:198] pinned memory allocation: size 3145728, addr 0x7fa704000090
I0612 01:24:07.400921 767 instance_state.cc:911] Context with profile default [0] is being executed for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:24:08.091022 767 infer_response.cc:174] add response output: output: output_0, type: FP32, shape: [1,3,2048,2048]
I0612 01:24:08.091061 767 http_server.cc:1217] HTTP: unable to provide 'output_0' in GPU, will use CPU
I0612 01:24:08.091093 767 http_server.cc:1237] HTTP using buffer for: 'output_0', size: 50331648, addr: 0x7f9f8bfff040
I0612 01:24:08.091108 767 pinned_memory_manager.cc:198] pinned memory allocation: size 50331648, addr 0x7fa7043000a0
I0612 01:24:09.044377 767 http_server.cc:1311] HTTP release: size 50331648, addr 0x7f9f8bfff040
I0612 01:24:09.044443 767 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0612 01:24:09.044455 767 instance_state.cc:1307] TRITONBACKEND_ModelExecute: model SwinIR_realSR_s64w8_4x_512x512_0 released 1 requests
I0612 01:24:09.044461 767 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7fa7043000a0
I0612 01:24:09.044474 767 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7fa704000090

========= async infer log: ~4 seconds until HTTP release =========

I0612 01:43:29.902539 814 http_server.cc:4522] HTTP request: 2 /v2/models/SwinIR_realSR_s64w8_4x_512x512/infer
I0612 01:43:29.902675 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to INITIALIZED
I0612 01:43:29.902687 814 infer_request.cc:900] [request id: <id_unknown>] prepared: [0x0x7f400c003200] request id: , model: SwinIR_realSR_s64w8_4x_512x512, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7f400c013c58] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
override inputs:
inputs:
[0x0x7f400c013c58] input: input_0, type: FP32, original shape: [1,3,512,512], batch + shape: [1,3,512,512], shape: [3,512,512]
original requested outputs:
requested outputs:
output_0

I0612 01:43:29.902714 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from INITIALIZED to PENDING
I0612 01:43:29.902825 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from PENDING to EXECUTING
I0612 01:43:29.902941 814 tensorrt.cc:390] model SwinIR_realSR_s64w8_4x_512x512, instance SwinIR_realSR_s64w8_4x_512x512_0, executing 1 requests
I0612 01:43:29.902997 814 instance_state.cc:361] TRITONBACKEND_ModelExecute: Issuing SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:43:29.903022 814 instance_state.cc:410] TRITONBACKEND_ModelExecute: Running SwinIR_realSR_s64w8_4x_512x512_0 with 1 requests
I0612 01:43:29.903180 814 instance_state.cc:1450] Optimization profile default [0] is selected for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:43:29.903245 814 pinned_memory_manager.cc:198] pinned memory allocation: size 3145728, addr 0x7f4736000090
I0612 01:43:29.903906 814 instance_state.cc:911] Context with profile default [0] is being executed for SwinIR_realSR_s64w8_4x_512x512_0
I0612 01:43:30.594829 814 infer_response.cc:174] add response output: output: output_0, type: FP32, shape: [1,3,2048,2048]
I0612 01:43:30.594875 814 http_server.cc:1217] HTTP: unable to provide 'output_0' in GPU, will use CPU
I0612 01:43:30.594912 814 http_server.cc:1237] HTTP using buffer for: 'output_0', size: 50331648, addr: 0x7f3fd3fff040
I0612 01:43:30.594927 814 pinned_memory_manager.cc:198] pinned memory allocation: size 50331648, addr 0x7f47363000a0
I0612 01:43:34.582690 814 http_server.cc:1311] HTTP release: size 50331648, addr 0x7f3fd3fff040
I0612 01:43:34.582782 814 infer_request.cc:131] [request id: <id_unknown>] Setting state from EXECUTING to RELEASED
I0612 01:43:34.582798 814 instance_state.cc:1307] TRITONBACKEND_ModelExecute: model SwinIR_realSR_s64w8_4x_512x512_0 released 1 requests
I0612 01:43:34.582807 814 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7f47363000a0
I0612 01:43:34.582828 814 pinned_memory_manager.cc:226] pinned memory deallocation: addr 0x7f4736000090

Triton Information
I'm using the Triton container, Docker image nvcr.io/nvidia/tritonserver:24.03-py3 (not a custom build).

To Reproduce
Start a Triton server, then perform a sync and an async infer request separately; see the timing sketch below. The model is SwinIR (TensorRT backend, FP32 input [1,3,512,512], FP32 output [1,3,2048,2048]); you can download it from here.

Expected behavior
The async client should respond as quickly as the sync client.
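A rough way to reproduce the timing gap (a sketch under the same assumptions as the snippet above; both clients issue an identical request and only wall-clock time is compared):

```python
import asyncio
import time

import numpy as np
import tritonclient.http as httpclient
import tritonclient.http.aio as aioclient
from tritonclient.http import InferInput, InferRequestedOutput

MODEL = "SwinIR_realSR_s64w8_4x_512x512"

def make_io():
    # Same dummy request for both clients
    inp = InferInput("input_0", [1, 3, 512, 512], "FP32")
    inp.set_data_from_numpy(np.zeros((1, 3, 512, 512), dtype=np.float32))
    return [inp], [InferRequestedOutput("output_0")]

def sync_infer():
    client = httpclient.InferenceServerClient(url="localhost:8000")
    inputs, outputs = make_io()
    start = time.perf_counter()
    client.infer(MODEL, inputs=inputs, outputs=outputs)
    print(f"sync:  {time.perf_counter() - start:.2f} s")

async def async_infer():
    client = aioclient.InferenceServerClient(url="localhost:8000")
    inputs, outputs = make_io()
    start = time.perf_counter()
    await client.infer(MODEL, inputs=inputs, outputs=outputs)
    print(f"async: {time.perf_counter() - start:.2f} s")
    await client.close()

sync_infer()
asyncio.run(async_infer())
```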

tangxueduo commented 3 months ago

Has this been solved? I ran into the same issue.

gesanqiu commented 1 month ago

> Has this been solved? I ran into the same issue.

You can try the gRPC client; it works for me.
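For anyone trying that route, the async gRPC client follows the same pattern as the HTTP one (a sketch; assumes the default gRPC port 8001, and note that InferInput/InferRequestedOutput come from tritonclient.grpc here rather than tritonclient.http):

```python
import asyncio

import numpy as np
import tritonclient.grpc.aio as grpcclient
from tritonclient.grpc import InferInput, InferRequestedOutput

async def main():
    # gRPC endpoint instead of HTTP
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    inp = InferInput("input_0", [1, 3, 512, 512], "FP32")
    inp.set_data_from_numpy(np.zeros((1, 3, 512, 512), dtype=np.float32))
    response = await client.infer(
        "SwinIR_realSR_s64w8_4x_512x512",
        inputs=[inp],
        outputs=[InferRequestedOutput("output_0")],
    )
    print(response.as_numpy("output_0").shape)
    await client.close()

asyncio.run(main())
```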

nightflight-dk commented 1 month ago

gRPC is not well supported on Azure ML today, so addressing the HTTP latency issue would be very desirable.

benjaminlinken commented 2 weeks ago

You can set the header to {"Accept-Encoding": ""} when using the asynchronous HTTP interface; this can effectively reduce the latency.
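That would be consistent with aiohttp (which the aio client is built on) sending `Accept-Encoding: gzip, deflate` by default, so the server spends extra time compressing the ~48 MB output tensor before the HTTP release; an empty value opts out of that negotiation. This is my reading, not confirmed in this thread. A sketch, assuming `infer()` on the aio client forwards a `headers` keyword the way the sync client does:

```python
import asyncio

import numpy as np
import tritonclient.http.aio as aioclient
from tritonclient.http import InferInput, InferRequestedOutput

async def main():
    client = aioclient.InferenceServerClient(url="localhost:8000")
    inp = InferInput("input_0", [1, 3, 512, 512], "FP32")
    inp.set_data_from_numpy(np.zeros((1, 3, 512, 512), dtype=np.float32))
    # Empty Accept-Encoding asks the server not to compress the large
    # response body (assumed here to be the source of the extra latency)
    response = await client.infer(
        "SwinIR_realSR_s64w8_4x_512x512",
        inputs=[inp],
        outputs=[InferRequestedOutput("output_0")],
        headers={"Accept-Encoding": ""},
    )
    print(response.as_numpy("output_0").shape)
    await client.close()

asyncio.run(main())
```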