triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.
BSD 3-Clause "New" or "Revised" License

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 270080, got 261760 #95

Open piekey1994 opened 2 years ago

piekey1994 commented 2 years ago

Description

When I enabled max_queue_delay_microseconds to improve the response speed of the model, I found that errors occurred occasionally. I set max_queue_delay_microseconds to 70000 and then sent three tensors of different lengths to the service at the same time. The first request succeeded, but the other two failed. If I don't configure max_queue_delay_microseconds, all three always succeed. config.pbtxt:


name: "encoder"
backend: "onnxruntime"
default_model_filename: "encoder.onnx"

max_batch_size: 32
input [
  {
    name: "speech"
    data_type: TYPE_FP32
    dims: [-1, 80]
  },
  {
    name: "speech_lengths"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [ ] }
  }
]

output [
  {
    name: "encoder_out_lens"
    data_type: TYPE_INT32
    dims: [1]
    reshape: { shape: [ ] }
  },
  {
    name: "beam_log_probs"
    data_type: TYPE_FP32
    dims: [-1, 10]
  },
  {
    name: "beam_log_probs_idx"
    data_type: TYPE_INT64
    dims: [-1, 10]
  }
]

dynamic_batching {
    max_queue_delay_microseconds: 70000
}

instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]
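
For context, this is a minimal Python sketch of the kind of concurrent client traffic described above. The input names, dtypes, and sequence lengths come from the config and logs in this issue; the server URL, the random input data, and the result-collecting queue are assumptions.

import queue

import numpy as np
import tritonclient.grpc as grpcclient

results = queue.Queue()

def callback(result, error):
    # Collect either the server error or the completed response.
    results.put(error if error is not None else result)

client = grpcclient.InferenceServerClient(url="localhost:8001")

for length in (500, 422, 396):  # lengths taken from the Triton log below
    speech = np.random.rand(1, length, 80).astype(np.float32)
    speech_lengths = np.array([[length]], dtype=np.int32)

    inputs = [
        grpcclient.InferInput("speech", list(speech.shape), "FP32"),
        grpcclient.InferInput("speech_lengths", list(speech_lengths.shape), "INT32"),
    ]
    inputs[0].set_data_from_numpy(speech)
    inputs[1].set_data_from_numpy(speech_lengths)

    # async_infer lets all three requests sit in the queue together, so the
    # dynamic batcher can group them once max_queue_delay_microseconds expires.
    client.async_infer("encoder", inputs, callback=callback)

for _ in range(3):
    print(results.get())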

triton log:


I0107 07:17:51.879632 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 1 step START
I0107 07:17:51.879667 48160 grpc_server.cc:3150] New request handler for ModelInferHandler, 4
I0107 07:17:51.879680 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.879691 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.879739 48160 infer_request.cc:547] prepared: [0x0x7fa4f40036a0] request id: , model: encoder, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fa4f4003aa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
[0x0x7fa4f4003968] input: speech, type: FP32, original shape: [1,500,80], batch + shape: [1,500,80], shape: [500,80]
override inputs:
inputs:
[0x0x7fa4f4003968] input: speech, type: FP32, original shape: [1,500,80], batch + shape: [1,500,80], shape: [500,80]
[0x0x7fa4f4003aa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
original requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens
requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens

I0107 07:17:51.881097 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 4 step START
I0107 07:17:51.881111 48160 grpc_server.cc:3150] New request handler for ModelInferHandler, 5
I0107 07:17:51.881116 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.881121 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.881133 48160 infer_request.cc:547] prepared: [0x0x7fa4f4026ba0] request id: , model: encoder, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fa4f4026f88] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
[0x0x7fa4f4026e68] input: speech, type: FP32, original shape: [1,422,80], batch + shape: [1,422,80], shape: [422,80]
override inputs:
inputs:
[0x0x7fa4f4026e68] input: speech, type: FP32, original shape: [1,422,80], batch + shape: [1,422,80], shape: [422,80]
[0x0x7fa4f4026f88] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
original requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens
requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens

I0107 07:17:51.881257 48160 onnxruntime.cc:2325] model encoder, instance encoder_0, executing 1 requests
I0107 07:17:51.881270 48160 onnxruntime.cc:1277] TRITONBACKEND_ModelExecute: Running encoder_0 with 1 requests
I0107 07:17:51.883291 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 5 step START
I0107 07:17:51.883303 48160 grpc_server.cc:3150] New request handler for ModelInferHandler, 6
I0107 07:17:51.883308 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.883314 48160 model_repository_manager.cc:615] GetInferenceBackend() 'encoder' version -1
I0107 07:17:51.883327 48160 infer_request.cc:547] prepared: [0x0x7fa4f4028bc0] request id: , model: encoder, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 1, priority: 0, timeout (us): 0
original inputs:
[0x0x7fa4f4028fa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
[0x0x7fa4f4028e88] input: speech, type: FP32, original shape: [1,396,80], batch + shape: [1,396,80], shape: [396,80]
override inputs:
inputs:
[0x0x7fa4f4028e88] input: speech, type: FP32, original shape: [1,396,80], batch + shape: [1,396,80], shape: [396,80]
[0x0x7fa4f4028fa8] input: speech_lengths, type: INT32, original shape: [1,1], batch + shape: [1], shape: []
original requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens
requested outputs:
beam_log_probs
beam_log_probs_idx
encoder_out_lens

2022-01-07 07:17:51.913799887 [I:onnxruntime:log, bfc_arena.cc:26 BFCArena] Creating BFCArena for Cuda with following configs: initial_chunk_size_bytes: 1048576 max_dead_bytes_per_chunk: 134217728 initial_growth_chunk_size_bytes: 2097152 memory limit: 18446744073709551615 arena_extend_strategy: 0
2022-01-07 07:17:51.913824649 [V:onnxruntime:log, bfc_arena.cc:62 BFCArena] Creating 21 bins of max chunk size 256 to 268435456
2022-01-07 07:17:51.913839024 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:9 (requested) num_bytes: 160000 (actual) rounded_bytes:160000
2022-01-07 07:17:51.914845468 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1048576 bytes.
2022-01-07 07:17:51.914854356 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 1048576
2022-01-07 07:17:51.914859897 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa4c6800000 to 0x7fa4c6900000
2022-01-07 07:17:51.915355558 [I:onnxruntime:, sequential_executor.cc:155 Execute] Begin execution
2022-01-07 07:17:51.915486867 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:16 (requested) num_bytes: 19888128 (actual) rounded_bytes:19888128
2022-01-07 07:17:51.916460004 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 33554432 bytes.
2022-01-07 07:17:51.916473358 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 34603008
2022-01-07 07:17:51.916478883 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa4c4000000 to 0x7fa4c6000000
2022-01-07 07:17:51.968223778 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.088785964 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for Cuda. bin_num:14 (requested) num_bytes: 4825088 (actual) rounded_bytes:4825088
2022-01-07 07:17:53.089515990 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 33554432 bytes.
2022-01-07 07:17:53.089527911 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 68157440
2022-01-07 07:17:53.089534390 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa492000000 to 0x7fa494000000
2022-01-07 07:17:53.089938002 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.162083719 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for CUDA_CPU. bin_num:0 (requested) num_bytes: 32 (actual) rounded_bytes:256
2022-01-07 07:17:53.162112760 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1048576 bytes.
2022-01-07 07:17:53.162119377 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 1048576
2022-01-07 07:17:53.162125287 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa456649a80 to 0x7fa456749a80
2022-01-07 07:17:53.164154593 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.169864812 [I:onnxruntime:log, bfc_arena.cc:306 AllocateRawInternal] Extending BFCArena for CudaPinned. bin_num:0 (requested) num_bytes: 16 (actual) rounded_bytes:256
2022-01-07 07:17:53.169924722 [I:onnxruntime:log, bfc_arena.cc:186 Extend] Extended allocation by 1048576 bytes.
2022-01-07 07:17:53.169931112 [I:onnxruntime:log, bfc_arena.cc:189 Extend] Total allocated bytes: 1048576
2022-01-07 07:17:53.169936752 [I:onnxruntime:log, bfc_arena.cc:192 Extend] Allocated memory at 0x7fa536e00400 to 0x7fa536f00400
2022-01-07 07:17:53.170068013 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.409086735 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.412016989 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.414773028 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.613093211 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.624830871 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.636706915 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.760878275 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.768211342 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.774060870 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.892082624 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.902492334 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:53.924870691 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.050035890 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.053538891 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.056429109 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.182095031 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.184785148 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.186824910 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.330282315 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.350548533 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.353424064 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.482296545 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.484997936 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.487071625 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.607597040 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.610282649 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.612668352 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.733027130 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.736369952 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.739264251 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.857939191 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.861231945 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:54.912998291 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
2022-01-07 07:17:55.031050395 [I:onnxruntime:log, bfc_arena.cc:256 Reserve] Reserving memory in BFCArena for Cuda size: 33554432
I0107 07:17:55.133362 48160 infer_response.cc:165] add response output: output: beam_log_probs, type: FP32, shape: [1,124,10]
I0107 07:17:55.133413 48160 grpc_server.cc:2286] GRPC: using buffer for 'beam_log_probs', size: 4960, addr: 0x7fa455abc130
I0107 07:17:55.133427 48160 infer_response.cc:165] add response output: output: beam_log_probs_idx, type: INT64, shape: [1,124,10]
I0107 07:17:55.133435 48160 grpc_server.cc:2286] GRPC: using buffer for 'beam_log_probs_idx', size: 9920, addr: 0x7fa455abd4a0
I0107 07:17:55.133444 48160 infer_response.cc:165] add response output: output: encoder_out_lens, type: INT32, shape: [1]
I0107 07:17:55.133450 48160 grpc_server.cc:2286] GRPC: using buffer for 'encoder_out_lens', size: 4, addr: 0x7fa455ab8200
I0107 07:17:55.133460 48160 grpc_server.cc:3310] ModelInferHandler::InferResponseComplete, 1 step ISSUED
I0107 07:17:55.133482 48160 grpc_server.cc:2321] GRPC free: size 4960, addr 0x7fa455abc130
I0107 07:17:55.133487 48160 grpc_server.cc:2321] GRPC free: size 9920, addr 0x7fa455abd4a0
I0107 07:17:55.133491 48160 grpc_server.cc:2321] GRPC free: size 4, addr 0x7fa455ab8200
I0107 07:17:55.133617 48160 grpc_server.cc:2879] ModelInferHandler::InferRequestComplete
I0107 07:17:55.133633 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 1 step COMPLETE
I0107 07:17:55.133680 48160 onnxruntime.cc:2325] model encoder, instance encoder_0, executing 2 requests
I0107 07:17:55.133687 48160 onnxruntime.cc:1277] TRITONBACKEND_ModelExecute: Running encoder_0 with 2 requests

I0107 07:17:55.133739 48160 pinned_memory_manager.cc:161] pinned memory allocation: size 261760, addr 0x7fa6f8000090
I0107 07:17:55.133733 48160 grpc_server.cc:2195] Done for ModelInferHandler, 1
I0107 07:17:55.133953 48160 grpc_server.cc:3310] ModelInferHandler::InferResponseComplete, 4 step ISSUED
I0107 07:17:55.134012 48160 grpc_server.cc:2879] ModelInferHandler::InferRequestComplete
I0107 07:17:55.134021 48160 grpc_server.cc:3310] ModelInferHandler::InferResponseComplete, 5 step ISSUED
I0107 07:17:55.134026 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 4 step COMPLETE
I0107 07:17:55.134036 48160 grpc_server.cc:2195] Done for ModelInferHandler, 4
I0107 07:17:55.134048 48160 grpc_server.cc:2879] ModelInferHandler::InferRequestComplete
I0107 07:17:55.134070 48160 pinned_memory_manager.cc:190] pinned memory deallocation: addr 0x7fa6f8000090
I0107 07:17:55.134077 48160 grpc_server.cc:3157] Process for ModelInferHandler, rpc_ok=1, 5 step COMPLETE
I0107 07:17:55.134084 48160 grpc_server.cc:2195] Done for ModelInferHandler, 5
I0107 07:21:34.168616 48276 pb_stub.cc:777] Non-graceful termination detected.
I0107 07:21:34.168644 48173 pb_stub.cc:777] Non-graceful termination detected.

client error log:

tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 270080, got 261760

Triton Information: triton server 21.11-py

After I turn off max_queue_delay_microseconds, Triton executes these three requests in sequence and everything is normal. However, with max_queue_delay_microseconds configured, it seems Triton eventually forced ONNX Runtime to handle two requests together and pinned the wrong amount of memory. As can be seen from the client log, my second request's shape is 142280 and its size is 270080, but ONNX Runtime can't get enough space because Triton pinned only 261760 bytes. This is very confusing.
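
Incidentally, the two sizes in the error line up exactly with a padding mismatch for a batch built from the second and third requests in the log above (shapes [1,422,80] and [1,396,80], FP32). This is only an observation from the posted numbers, not a confirmed root cause:

# Sizes implied by the log above; FP32 is 4 bytes per element.
FEAT_DIM, BYTES_PER_ELEM = 80, 4

padded = 2 * 422 * FEAT_DIM * BYTES_PER_ELEM      # both requests padded to the longer length
concat = (422 + 396) * FEAT_DIM * BYTES_PER_ELEM  # unpadded back-to-back concatenation

print(padded)  # 270080 -- the "expected" size in the error
print(concat)  # 261760 -- the size Triton actually pinned ("got")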

deadeyegoodwin commented 2 years ago

@piekey1994 Can you provide repro steps, models, etc. so that we can reproduce this issue?

deadeyegoodwin commented 2 years ago

FYI @askhade

S-GH commented 2 years ago

@piekey1994 I also have the same issue.

The issue disappeared after changing this line in the .pbtxt: dynamic_batching { max_queue_delay_microseconds: 70000 } -> dynamic_batching { }

I don't know why this worked; please let me know how it works.

My model is also an encoder with variable input data shapes, and it is also an onnxruntime model.

askhade commented 2 years ago

@piekey1994, @S-GH: Can you provide repro steps and the model to repro with?

piekey1994 commented 2 years ago

> @piekey1994 I also have the same issue.
>
> The issue disappeared after changing this line in the .pbtxt: dynamic_batching { max_queue_delay_microseconds: 70000 } -> dynamic_batching { }
>
> I don't know why this worked; please let me know how it works.
>
> My model is also an encoder with variable input data shapes, and it is also an onnxruntime model.

After changing to dynamic_batching { }, there is no error, but the speed is very slow, much slower than the pytorch backend. For now I have given up on the ONNX model; after converting it to the TorchScript format, the pytorch backend runs at normal speed with no errors.
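
For anyone trying the same workaround, below is a minimal sketch of exporting a model to TorchScript for the pytorch backend. DummyEncoder is a placeholder with the same input signature as the config above (a real export would use the actual encoder module), and the save path follows Triton's default model repository layout.

import torch

class DummyEncoder(torch.nn.Module):
    # Placeholder standing in for the real encoder, which is not shown in
    # this thread; it only mirrors the config's input signature.
    def forward(self, speech: torch.Tensor, speech_lengths: torch.Tensor):
        return speech_lengths, speech.mean(dim=-1, keepdim=True)

model = torch.jit.script(DummyEncoder().eval())
# Place the file under model_repository/encoder/1/model.pt and set
# backend: "pytorch" in config.pbtxt.
model.save("model.pt")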

piekey1994 commented 2 years ago

> @piekey1994, @S-GH: Can you provide repro steps and the model to repro with?

The model and configuration come from https://github.com/wenet-e2e/wenet/tree/main/runtime/server/x86_gpu.

piekey1994 commented 2 years ago

> @piekey1994, @S-GH: Can you provide repro steps and the model to repro with?

> @piekey1994 Can you provide repro steps, models, etc. so that we can reproduce this issue?

I shared my model and code at https://www.dropbox.com/s/8rjtrnod0xyt305/debug.zip?dl=0

1. sh run_log.sh
2. python debug_encoder.py

After several rounds of running, the bug may appear.