Open: piekey1994 opened this issue 2 years ago
@piekey1994 Can you provide repro steps, models, etc. so that we can reproduce this issue?
FYI @askhade
@piekey1994 I also have the same issue, but it disappeared after changing this line in config.pbtxt:
dynamic_batching { max_queue_delay_microseconds: 70000 } -> dynamic_batching { }
I don't know why this worked, please let me know how it works. My model is also an encoder with varying input data shapes, on the onnxruntime backend.
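(Side note: to check whether a config change like this actually took effect, the loaded config can be queried from the running server. A small sketch with the tritonclient HTTP API; the model name is a placeholder:)

```python
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# "encoder" is a placeholder; use the actual model name from your repository.
config = client.get_model_config("encoder")
print(config.get("dynamic_batching"))  # shows max_queue_delay_microseconds etc., if configured
```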
@piekey1994, @S-GH: Can you provide repro steps and the model to repro with?
After changing to dynamic_batching { }, there is no error anymore, but the speed is very slow, much slower than the pytorch backend. For now I have given up on the onnx model: after converting it to TorchScript format, the speed with the pytorch backend is normal and there is no error.
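For reference, a generic sketch of exporting a model to TorchScript for the pytorch backend (the module and shapes below are placeholders; WeNet ships its own export scripts, which this does not reproduce):

```python
import torch

# Placeholder encoder; the real WeNet encoder also takes feature lengths and
# is exported by WeNet's own scripts, which this sketch does not reproduce.
class DummyEncoder(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(80, 256)

    def forward(self, speech: torch.Tensor) -> torch.Tensor:
        return self.proj(speech)

model = DummyEncoder().eval()

# torch.jit.script preserves control flow, which matters for variable-length
# inputs; torch.jit.trace would bake in the example shape instead.
scripted = torch.jit.script(model)
scripted.save("model.pt")  # goes under <model_repo>/<model_name>/1/model.pt for the pytorch backend
```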
The model and configuration come from https://github.com/wenet-e2e/wenet/tree/main/runtime/server/x86_gpu.
I shared my model and code at https://www.dropbox.com/s/8rjtrnod0xyt305/debug.zip?dl=0
1. sh run_log.sh
2. python debug_encoder.py
After several rounds of requests, the bug may appear.
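In case the zip link goes stale: the repro boils down to sending a few requests with different sequence lengths at the server concurrently, so that dynamic batching (with max_queue_delay_microseconds set) tries to combine them. A rough sketch with the tritonclient gRPC API; the model name, tensor names, and shapes are illustrative placeholders, not the exact values from debug_encoder.py:

```python
import threading
import numpy as np
import tritonclient.grpc as grpcclient

URL = "localhost:8001"
MODEL = "encoder"            # placeholder model name
INPUT_NAME = "speech"        # placeholder input tensor name
OUTPUT_NAME = "encoder_out"  # placeholder output tensor name

def send_request(seq_len):
    client = grpcclient.InferenceServerClient(url=URL)
    data = np.random.rand(1, seq_len, 80).astype(np.float32)  # variable-length input
    inp = grpcclient.InferInput(INPUT_NAME, list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    out = grpcclient.InferRequestedOutput(OUTPUT_NAME)
    try:
        result = client.infer(MODEL, inputs=[inp], outputs=[out])
        print(seq_len, "ok", result.as_numpy(OUTPUT_NAME).shape)
    except Exception as e:
        print(seq_len, "failed:", e)

# Fire three requests with different lengths at roughly the same time so that
# dynamic batching tries to combine them into one batch.
threads = [threading.Thread(target=send_request, args=(n,)) for n in (422, 409, 350)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```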
Description
When I enabled max_queue_delay_microseconds to improve the response speed of the model, I found that there were occasional errors. I set max_queue_delay_microseconds to 70000 and then sent three tensors of different lengths to the service at the same time. The first request succeeded and the other two failed. If I don't configure max_queue_delay_microseconds, all requests always succeed.
config.pbtxt:
triton log:
client error log:
tritonclient.utils.InferenceServerException: [StatusCode.INTERNAL] onnx runtime error 2: not enough space: expected 270080, got 261760
Triton Information
Triton server 21.11-py
After I turn off max_queue_delay_microseconds, Triton executes these three requests in sequence and everything is normal. However, when max_queue_delay_microseconds is configured, it seems that Triton eventually forces onnxruntime to handle two requests together and pins the wrong memory size. As can be seen from the client log, my second request shape is 1,422,80 and the size is 270080, but onnxruntime does not get enough space, because Triton pinned a memory size of 261760. This is very confusing.
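A back-of-the-envelope check (my own reading of the numbers, not something taken from the logs): assuming float32 input with an 80-dim feature axis, as is typical for a WeNet encoder, the two byte counts in the error are exactly what you would get if two requests with time lengths 422 and 409 were batched together but the buffer was sized by the shorter request instead of padding both to the longer one. The 409 is inferred from the numbers, not observed.

```python
# Hedged sanity check of the byte counts in the error message.
# Assumptions (not taken from the actual logs): float32 input, feature dim 80,
# and two batched requests with hypothetical time lengths 422 and 409.
BYTES_PER_ELEM = 4
FEAT_DIM = 80
LEN_A, LEN_B = 422, 409

expected = 2 * LEN_A * FEAT_DIM * BYTES_PER_ELEM  # batch of 2, both padded to the longer request
got      = 2 * LEN_B * FEAT_DIM * BYTES_PER_ELEM  # batch of 2 sized by the shorter request

print(expected)  # 270080, matches "expected 270080"
print(got)       # 261760, matches "got 261760"
```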