Update
I updated to nvcr.io/nvidia/tritonserver:24.08-py3 with NVIDIA driver 560.35.03 and CUDA 12.6, but I get the same problem.
Going deeper into the logs, I found this:
I0905 12:18:31.251273 30850 infer_request.cc:905] "[request id: 240905-141831250860] prepared: [0x0x758764030910] request id: 240905-141831250860, model: depthcomp_preprocessing, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0\noriginal inputs:\n[0x0x758764030f58] input: rgb_preproc_input, type: UINT8, original shape: [1,640,640,3], batch + shape: [1,640,640,3], shape: [1,640,640,3]\n[0x0x75876401a9b8] input: depth_preproc_input, type: FP32, original shape: [1,640,640], batch + shape: [1,640,640], shape: [1,640,640]\noverride inputs:\ninputs:\n[0x0x75876401a9b8] input: depth_preproc_input, type: FP32, original shape: [1,640,640], batch + shape: [1,640,640], shape: [1,640,640]\n[0x0x758764030f58] input: rgb_preproc_input, type: UINT8, original shape: [1,640,640,3], batch + shape: [1,640,640,3], shape: [1,640,640,3]\noriginal requested outputs:\nholes_mask_output\nori_minmax\nori_shape\nrgbd_preproc_output\nrequested outputs:\nholes_mask_output\nori_minmax\nori_shape\nrgbd_preproc_output\n"
I0905 12:18:31.251287 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from INITIALIZED to PENDING"
I0905 12:18:31.251299 30850 infer_handler.h:1360] "Returning from ModelInferHandler, 0, ISSUED"
I0905 12:18:31.251309 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from PENDING to EXECUTING"
I0905 12:18:31.251322 30850 python_be.cc:1209] "model depthcomp_preprocessing, instance depthcomp_preprocessing_0_0, executing 1 requests"
I0905 12:18:31.261439 30850 infer_response.cc:174] "add response output: output: rgbd_preproc_output, type: FP32, shape: [1,4,512,512]"
I0905 12:18:31.261465 30850 ensemble_scheduler.cc:569] "Internal response allocation: rgbd_preproc_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261468 30850 infer_response.cc:174] "add response output: output: holes_mask_output, type: FP32, shape: [1,1,512,512]"
I0905 12:18:31.261470 30850 ensemble_scheduler.cc:569] "Internal response allocation: holes_mask_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261473 30850 infer_response.cc:174] "add response output: output: ori_shape, type: INT64, shape: [1,2]"
I0905 12:18:31.261475 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_shape, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261476 30850 infer_response.cc:174] "add response output: output: ori_minmax, type: FP32, shape: [1,2]"
I0905 12:18:31.261480 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_minmax, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261519 30850 infer_handler.cc:1012] "ModelInferHandler::InferResponseComplete, 0 step ISSUED"
I0905 12:18:31.261901 30850 infer_handler.h:1350] "Received notification for ModelInferHandler, 0"
I0905 12:18:31.261908 30850 infer_handler.h:1353] "Grpc::CQ::Next() Running state_id 0\n\tContext step 0 id 0\n\t\t State id 0: State step 1\n"
I0905 12:18:31.261914 30850 infer_handler.cc:728] "Process for ModelInferHandler, rpc_ok=1, 0 step COMPLETE"
I0905 12:18:31.261917 30850 infer_handler.h:1360] "Returning from ModelInferHandler, 0, FINISH"
I0905 12:18:31.261920 30850 infer_handler.h:1353] "Grpc::CQ::Next() Running state_id 0\n\tContext step 0 id 0\n\t\t State id 0: State step 2\n"
I0905 12:18:31.261922 30850 infer_handler.cc:728] "Process for ModelInferHandler, rpc_ok=1, 0 step FINISH"
I0905 12:18:31.261924 30850 infer_handler.h:1356] "Done for ModelInferHandler, 0"
I0905 12:18:31.261926 30850 infer_handler.h:1251] "StateRelease, 0 Step FINISH"
I0905 12:18:31.262230 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from EXECUTING to RELEASED"
I0905 12:18:31.262239 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from EXECUTING to RELEASED"
I0905 12:18:31.262241 30850 infer_handler.cc:647] "ModelInferHandler::InferRequestComplete"
I0905 12:18:31.262247 30850 python_be.cc:2043] "TRITONBACKEND_ModelInstanceExecute: model instance name depthcomp_preprocessing_0_0 released 1 requests"
Particularly:
I0905 12:18:31.261439 30850 infer_response.cc:174] "add response output: output: rgbd_preproc_output, type: FP32, shape: [1,4,512,512]"
I0905 12:18:31.261465 30850 ensemble_scheduler.cc:569] "Internal response allocation: rgbd_preproc_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261468 30850 infer_response.cc:174] "add response output: output: holes_mask_output, type: FP32, shape: [1,1,512,512]"
I0905 12:18:31.261470 30850 ensemble_scheduler.cc:569] "Internal response allocation: holes_mask_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261473 30850 infer_response.cc:174] "add response output: output: ori_shape, type: INT64, shape: [1,2]"
I0905 12:18:31.261475 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_shape, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261476 30850 infer_response.cc:174] "add response output: output: ori_minmax, type: FP32, shape: [1,2]"
I0905 12:18:31.261480 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_minmax, size 0, addr 0, memory type 0, type id 0"
It seems the ensemble scheduler is not able to allocate memory for the internal responses (everything is zero...).
I don't know whether I have to allocate this memory myself or not...
Update
Still on nvcr.io/nvidia/tritonserver:24.08-py3 with NVIDIA driver 560.35.03 and CUDA 12.6.
I converted my ensemble model into a BLS model, to check whether it is really the ensemble scheduler that causes the issue, but I still get the same problem:
[StatusCode.INTERNAL] Failed to process the request(s) for model 'depthcomp_bls_0_0', message: TritonModelException:
Model depthcomp_model - Error when running inference:
[request id: <id_unknown>] input byte size mismatch for input 'holes_mask' for model 'depthcomp_model'. Expected 1048576, got 0
At:
/models/py/depthcomp_bls/1/model.py(147): execute
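For context, the BLS model just chains the sub-models with pb_utils.InferenceRequest, following the standard Python-backend BLS pattern. A rough sketch of the relevant part of execute (the requested output name "depth_output" is a placeholder; the other tensor names come from the logs and configs above):

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Run the preprocessing model first.
            preproc_request = pb_utils.InferenceRequest(
                model_name="depthcomp_preprocessing",
                requested_output_names=["rgbd_preproc_output", "holes_mask_output"],
                inputs=[
                    pb_utils.get_input_tensor_by_name(request, "rgb_preproc_input"),
                    pb_utils.get_input_tensor_by_name(request, "depth_preproc_input"),
                ],
            )
            preproc_response = preproc_request.exec()
            if preproc_response.has_error():
                raise pb_utils.TritonModelException(preproc_response.error().message())

            rgbd = pb_utils.get_output_tensor_by_name(preproc_response, "rgbd_preproc_output")
            holes = pb_utils.get_output_tensor_by_name(preproc_response, "holes_mask_output")

            # Rename the intermediate tensors (CPU tensors here) to the ONNX model's
            # input names and call it; this is the request that fails with
            # "input byte size mismatch ... Expected 1048576, got 0".
            onnx_request = pb_utils.InferenceRequest(
                model_name="depthcomp_model",
                requested_output_names=["depth_output"],  # placeholder output name
                inputs=[
                    pb_utils.Tensor("rgbd_img", rgbd.as_numpy()),
                    pb_utils.Tensor("holes_mask", holes.as_numpy()),
                ],
            )
            onnx_response = onnx_request.exec()
            if onnx_response.has_error():
                raise pb_utils.TritonModelException(onnx_response.error().message())

            responses.append(
                pb_utils.InferenceResponse(output_tensors=onnx_response.output_tensors())
            )
        return responses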
None of the tutorials about ensemble scheduling or BLS mention any memory allocation for models "in the middle" of a pipeline.
I don't know what to do with this; it only seems to work with nvcr.io/nvidia/tritonserver:23.02-py3, and I don't know why.
I found issues #4478 and #95 with somewhat similar errors, but the solutions do not apply here because the individual models work perfectly.
@Tabrizian @szalpal, sorry for tagging you, but any help would be appreciated since I think I just missed something...
I ran some further tests to try to understand. I built a VERY simple ensemble with two Python models, simple_add.py and simple_sub.py. Here are the three config.pbtxt files (simple_add, simple_sub, and the ensemble):
name: "simple_add"
backend: "python"
max_batch_size: 0
input [
{
name: "A"
data_type: TYPE_UINT8
dims: [ 1, 1]
},
{
name: "B"
data_type: TYPE_UINT8
dims: [ 1, 1 ]
}
]
output [
{
name: "C"
data_type: TYPE_UINT8
dims: [ 1, 1 ]
}
]
instance_group {
count: 1
kind: KIND_GPU
}
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}
name: "simple_sub"
backend: "python"
max_batch_size: 0
input [
{
name: "A"
data_type: TYPE_UINT8
dims: [ 1, 1]
},
{
name: "B"
data_type: TYPE_UINT8
dims: [ 1, 1 ]
}
]
output [
{
name: "C"
data_type: TYPE_UINT8
dims: [ 1, 1 ]
}
]
instance_group {
count: 1
kind: KIND_GPU
}
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}
name: "simple_ensemble"
platform: "ensemble"
max_batch_size: 0
input [
{
name: "A"
data_type: TYPE_UINT8
dims: [ 1, 1]
},
{
name: "B"
data_type: TYPE_UINT8
dims: [ 1, 1 ]
}
]
output [
{
name: "output"
data_type: TYPE_UINT8
dims: [ 1, 1]
}
]
ensemble_scheduling {
step [
{
model_name: "simple_add"
model_version: -1
input_map {
key: "A"
value: "A"
}
input_map {
key: "B"
value: "B"
}
output_map {
key: "C"
value: "C"
}
},
{
model_name: "simple_sub"
model_version: -1
input_map {
key: "A"
value: "C"
}
input_map {
key: "B"
value: "B"
}
output_map {
key: "C"
value: "output"
}
}
]
}
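For reference, the two Python models do nothing beyond this kind of minimal Python-backend sketch (simple_sub returns a - b instead of a + b):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            a = pb_utils.get_input_tensor_by_name(request, "A").as_numpy()
            b = pb_utils.get_input_tensor_by_name(request, "B").as_numpy()
            # simple_sub computes a - b here instead
            c = pb_utils.Tensor("C", (a + b).astype(np.uint8))
            responses.append(pb_utils.InferenceResponse(output_tensors=[c]))
        return responses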
And I got the same error...
As a desperate move, I tried simply removing the EXECUTION_ENV_PATH parameter from the two Python models, since the exact same configuration works perfectly in nvcr.io/nvidia/tritonserver:23.02-py3 (I just changed the custom Python backend to Python 3.10).
By removing
parameters: {
key: "EXECUTION_ENV_PATH",
value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}
in both simple_add and simple_sub: it worked... 🤯
But the individual models still work separately if I leave the EXECUTION_ENV_PATH parameter in place, so the problem is not caused by the environment itself.
MORE: if I remove the EXECUTION_ENV_PATH parameter only from the first model in the ensemble (i.e. simple_add), the whole ensemble works too! 🤯 🤯
It seems the issue only occurs when EXECUTION_ENV_PATH is set for a model that is followed by another model in the ensemble.
Is there any update on the memory allocation error? We have a similar problem: two models running with ONNX Runtime and one BLS model. We can get responses from the models separately, but when I send requests to the BLS that combines the models with some logic, it does not work.
The BLS config:
name: "model-dcn-bls"
backend: "python"
max_batch_size: 2048
input [
{
name: "input__0"
data_type: TYPE_FP32
dims: [ 120 ]
}
]
output [
{
name: "output__0"
data_type: TYPE_FP32
dims: [1]
}
]
instance_group [
{
kind: KIND_GPU
count: 1
}
]
error:
onnx runtime error 2: not enough space: expected 452, got 0
Hi, unfortunately I never got any answer from the staff... Which Triton version are you using? For me it only works with version 23.02; maybe you could try that one.
Hi,
while I may not be the best person to help with this, I'll try my best.
Could you tell how you run tritonserver? In particular, if you are running it with docker, are you assigning a sufficient --shm-size?
Hi,
I'm using a Docker container, yes. I already tried changing the shm-size in Docker, but from what I can remember it didn't change anything. I think the problem is more related to the Python backend in ensembles, cf. my previous posts.
Just for visibility, I'm running into the same issue on 24.09-py3: ... Internal response allocation: OUTPUT_1, size 0, addr 0, memory type 0, type id 0 ... input byte size mismatch for input 'POST_INPUT_1' for model '3rd_model'. Expected 16, got 0 ... (POST_INPUT_1 maps to OUTPUT_1 through the ensemble config.)
Testing with 23.02-py3 now
@gpadiolleau this issue might be related to #7647. It would explain the difference you are seeing when specifying an environment... Maybe your environment is built with numpy 2.0 or above, which does not seem to be fully supported by the Python backend. I am testing on my side as well.
@SDJustus nice catch!! I don't have time to test it right now, but I did find that I indeed have numpy 2.1.0 installed in the Conda env I packed for the Python backend. Hopefully this resolves the error; let us know if it works on your side.
@gpadiolleau downgrading numpy<2 fixed the issue!
I finally got time to test: downgrading numpy to 1.26 fixed the issue too.
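For anyone else hitting this: a quick way to confirm which numpy the Python backend stub actually loads from the packed environment is to log it in initialize. A minimal sketch (pb_utils.Logger is part of the Python backend API):

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # model.py runs inside the EXECUTION_ENV_PATH environment, so this reports
        # the numpy version the stub really imports; >= 2.0 showed the zero-byte
        # outputs described above, while < 2.0 works.
        pb_utils.Logger.log_info(f"Python backend using numpy {np.__version__}")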
Description
Individual models work as expected, but the ensemble pipeline built from them raises
[StatusCode.INTERNAL] in ensemble 'depthcomp_pipeline', onnx runtime error 2: not enough space: expected 1048576, got 0
where the expected 1048576 byte size exactly matches the byte size of my rgbd_img ONNX model first input (1x4x512x512 = 1048576; see below the config file of depthcomp_model).
Triton Information
I am using Triton container: nvcr.io/nvidia/tritonserver:23.12-py3
Host NVIDIA driver version: 545.29.06
Host CUDA version: 12.3
HW: NVIDIA GeForce RTX 4060 8GB
To Reproduce
I am using gRPC inference (with shared memory) for a pipeline ensemble called depthcomp_pipeline, composed of three models: depthcomp_preprocessing, depthcomp_model and depthcomp_postprocessing.
-> Note that each individual model works separately and I can run gRPC inference with shared memory for each.
-> The exact same configuration works perfectly in nvcr.io/nvidia/tritonserver:23.02-py3 (I just changed the custom Python backend to Python 3.10).
Here are the config files used:
Expected behavior
I would expect the ensemble to work properly, since each individual model works and only the ensemble fails. I didn't find any change in the release notes that could cause this error, but I may have missed something; if so, please point it out to me.
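For completeness, the "gRPC inference with shared memory" client mentioned in To Reproduce follows the standard tritonclient pattern. A rough sketch, assuming the ensemble exposes the same input names/shapes as the preprocessing model in the logs above (adjust names to your actual config):

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm

client = grpcclient.InferenceServerClient("localhost:8001")

rgb = np.zeros((1, 640, 640, 3), dtype=np.uint8)
depth = np.zeros((1, 640, 640), dtype=np.float32)

# Put both inputs in one system shared-memory region and register it with the server.
byte_size = rgb.nbytes + depth.nbytes
handle = shm.create_shared_memory_region("input_region", "/input_region", byte_size)
shm.set_shared_memory_region(handle, [rgb, depth])
client.register_system_shared_memory("input_region", "/input_region", byte_size)

inputs = [
    grpcclient.InferInput("rgb_preproc_input", list(rgb.shape), "UINT8"),
    grpcclient.InferInput("depth_preproc_input", list(depth.shape), "FP32"),
]
inputs[0].set_shared_memory("input_region", rgb.nbytes, offset=0)
inputs[1].set_shared_memory("input_region", depth.nbytes, offset=rgb.nbytes)

result = client.infer("depthcomp_pipeline", inputs=inputs)
print(result.get_response())

client.unregister_system_shared_memory("input_region")
shm.destroy_shared_memory_region(handle)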