triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Ensemble Scheduler: Internal response allocation is not allocating memory at all #7593

Closed gpadiolleau closed 5 days ago

gpadiolleau commented 2 months ago

Description: The individual models work as expected, but the ensemble pipeline built from them raises [StatusCode.INTERNAL] in ensemble 'depthcomp_pipeline', onnx runtime error 2: not enough space: expected 1048576, got 0.

The expected 1048576 exactly matches the size of my rgbd_img input (1x4x512x512 = 1048576), the first input of the ONNX model (see the depthcomp_model config file below).

Triton Information: I am using the Triton container nvcr.io/nvidia/tritonserver:23.12-py3

Host Nvidia driver version: 545.29.06, host CUDA version: 12.3

HW: Nvidia GeForce RTX 4060, 8 GB

To Reproduce: I am using gRPC inference (with shared memory) for an ensemble pipeline called depthcomp_pipeline, composed of three models: depthcomp_preprocessing, depthcomp_model and depthcomp_postprocessing.

-> Note that each individual model works separately, and I can run gRPC inference with shared memory for each one.
-> The exact same configuration works perfectly in nvcr.io/nvidia/tritonserver:23.02-py3 (I just changed the custom python backend to python3.10).
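
For reference, the client side follows the standard tritonclient system shared memory flow; a simplified sketch (shapes and region names are illustrative, not my exact client code):

import numpy as np
import tritonclient.grpc as grpcclient
import tritonclient.utils.shared_memory as shm
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient(url="localhost:8001")
client.unregister_system_shared_memory()  # drop any stale regions

# Illustrative inputs for the ensemble.
rgb = np.zeros((1, 640, 640, 3), dtype=np.uint8)
depth = np.zeros((1, 640, 640), dtype=np.float32)

handles = []
inputs = []
for name, region, array in [("rgb_img", "rgb_shm", rgb),
                            ("depth_img", "depth_shm", depth)]:
    # Create a system shared memory region, copy the data in and register it.
    handle = shm.create_shared_memory_region(region, "/" + region, array.nbytes)
    shm.set_shared_memory_region(handle, [array])
    client.register_system_shared_memory(region, "/" + region, array.nbytes)
    handles.append(handle)

    infer_input = grpcclient.InferInput(name, list(array.shape),
                                        np_to_triton_dtype(array.dtype))
    infer_input.set_shared_memory(region, array.nbytes)
    inputs.append(infer_input)

# Output shared memory is omitted here for brevity.
outputs = [grpcclient.InferRequestedOutput("corr_depth_output")]
result = client.infer("depthcomp_pipeline", inputs=inputs, outputs=outputs)
print(result.as_numpy("corr_depth_output").shape)

client.unregister_system_shared_memory()
for handle in handles:
    shm.destroy_shared_memory_region(handle)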

Here are the config files used:

name: "depthcomp_preprocessing"
backend: "python"
max_batch_size: 0
input [
{
    name: "rgb_preproc_input"
    data_type: TYPE_UINT8
    dims: [ 1, -1, -1, 3 ]
},
{
    name: "depth_preproc_input"
    data_type: TYPE_FP32
    dims: [ 1, -1, -1 ]
}
]

output [
{
    name: "rgbd_preproc_output"
    data_type: TYPE_FP32
    dims: [ -1, 4, -1, -1 ]
},
{
    name: "holes_mask_output"
    data_type: TYPE_FP32
    dims: [ -1, 1, -1, -1]
},
{
    name: "ori_shape"
    data_type: TYPE_INT64
    dims: [ 1, 2 ]
},
{
    name: "ori_minmax"
    data_type: TYPE_FP32
    dims: [ 1, 2 ]
}
]

instance_group {
  count: 1
  kind: KIND_GPU
}

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}

name: "depthcomp_model"
platform: "onnxruntime_onnx"
backend:"onnxruntime"
max_batch_size : 0

input [
  {
    name: "rgbd_img"
    data_type: TYPE_FP32
    dims: [ 1, 4, -1, -1 ]
  },
  {
    name: "holes_mask"
    data_type: TYPE_FP32
    dims: [ 1, 1, -1, -1 ]
  }
]

output [
  {
    name: "depth_output"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1, -1 ]
  }
]

instance_group {
  count: 1
  kind: KIND_GPU
}

name: "depthcomp_postprocessing"
backend: "python"
max_batch_size: 0
input [

    {
        name: "depth_orishape"
        data_type: TYPE_INT64
        dims: [ 1, 2 ]
    },
    {
        name: "depth_oriminmax"
        data_type: TYPE_FP32
        dims: [ 1, 2 ]
    },
    {
        name: "ori_depth_input"
        data_type: TYPE_FP32
        dims: [ -1, -1, -1 ]
    },
    {
        name: "corr_depth_input"
        data_type: TYPE_FP32
        dims: [ -1, 1, -1, -1 ]
    }
]

output [
    {
        name: "corr_depth_output"
        data_type: TYPE_FP32
        dims: [ 1, -1, -1]
    }
]

instance_group {
  count: 1
  kind: KIND_GPU
}

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}

name: "depthcomp_pipeline"
platform: "ensemble"
max_batch_size: 0

input [
  {
    name: "rgb_img"
    data_type: TYPE_UINT8
    dims: [ 1, -1, -1, 3 ]
  },
  {
    name: "depth_img"
    data_type: TYPE_FP32
    dims: [ 1, -1, -1 ]
  }
]

output [
    {
        name: "corr_depth_output"
        data_type: TYPE_FP32
        dims: [ 1, -1, -1]
    }
]

ensemble_scheduling {
  step [
    {
      model_name: "depthcomp_preprocessing"
      model_version: -1

      input_map {
        key: "rgb_preproc_input"
        value: "rgb_img"
      }
            input_map {
        key: "depth_preproc_input"
        value: "depth_img"
      }

            output_map {
        key: "rgbd_preproc_output"
        value: "rgbd_input"
      }
        output_map {
            key: "holes_mask_output"
            value: "mask_input"
        }
        output_map {
        key: "ori_shape"
        value: "depth_ori_shape"
      }
            output_map {
        key: "ori_minmax"
        value: "depth_ori_minmax"
      }
    },

    {
        model_name: "depthcomp_model"
        model_version: -1

        input_map {
            key: "rgbd_img"
            value: "rgbd_input"
        }
        input_map {
            key: "holes_mask"
            value: "mask_input"
        }

        output_map {
            key: "depth_output"
            value: "corr_depth"
        }
    },

    {
        model_name: "depthcomp_postprocessing"
        model_version: -1

        input_map {
            key: "depth_orishape"
            value: "depth_ori_shape"
        }
        input_map {
            key: "depth_oriminmax"
            value: "depth_ori_minmax"
        }
        input_map {
            key: "ori_depth_input"
            value: "depth_img"
        }
        input_map {
            key: "corr_depth_input"
            value: "corr_depth"
        }

        output_map {
            key: "corr_depth_output"
            value: "corr_depth_output"
        }
    }
  ]
}

Expected behavior: I would expect the ensemble to work properly, since each individual model works and only the ensemble fails. I didn't find any change in the release notes that could cause this error, but I may have missed something; in that case, please point it out to me.

gpadiolleau commented 1 month ago

Update

I updated to nvcr.io/nvidia/tritonserver:24.08-py3 with Nvidia driver 560.35.03 and CUDA 12.6, but I get the same problem.

Going deeper into the logs, I found this:

I0905 12:18:31.251273 30850 infer_request.cc:905] "[request id: 240905-141831250860] prepared: [0x0x758764030910] request id: 240905-141831250860, model: depthcomp_preprocessing, requested version: -1, actual version: 1, flags: 0x0, correlation id: 0, batch size: 0, priority: 0, timeout (us): 0\noriginal inputs:\n[0x0x758764030f58] input: rgb_preproc_input, type: UINT8, original shape: [1,640,640,3], batch + shape: [1,640,640,3], shape: [1,640,640,3]\n[0x0x75876401a9b8] input: depth_preproc_input, type: FP32, original shape: [1,640,640], batch + shape: [1,640,640], shape: [1,640,640]\noverride inputs:\ninputs:\n[0x0x75876401a9b8] input: depth_preproc_input, type: FP32, original shape: [1,640,640], batch + shape: [1,640,640], shape: [1,640,640]\n[0x0x758764030f58] input: rgb_preproc_input, type: UINT8, original shape: [1,640,640,3], batch + shape: [1,640,640,3], shape: [1,640,640,3]\noriginal requested outputs:\nholes_mask_output\nori_minmax\nori_shape\nrgbd_preproc_output\nrequested outputs:\nholes_mask_output\nori_minmax\nori_shape\nrgbd_preproc_output\n"
I0905 12:18:31.251287 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from INITIALIZED to PENDING"
I0905 12:18:31.251299 30850 infer_handler.h:1360] "Returning from ModelInferHandler, 0, ISSUED"
I0905 12:18:31.251309 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from PENDING to EXECUTING"
I0905 12:18:31.251322 30850 python_be.cc:1209] "model depthcomp_preprocessing, instance depthcomp_preprocessing_0_0, executing 1 requests"
I0905 12:18:31.261439 30850 infer_response.cc:174] "add response output: output: rgbd_preproc_output, type: FP32, shape: [1,4,512,512]"
I0905 12:18:31.261465 30850 ensemble_scheduler.cc:569] "Internal response allocation: rgbd_preproc_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261468 30850 infer_response.cc:174] "add response output: output: holes_mask_output, type: FP32, shape: [1,1,512,512]"
I0905 12:18:31.261470 30850 ensemble_scheduler.cc:569] "Internal response allocation: holes_mask_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261473 30850 infer_response.cc:174] "add response output: output: ori_shape, type: INT64, shape: [1,2]"
I0905 12:18:31.261475 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_shape, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261476 30850 infer_response.cc:174] "add response output: output: ori_minmax, type: FP32, shape: [1,2]"
I0905 12:18:31.261480 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_minmax, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261519 30850 infer_handler.cc:1012] "ModelInferHandler::InferResponseComplete, 0 step ISSUED"
I0905 12:18:31.261901 30850 infer_handler.h:1350] "Received notification for ModelInferHandler, 0"
I0905 12:18:31.261908 30850 infer_handler.h:1353] "Grpc::CQ::Next() Running state_id 0\n\tContext step 0 id 0\n\t\t State id 0: State step 1\n"
I0905 12:18:31.261914 30850 infer_handler.cc:728] "Process for ModelInferHandler, rpc_ok=1, 0 step COMPLETE"
I0905 12:18:31.261917 30850 infer_handler.h:1360] "Returning from ModelInferHandler, 0, FINISH"
I0905 12:18:31.261920 30850 infer_handler.h:1353] "Grpc::CQ::Next() Running state_id 0\n\tContext step 0 id 0\n\t\t State id 0: State step 2\n"
I0905 12:18:31.261922 30850 infer_handler.cc:728] "Process for ModelInferHandler, rpc_ok=1, 0 step FINISH"
I0905 12:18:31.261924 30850 infer_handler.h:1356] "Done for ModelInferHandler, 0"
I0905 12:18:31.261926 30850 infer_handler.h:1251] "StateRelease, 0 Step FINISH"
I0905 12:18:31.262230 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from EXECUTING to RELEASED"
I0905 12:18:31.262239 30850 infer_request.cc:132] "[request id: 240905-141831250860] Setting state from EXECUTING to RELEASED"
I0905 12:18:31.262241 30850 infer_handler.cc:647] "ModelInferHandler::InferRequestComplete"
I0905 12:18:31.262247 30850 python_be.cc:2043] "TRITONBACKEND_ModelInstanceExecute: model instance name depthcomp_preprocessing_0_0 released 1 requests"

Particularly:

I0905 12:18:31.261439 30850 infer_response.cc:174] "add response output: output: rgbd_preproc_output, type: FP32, shape: [1,4,512,512]"
I0905 12:18:31.261465 30850 ensemble_scheduler.cc:569] "Internal response allocation: rgbd_preproc_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261468 30850 infer_response.cc:174] "add response output: output: holes_mask_output, type: FP32, shape: [1,1,512,512]"
I0905 12:18:31.261470 30850 ensemble_scheduler.cc:569] "Internal response allocation: holes_mask_output, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261473 30850 infer_response.cc:174] "add response output: output: ori_shape, type: INT64, shape: [1,2]"
I0905 12:18:31.261475 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_shape, size 0, addr 0, memory type 0, type id 0"
I0905 12:18:31.261476 30850 infer_response.cc:174] "add response output: output: ori_minmax, type: FP32, shape: [1,2]"
I0905 12:18:31.261480 30850 ensemble_scheduler.cc:569] "Internal response allocation: ori_minmax, size 0, addr 0, memory type 0, type id 0"

It seems that the ensemble scheduler is not able to allocate memory for the internal responses (everything is zero...).

I don't know if I have to allocate memory myself or not ...

gpadiolleau commented 1 month ago

Update

Still in nvcr.io/nvidia/tritonserver:24.08-py3 with Nvidia driver 560.35.03 and CUDA 12.6.

I converted my ensemble model into a BLS, to check whether it is really the ensemble scheduler causing the issue, but I still get the same problem:

[StatusCode.INTERNAL] Failed to process the request(s) for model 'depthcomp_bls_0_0', message: TritonModelException: 
Model depthcomp_model - Error when running inference:
[request id: <id_unknown>] input byte size mismatch for input 'holes_mask' for model 'depthcomp_model'. Expected 1048576, got 0

At:
  /models/py/depthcomp_bls/1/model.py(147): execute

None of the tutorials about ensemble scheduling or BLS mention any memory allocation for models "in the middle". I don't know what to do with this; it seems to only work with nvcr.io/nvidia/tritonserver:23.02-py3 and I don't know why.
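
For context, the call at model.py line 147 is just the BLS request to the intermediate ONNX model; the relevant part of execute() looks roughly like this (a simplified sketch; the _preprocess helper and the response handling are placeholders, not my exact code):

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Preprocessing step omitted; it yields the two numpy arrays below.
            rgbd_np, mask_np = self._preprocess(request)  # placeholder helper

            infer_request = pb_utils.InferenceRequest(
                model_name="depthcomp_model",
                requested_output_names=["depth_output"],
                inputs=[
                    pb_utils.Tensor("rgbd_img", rgbd_np),
                    pb_utils.Tensor("holes_mask", mask_np),
                ],
            )
            # This exec() call is where the byte size mismatch is reported.
            infer_response = infer_request.exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())

            depth = pb_utils.get_output_tensor_by_name(infer_response, "depth_output")
            # Postprocessing and final response construction omitted.
            responses.append(pb_utils.InferenceResponse(output_tensors=[depth]))
        return responses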

I found issues #4478 and #95 with somewhat similar errors, but the solutions do not apply here because the individual models work perfectly.

@Tabrizian @szalpal, sorry for tagging, but any help on this would be appreciated since I think I just missed something...

gpadiolleau commented 1 month ago

I made some further tests to try to understand what is going on.

I built a VERY simple ensemble with two python models, simple_add.py and simple_sub.py.
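
simple_add.py is essentially the stock python_backend add example (simple_sub.py is identical, with a subtraction); roughly:

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            a = pb_utils.get_input_tensor_by_name(request, "A").as_numpy()
            b = pb_utils.get_input_tensor_by_name(request, "B").as_numpy()
            # simple_sub.py computes (a - b) instead.
            c = pb_utils.Tensor("C", (a + b).astype(np.uint8))
            responses.append(pb_utils.InferenceResponse(output_tensors=[c]))
        return responses

Here are the three config files (simple_add, simple_sub and the ensemble):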

name: "simple_add"
backend: "python"
max_batch_size: 0
input [
{
    name: "A"
    data_type: TYPE_UINT8
    dims: [ 1, 1]
},
{
    name: "B"
    data_type: TYPE_UINT8
    dims: [ 1, 1 ]
}
]

output [
{
    name: "C"
    data_type: TYPE_UINT8
    dims: [ 1, 1 ]
}
]

instance_group {
  count: 1
  kind: KIND_GPU
}

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}

name: "simple_sub"
backend: "python"
max_batch_size: 0
input [
{
    name: "A"
    data_type: TYPE_UINT8
    dims: [ 1, 1]
},
{
    name: "B"
    data_type: TYPE_UINT8
    dims: [ 1, 1 ]
}
]

output [
{
    name: "C"
    data_type: TYPE_UINT8
    dims: [ 1, 1 ]
}
]

instance_group {
  count: 1
  kind: KIND_GPU
}

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}

name: "simple_ensemble"
platform: "ensemble"
max_batch_size: 0

input [
  {
    name: "A"
    data_type: TYPE_UINT8
    dims: [ 1, 1]
  },
  {
    name: "B"
    data_type: TYPE_UINT8
    dims: [ 1, 1 ]
  }
]

output [
    {
        name: "output"
        data_type: TYPE_UINT8
        dims: [ 1, 1]
    }
]

ensemble_scheduling {
  step [
    {
      model_name: "simple_add"
      model_version: -1

      input_map {
        key: "A"
        value: "A"
      }
      input_map {
        key: "B"
        value: "B"
      }

      output_map {
        key: "C"
        value: "C"
      }
    },

    {
        model_name: "simple_sub"
        model_version: -1

        input_map {
            key: "A"
            value: "C"
        }
        input_map {
            key: "B"
            value: "B"
        }

        output_map {
            key: "C"
            value: "output"
        }
    }
  ]
}

And I got the same error...

As a desperate move, I tried simply removing the EXECUTION_ENV_PATH parameter from the two python models, since:

-> The exact same configuration works perfectly in nvcr.io/nvidia/tritonserver:23.02-py3 (I just changed the custom python backend to python3.10)

By removing

parameters: {
  key: "EXECUTION_ENV_PATH",
  value: {string_value: "$$TRITON_MODEL_DIRECTORY/../triton_process.tar.gz"}
}

in both simple_add and simple_sub: it worked... 🤯 But the individual models still work separately if I leave the EXECUTION_ENV_PATH parameter, so it is not caused by the environment itself. MORE, if I remove the EXECUTION_ENV_PATH parameter only from the first model in the ensemble (i.e. simple_add), the whole ensemble works too! 🤯 🤯 It seems the issue only appears when the EXECUTION_ENV_PATH parameter is set for a model that is followed by another one in the ensemble.
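
To see what actually changes when EXECUTION_ENV_PATH is set, a quick check is to log which interpreter and numpy the packed environment really provides; a few debugging lines that can be added at the top of model.py in simple_add / simple_sub (not part of the original models):

# Debugging aid only: print which interpreter and numpy the packed
# execution environment (EXECUTION_ENV_PATH) actually loads.
import sys

import numpy as np

print(f"[python backend env] python: {sys.version.split()[0]}", flush=True)
print(f"[python backend env] numpy:  {np.__version__} ({np.__file__})", flush=True)

These prints should show up in the tritonserver log when the model instance starts, which makes it easy to compare the packed environment against the default backend environment.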

topuzm15 commented 2 weeks ago

Is there any update on the memory allocation error? We also have a similar problem: we have two models running with the onnx runtime and one BLS. We can get responses from the models separately, but when I send requests to the BLS that ensembles the models with some logic, it does not work.

the bls config:

name: "model-dcn-bls"
backend: "python"
max_batch_size: 2048

input [
  {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 120 ] 
  }
]

output [
  {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [1]
  }
]

instance_group [
  {
    kind: KIND_GPU
    count: 1
  }
]

error: onnx runtime error 2: not enough space: expected 452, got 0

gpadiolleau commented 2 weeks ago

Hi, unfortunately I never got any answer from the staff... Which Triton version are you using? For me it only works with version 23.02, maybe you could try that one.

szalpal commented 2 weeks ago

Hi,

while I may not be the best person to help with this, I'll try my best.

Could you tell us how you run tritonserver? In particular, if you are running it with Docker, are you assigning a sufficient --shm-size?

gpadiolleau commented 2 weeks ago

Hi,

Could you tell us how you run tritonserver? In particular, if you are running it with Docker, are you assigning a sufficient --shm-size?

Yes, I'm using a Docker container. I already tried changing the shm-size in Docker, but from what I remember it didn't change anything. I think the problem is more related to the python backend in ensembles, cf. my previous posts.

SDJustus commented 1 week ago

Just for visibility, I am running into the same issue on 24.09-py3:

Internal response allocation: OUTPUT_1, size 0, addr 0, memory type 0, type id 0
...
input byte size mismatch for input 'POST_INPUT_1' for model '3rd_model'. Expected 16, got 0

POST_INPUT_1 is mapped to OUTPUT_1 through the ensemble config.

Testing with 23.02-py3 now

SDJustus commented 1 week ago

@gpadiolleau this issue might be related to #7647. It would explain the difference you see when specifying an environment... Maybe your environment ships numpy 2.0 or above, which does not seem to be fully supported by the python backend. I am testing on my side as well.

gpadiolleau commented 1 week ago

@SDJustus nice catch!! I don't have time to test right now, but I did find that I indeed have numpy 2.1.0 installed in the Conda env I packed for the python backend. Hopefully this resolves the error; let us know if it works on your side.

SDJustus commented 1 week ago

@gpadiolleau downgrading numpy<2 fixed the issue!

gpadiolleau commented 5 days ago

I finally got time to test: downgrading numpy to 1.26 fixed the issue too.