triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Conflicting error messages for batching mode in python backend #6832

Open briedel opened 8 months ago

briedel commented 8 months ago

When passing a single (unbatched) input array with

inputs = [httpclient.InferInput("input_branch1",
                                self.model_input_shape,
                                "FP32")]
outputs = [httpclient.InferRequestedOutput("Target1")]

inputs[0].set_data_from_numpy(input_data[0].astype(np.single))

I get the following error:

tritonclient.utils.InferenceServerException: [400] [request id: <id_unknown>] unexpected shape for input 'input_branch1' for model 'tglauch_classifier'. Expected [-1,10,10,60,16], got [10,10,60,16]. NOTE: Setting a non-zero max_batch_size in the model config requires a batch dimension to be prepended to each input shape. If you want to specify the full shape including the batch dim in your input dims config, try setting max_batch_size to zero. See the model configuration docs for more info on max_batch_size.
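
(For reference, a minimal client-side sketch of what that message asks for when max_batch_size stays at 64 in config.pbtxt: the batch dimension has to be prepended by the caller. The URL, the model name, and the random input_data array below are placeholders for the real setup.)

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")  # placeholder URL

# Placeholder for the real data; shape (48, 10, 10, 60, 16) as in the logs below
input_data = np.random.rand(48, 10, 10, 60, 16)

single = input_data[0].astype(np.single)   # shape (10, 10, 60, 16)
batched = np.expand_dims(single, axis=0)   # shape (1, 10, 10, 60, 16): batch dim prepended

inputs = [httpclient.InferInput("input_branch1", list(batched.shape), "FP32")]
inputs[0].set_data_from_numpy(batched)
outputs = [httpclient.InferRequestedOutput("Target1")]

result = client.infer("tglauch_classifier", inputs, outputs=outputs)
print(result.as_numpy("Target1").shape)    # expected (1, 5) given the output dims below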

When passing a batched input array:

inputs = [httpclient.InferInput("input_branch1",
                                self.model_input_shape,
                                "FP32")]
outputs = [httpclient.InferRequestedOutput("Target1")]

inputs[0].set_data_from_numpy(input_data.astype(np.single))

I get this error:

tritonclient.utils.InferenceServerException: got unexpected numpy array shape [48, 10, 10, 60, 16], expected [10, 10, 60, 16]
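
(This second message appears to be raised client-side by set_data_from_numpy(), which checks the numpy array against the shape passed to InferInput — here still the unbatched self.model_input_shape. A sketch of sending the whole batch in one request, reusing client and input_data from the sketch above and declaring the batched shape:)

batch = input_data.astype(np.single)       # shape (48, 10, 10, 60, 16)
inputs = [httpclient.InferInput("input_branch1", list(batch.shape), "FP32")]
inputs[0].set_data_from_numpy(batch)
outputs = [httpclient.InferRequestedOutput("Target1")]
result = client.infer("tglauch_classifier", inputs, outputs=outputs)

48 is within max_batch_size: 64, so once the shapes line up the server should accept the request.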

The config.pbtxt is:

name: "name"
platform: "onnxruntime_onnx"
max_batch_size: 64

and the autogenerated config.pbtxt is:

{
  "name": "name",
  "platform": "onnxruntime_onnx",
  "backend": "onnxruntime",
  "version_policy": { "latest": { "num_versions": 1 } },
  "max_batch_size": 64,
  "input": [ { "name": "input_branch1", "data_type": "TYPE_FP32", "dims": [ 10, 10, 60, 16 ] } ],
  "output": [ { "name": "Target1", "data_type": "TYPE_FP32", "dims": [ 5 ] } ],
  "batch_input": [],
  "batch_output": [],
  "optimization": { "priority": "PRIORITY_DEFAULT", "input_pinned_memory": { "enable": true }, "output_pinned_memory": { "enable": true }, "gather_kernel_buffer_threshold": 0, "eager_batching": false },
  "instance_group": [ { "name": "tglauch_classifier", "kind": "KIND_GPU", "count": 1, "gpus": [ 0 ], "secondary_devices": [], "profile": [], "passive": false, "host_policy": "" } ],
  "default_model_filename": "model.onnx",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": [],
  "dynamic_batching": {}
}

When I look at the config, the model config is:

Model Config: {'name': 'name', 'platform': 'onnxruntime_onnx', 'backend': 'onnxruntime',
 'version_policy': {'latest': {'num_versions': 1}},
 'max_batch_size': 64,
 'input': [{'name': 'input_branch1', 'data_type': 'TYPE_FP32', 'format': 'FORMAT_NONE', 'dims': [10, 10, 60, 16], 'is_shape_tensor': False, 'allow_ragged_batch': False, 'optional': False}],
 'output': [{'name': 'Target1', 'data_type': 'TYPE_FP32', 'dims': [5], 'label_filename': '', 'is_shape_tensor': False}],
 'batch_input': [], 'batch_output': [],
 'optimization': {'priority': 'PRIORITY_DEFAULT', 'input_pinned_memory': {'enable': True}, 'output_pinned_memory': {'enable': True}, 'gather_kernel_buffer_threshold': 0, 'eager_batching': False},
 'dynamic_batching': {'preferred_batch_size': [64], 'max_queue_delay_microseconds': 0, 'preserve_ordering': False, 'priority_levels': 0, 'default_priority_level': 0, 'priority_queue_policy': {}},
 'instance_group': [{'name': 'tglauch_classifier', 'kind': 'KIND_GPU', 'count': 1, 'gpus': [0], 'secondary_devices': [], 'profile': [], 'passive': False, 'host_policy': ''}],
 'default_model_filename': 'model.onnx', 'cc_model_filenames': {}, 'metric_tags': {}, 'parameters': {}, 'model_warmup': []}

and the model metadata is:

Model Metadata: {'name': 'name', 'versions': ['1'], 'platform': 'onnxruntime_onnx', 'inputs': [{'name': 'input_branch1', 'datatype': 'FP32', 'shape': [-1, 10, 10, 60, 16]}], 'outputs': [{'name': 'Target1', 'datatype': 'FP32', 'shape': [-1, 5]}]} (i3.py:518 in _configure_model)
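
(For reference, the same config and metadata can be pulled with the Python client instead of raw REST; a small sketch, assuming the server is reachable at localhost:8000 and the model is registered as tglauch_classifier:)

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
config = client.get_model_config("tglauch_classifier")      # auto-completed config as a dict
metadata = client.get_model_metadata("tglauch_classifier")  # shapes as the server sees them
print(config["max_batch_size"], metadata["inputs"][0]["shape"])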

Changing max_batch_size to 0 throws a different error:

tritonclient.utils.InferenceServerException: [400] [request id: <id_unknown>] inference request batch-size must be <= 4 for 'name'

Querying the config via REST, I found that max_batch_size had been changed to 4:

Model Config: {'name': 'name', 'platform': 'onnxruntime_onnx', 'backend': 'onnxruntime',
 'version_policy': {'latest': {'num_versions': 1}},
 'max_batch_size': 4,
 'input': [{'name': 'input_branch1', 'data_type': 'TYPE_FP32', 'format': 'FORMAT_NONE', 'dims': [10, 10, 60, 16], 'is_shape_tensor': False, 'allow_ragged_batch': False, 'optional': False}],
 'output': [{'name': 'Target1', 'data_type': 'TYPE_FP32', 'dims': [5], 'label_filename': '', 'is_shape_tensor': False}],
 'batch_input': [], 'batch_output': [],
 'optimization': {'priority': 'PRIORITY_DEFAULT', 'input_pinned_memory': {'enable': True}, 'output_pinned_memory': {'enable': True}, 'gather_kernel_buffer_threshold': 0, 'eager_batching': False},
 'dynamic_batching': {'preferred_batch_size': [4], 'max_queue_delay_microseconds': 0, 'preserve_ordering': False, 'priority_levels': 0, 'default_priority_level': 0, 'priority_queue_policy': {}},
 'instance_group': [{'name': 'name', 'kind': 'KIND_GPU', 'count': 1, 'gpus': [0], 'secondary_devices': [], 'profile': [], 'passive': False, 'host_policy': ''}],
 'default_model_filename': 'model.onnx', 'cc_model_filenames': {}, 'metric_tags': {}, 'parameters': {}, 'model_warmup': []}

The config auto-complete changed max_batch_size in my config.pbtxt from 0 to 4.

I did notice this warning:

W0125 17:09:56.807240 1 onnxruntime.cc:813] autofilled max_batch_size to 4 for model 'tglauch_classifier' since batching is supporrted but no max_batch_size is specified in model configuration. Must specify max_batch_size to utilize autofill with a larger max batch size

when the config was:

name: "name"
platform: "onnxruntime_onnx"
max_batch_size: 0
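
(The first error message points at the alternative route: with max_batch_size: 0 the full shape, batch dimension included, has to be spelled out in dims. A sketch of what that could look like, with shapes copied from the metadata above and not verified against this model:)

name: "name"
platform: "onnxruntime_onnx"
max_batch_size: 0
input [
  {
    name: "input_branch1"
    data_type: TYPE_FP32
    dims: [ -1, 10, 10, 60, 16 ]
  }
]
output [
  {
    name: "Target1"
    data_type: TYPE_FP32
    dims: [ -1, 5 ]
  }
]

If auto-complete still overrides max_batch_size, starting tritonserver with --disable-auto-complete-config (or the older --strict-model-config=true) should keep the configuration as written; worth checking against the server version in use.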
jbkyang-nvi commented 8 months ago

Hello, can you explain what kind of model you are using? Some models do not support dynamic batching.

It looks like you put dims: [10, 10, 60, 16] in your config.pbtxt. However, you should be setting dims: [-1, 10, 10, 60, 16] in config.pbtxt because you are using a variable input dimension size. Since your model supports the 4-D input, auto-complete assumes that is the input shape; it does not know that your inputs are already batched. The problem is in the model configuration shapes, which is why changing max_batch_size alone does not make the error go away.

This is only tangentially related to dynamic batching on the server side, which additionally batches individual requests together into larger inputs for the model to process at once. See here for details on dynamic batching. You are also trying to use dynamic batching on your model. As noted in your first error prompt, you enable batching by specifying a maximum batch size.
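
(For reference, a minimal dynamic_batching sketch in config.pbtxt; the values are illustrative only, not tuned for this model, and it only applies together with a non-zero max_batch_size:)

dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
  max_queue_delay_microseconds: 100
}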

cc: @tanmayv25 @nv-kmcgill53 (for auto-complete)