triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.
BSD 3-Clause "New" or "Revised" License

Add support for ragged batching (especially useful for BERT-type models). #21

Closed: DavidLangworthy closed this issue 3 years ago

DavidLangworthy commented 3 years ago

This requires work in the backend. (David, we can talk more. We used this in our BERT submission.) Significant new code and some coordination. Potentially very valuable. Much more than 10%, significant.

kpedro88 commented 3 years ago

Synchronizing with the comments from https://github.com/triton-inference-server/server/issues/2158:

Is your feature request related to a problem? Please describe.

In scientific data analysis workflows, a single "event" (corresponding to a single request to the Triton server), may contain multiple objects such as clusters, each of which has a different number of components, and therefore a different number of features. The input data for these variable objects can be sent in a single request using "ragged batching", but currently this is only supported for the TensorRT backend (and any custom backends that happen to implement it).

Describe the solution you'd like

It would be very useful to support this feature in common ML backends: ONNX, PyTorch, TensorFlow.

Describe alternatives you've considered

Feature vectors can be padded to a universal length on the client side, but this is tedious (and potentially less efficient, as the request consumes more bandwidth to transmit useless information).
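
To illustrate the trade-off (a hypothetical sketch, not code from either project): with padding, every object's feature vector is zero-filled to a common length, whereas with ragged batching the variable-length vectors are simply concatenated and accompanied by per-object element counts.

import numpy as np

# Three objects with 2, 5, and 3 features each.
features = [np.random.rand(2), np.random.rand(5), np.random.rand(3)]

# Padded representation: shape (3, 5), mostly zeros for the short objects.
padded = np.zeros((len(features), max(len(f) for f in features)), dtype=np.float32)
for i, f in enumerate(features):
    padded[i, :len(f)] = f

# Ragged representation: one flat tensor plus the per-object element counts.
ragged = np.concatenate(features).astype(np.float32)   # shape (10,)
lengths = np.array([len(f) for f in features])          # [2, 5, 3]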

odivay commented 3 years ago

As it stands today, ONNX Runtime isn't competitive when you run an NLP graph at scale on GPUs, because dynamic batching à la Triton Inference Server isn't supported. The main issue is the lack of ragged tensor support in ONNX, whereas TensorFlow offers tf.Example and PyTorch is implementing NestedTensor. In theory, individual clients can pad, but in practice this is wasteful: you end up spending most GPU cycles processing zeros (far more than usual), in addition to increasing network overhead. This translates into higher latency and GPU costs in production.

askhade commented 3 years ago

https://github.com/triton-inference-server/onnxruntime_backend/pull/39

askhade commented 3 years ago

@DavidLangworthy: Closing this issue since #39 is merged. It is part of the 21.05 release.

Sitcebelly commented 1 year ago

Could you explain how to create an appropriate model for using ragged batching? My task is to build a sentiment model with ragged batching to increase throughput. Do I understand correctly that I need to create a new model from a Hugging Face model that handles two inputs: one ragged input with shape [-1] and a batch-accumulation input that also has shape [-1]? If so, there is a problem: torch.onnx converts Torch models to ONNX only with a batch dimension. I looked at the examples in the server repository and found test code that generates inputs with [-1] dimensions and then reshapes them to [batch_size, -1]. So the plan seems to be: convert the Torch model to ONNX with batch size 1 and two inputs (one for the ragged batch and one for the batch-accumulation input), and then extend that graph with an ONNX Reshape from [-1] to [1, -1] so the converted model can handle it. Is that really the only way to convert a model from Torch to ONNX for use with ragged batching? Please explain how to convert models from Torch to ONNX for ragged batching. @askhade, @kpedro88
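
A minimal sketch of the reshape-wrapper idea described above (illustrative only; RaggedWrapper and base_model are hypothetical names, not part of any referenced code, and this is not necessarily the recommended approach):

import torch
import torch.nn as nn

class RaggedWrapper(nn.Module):
    """Hypothetical wrapper: accepts the flat [-1] ragged input that Triton
    delivers and reshapes it to the [1, -1] layout that a model exported
    with batch size 1 expects."""
    def __init__(self, base_model):
        super().__init__()
        self.base_model = base_model

    def forward(self, flat_input, lengths):
        # flat_input is the concatenation of all requests' elements;
        # lengths is the batch-accumulation input supplied by Triton.
        batched = flat_input.reshape(1, -1)
        return self.base_model(batched, lengths)

Exporting such a wrapper with torch.onnx.export would bake the Reshape into the graph, which is one way to realize the plan described above.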

Sitcebelly commented 1 year ago

OK, I prepared a simple example.

I created a simple summing model that takes input and lengths, both with shape [-1]:

import torch
import torch.nn as nn

class SummingModel(nn.Module):
    def forward(self, input, lengths):
        # One sum per entry in lengths.
        batch_size = lengths.shape[0]
        sums = torch.zeros(batch_size, 1)
        start = 0
        for i in range(batch_size):
            # Sum the slice of the concatenated input that belongs to item i,
            # treating lengths[i] as that item's element count.
            end = start + lengths[i]
            sums[i][0] = input[start:end].sum()
            start = end
        return sums

model = SummingModel()
dummy_input = torch.randn(10), torch.tensor([2, 7, 10])

dynamic_axes = {
    'input': {0: 'length'},  # dynamic batch size for `input`
    'lengths': {0: 'length'},  # dynamic batch size for `lengths`
    'output': {0: 'batch_size'},  # dynamic batch size for output
}

torch.onnx.export(model, dummy_input, "1/model.onnx", input_names = ['input', 'lengths'], output_names = ['output'], dynamic_axes=dynamic_axes)

And Triton accepted this model with a config like:

name: "SummingModel"
max_batch_size: 16
platform: "onnxruntime_onnx"

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]

batch_input [
  {
    kind: BATCH_ACCUMULATED_ELEMENT_COUNT
    target_name: "lengths"
    data_type: TYPE_FP32
    source_input: "input"
  }
]

output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]
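
For reference, the request is made with a tritonclient call roughly along these lines (an illustrative sketch only, not the original client code; it assumes a server running at localhost:8000):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# One request with 4 elements; the leading 1 is the batch dimension,
# required because max_batch_size is set in the model config.
data = np.random.rand(1, 4).astype(np.float32)
inp = httpclient.InferInput("input", [1, 4], "FP32")
inp.set_data_from_numpy(data)

result = client.infer("SummingModel", inputs=[inp])
print(result.as_numpy("output"))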

But when I try to run this model with tritonhttpclient, it throws this error:

InferenceServerException: [400] onnx runtime error 2: Unexpected input data type. Actual: (tensor(float)) , expected: (tensor(int64))

It's also interesting that, in the documentation example, the data type for BATCH_ACCUMULATED_ELEMENT_COUNT is float32 rather than an integer type.

OK, after that I changed the data type for lengths in my config to int64, but now Triton won't load the model; it only accepts TYPE_INT32, while torch.tensor([2, 7, 10]) has dtype int64.

What am I doing wrong? Could you help me, please?
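
One possible way around the dtype mismatch described above (a sketch only, under the assumption that Triton's batch_input data_type is limited to TYPE_FP32 and TYPE_INT32) would be to export the model with int32 lengths, so that the ONNX graph expects tensor(int32) and matches the config's TYPE_INT32:

import torch

# SummingModel as defined earlier in this thread.
model = SummingModel()

# Cast the example lengths to int32 so the exported ONNX graph declares
# "lengths" as tensor(int32), matching what the batch_input can provide.
dummy_input = torch.randn(10), torch.tensor([2, 7, 10], dtype=torch.int32)

dynamic_axes = {
    'input': {0: 'length'},
    'lengths': {0: 'length'},
    'output': {0: 'batch_size'},
}

torch.onnx.export(model, dummy_input, "1/model.onnx",
                  input_names=['input', 'lengths'],
                  output_names=['output'],
                  dynamic_axes=dynamic_axes)

With the ONNX input declared as int32, the TYPE_INT32 lengths entry in the config and the model would agree on the data type.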