triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

How to format client code for inception example #237

Closed Skier23 closed 5 months ago

Skier23 commented 6 months ago

I'm trying to make a dali preprocessing pipeline extremely similar to the inception example but I can't seem to figure out how to get the correct format on the client side. Here is the code I have now on the client side:

def prepare_batches(data_loader):
    batches = []
    for dataiter in data_loader:
        images, labels = dataiter['image'], dataiter['label']
        # images is a list of encoded images as strings
        images_np = np.array(images, dtype=object)
        batches.append((images_np, labels))
    return batches

def infer_batch(client, batch):
    images_np, labels = batch
    inputs = grpcclient.InferInput("x", [len(images_np)], datatype="BYTES")
    inputs.set_data_from_numpy(images_np)

    try:
        detection_response = client.infer(model_name="simple_ensemble", inputs=[inputs])
        predictions = detection_response.as_numpy('classifier')
        return predictions, labels.numpy()
    except Exception as exc:
        print(f"Exception during inference: {exc}")
        return None, None

However, this code doesn't work with the DALI pipeline config file from the example: https://developer.nvidia.com/blog/accelerating-inference-with-triton-inference-server-and-dali/

max_batch_size: 256
input [
  {
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1 ]
  }
]

because Triton is expecting input with shape [-1, -1] but it's getting data with shape [-1]:

unexpected shape for input 'x' for model 'simple_ensemble'. Expected [-1,-1], got [4].

What would be the correct way to format the data on the client side?

szalpal commented 6 months ago

@Skier23 ,

could you attach a configuration for simple_ensemble model?

Skier23 commented 6 months ago

Yeah, it's basically just a straight passthrough to the DALI model:

name: "simple_ensemble"
platform: "ensemble"
max_batch_size: 64
input [
  {
    name: "x"
    data_type: TYPE_UINT8  # Encoded images are sent as strings
    dims: [ -1 ]
  }
]
output [
  {
    name: "classifier"
    data_type: TYPE_FP16
    dims: [ 36 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "dali_preprocessing"
      model_version: -1
      input_map {
        key: "DALI_INPUT_0"
        value: "x"
      }
      output_map {
        key: "DALI_OUTPUT_0"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "maxvit_rmlp_base"
      model_version: -1
      input_map {
        key: "x"
        value: "preprocessed_image"
      }
      output_map {
        key: "classifier"
        value: "classifier"
      }
    }
  ]
}
szalpal commented 6 months ago
max_batch_size: 64
input [
  {
    name: "x"
    data_type: TYPE_UINT8  # Encoded images are sent as strings
    dims: [ -1 ]
  }
]

I believe this part of the config is the problem. The dims: [-1] entry specifies the shape of a single input sample. Additionally, setting max_batch_size enables batching, so Triton expects the input to have two dimensions: [batch, sample]. The error message points out that you're passing a one-dimensional input. Would you mind double-checking this?
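To make the shape mismatch concrete, here is a minimal numpy sketch (the byte strings are just placeholders) of what the client code above sends versus what Triton expects when batching is enabled:

```python
import numpy as np

# What the client currently sends: a 1-D object array of 4 encoded
# images -- hence the "got [4]" in the error message.
one_d = np.array([b"img0", b"img1", b"img2", b"img3"], dtype=object)
print(one_d.shape)  # (4,) -- one dimension

# What Triton expects with max_batch_size > 0 and dims: [-1]:
# two dimensions, [batch, sample].
two_d = np.zeros((4, 1024), dtype=np.uint8)
print(two_d.shape)  # (4, 1024)
```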

Skier23 commented 6 months ago

Yeah, exactly that is the problem. But that's the exact same setup the linked example/tutorial uses. Personally, I would have expected a one-dimensional input from the client, where the dimension is the batch size and each element is the encoded string for one image in the batch. However, it looks like the tutorial/example also expects a two-dimensional input. So if that is indeed correct for a DALI pipeline like the one in the example tutorial, how should the client code be formatted to give Triton the data in the right shape?

Skier23 commented 6 months ago
max_batch_size: 64
input [
  {
    name: "x"
    data_type: TYPE_UINT8  # Encoded images are sent as strings
    dims: [ -1 ]
  }
]

I believe this part of the config is the problem. The dims: [-1] entry specifies the shape of a single input sample. Additionally, setting max_batch_size enables batching, so Triton expects the input to have two dimensions: [batch, sample]. The error message points out that you're passing a one-dimensional input. Would you mind double-checking this?

Also, when trying with max_batch_size set to 0 so that the input is one-dimensional, I get another error:

[8bc16ad44c00:1    :0:145] Caught signal 8 (Floating point exception: integer divide by zero)
==== backtrace (tid:    145) ====
 0 0x0000000000042520 __sigaction()  ???:0
 1 0x0000000000043d8d triton::backend::dali::DaliPipeline::SetInput()  :0
 2 0x00000000000440d5 triton::backend::dali::DaliPipeline::SetInput()  :0
 3 0x0000000000040512 triton::backend::dali::DaliExecutor::SetupInputs()  :0
 4 0x0000000000040ebe triton::backend::dali::DaliExecutor::Run()  :0
 5 0x000000000002de64 triton::backend::dali::DaliModelInstance::ProcessRequest()  :0
 6 0x000000000002e47a triton::backend::dali::DaliModelInstance::ExecuteUnbatched()  :0
 7 0x000000000001a960 TRITONBACKEND_ModelInstanceExecute()  ???:0
 8 0x00000000001a8d74 triton::core::TritonModelInstance::Execute()  :0
 9 0x00000000001a90db triton::core::TritonModelInstance::Schedule()  :0
10 0x00000000002bd9bd triton::core::Payload::Execute()  :0
11 0x00000000001acd64 triton::core::TritonModelInstance::TritonBackendThread::BackendThread()  :0
12 0x00000000000dc253 std::error_code::default_error_condition()  ???:0
13 0x0000000000094ac3 pthread_condattr_setpshared()  ???:0
14 0x0000000000125a04 clone()  ???:0

This leads me to believe that this likely isn't the correct input format.

Skier23 commented 6 months ago

@szalpal any ideas on the format the data needs to be on the client side when sent to triton in the tutorial example?

Skier23 commented 5 months ago

@banasraf Any ideas on how the format of the data should be from the clientside when sent to triton for a DALI pipeline that expects encoded images?

banasraf commented 5 months ago

@Skier23

The client should send a 2-dimensional tensor where the first dimension is the batch dimension. We do not support the BYTES type as an input right now; binary data should be sent as UINT8 tensors. This means that all the files you send in a batch need to be of equal length (since you send them as rows of a tensor). You can pad the data with zeros at the end of each file to make the sizes equal.
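The zero-padding described here can be sketched roughly like this (a minimal illustration, not code from the repo; `pad_batch` is a hypothetical helper name):

```python
import numpy as np

def pad_batch(encoded_images):
    """Zero-pad variable-length encoded files (bytes) into one
    uniform [batch, max_len] uint8 tensor."""
    arrays = [np.frombuffer(img, dtype=np.uint8) for img in encoded_images]
    max_len = max(a.size for a in arrays)
    batch = np.zeros((len(arrays), max_len), dtype=np.uint8)
    for i, a in enumerate(arrays):
        batch[i, :a.size] = a  # trailing bytes stay 0
    return batch

# Two "files" of different lengths become one (2, 6) uint8 array.
padded = pad_batch([b"\xff\xd8\xff\xe0\x00\x10", b"\xff\xd8\xff"])
print(padded.shape)  # (2, 6)
```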

Skier23 commented 5 months ago

Thanks for the reply! In a productionalized environment (as I'd classify DALI with Triton), you wouldn't usually have the same input sizes for all the images. If I were to pad the inputs to some fixed size, I'm not sure how I would know what that size should be. The other alternative would be to resize the images on the client side, but that takes away part of the few steps I wanted to do in the DALI pipeline: resizing, normalizing, and setting the data type. If we're already doing the resizing on the client side, it seems like we're losing some of the performance gain. Perhaps the alternative could be running the whole DALI pipeline on the client side and sending the images to the server (or encoding on the client and then just decoding on the server side). But even in this pattern, the inputs to the client-side DALI pipeline would be images of inconsistent sizes, so I'm struggling to see how an optimized flow might work.

banasraf commented 5 months ago

In a productionalized environment (as I'd classify DALI with Triton), you wouldn't usually have the same input sizes for all the images.

That's true, but when you compose a batch to be sent to Triton as a single request, you have access to all the images that you want to send, so you can pad them to the size of the biggest one. And the uniform-size requirement applies only to images sent in a single batch, so there's no need to track any size between requests.
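Putting the two points together, the original infer_batch could be adapted roughly as follows. This is a sketch that assumes the ensemble input x is re-declared in the config as a 2-D UINT8 tensor; `to_uint8_batch` is an illustrative helper name, and the tritonclient.grpc module is passed in only for the actual call:

```python
import numpy as np

def to_uint8_batch(encoded_images):
    """Pad this request's encoded images (bytes) to the longest one
    with trailing zeros, yielding a [batch, max_len] uint8 array."""
    arrays = [np.frombuffer(img, dtype=np.uint8) for img in encoded_images]
    max_len = max(a.size for a in arrays)
    batch = np.zeros((len(arrays), max_len), dtype=np.uint8)
    for i, a in enumerate(arrays):
        batch[i, :a.size] = a
    return batch

def infer_batch(client, grpcclient, encoded_images):
    """`client` is a tritonclient.grpc.InferenceServerClient and
    `grpcclient` is the tritonclient.grpc module."""
    batch = to_uint8_batch(encoded_images)
    # Shape [batch, sample] and datatype UINT8 match the adjusted config.
    inp = grpcclient.InferInput("x", list(batch.shape), "UINT8")
    inp.set_data_from_numpy(batch)
    response = client.infer(model_name="simple_ensemble", inputs=[inp])
    return response.as_numpy("classifier")
```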

Skier23 commented 5 months ago

Gotcha. That does make sense. Regarding the above options, which pattern would you generally think would be optimal:

  1. Doing the full DALI pipeline in the client and sending the raw (decoded) image data to the model in Triton
  2. Doing the full DALI pipeline in the client, re-encoding, and sending to the DALI ensemble, which decodes the image and sends it to the model
  3. Not using DALI on the client and just padding to the largest encoded image, then sending to DALI on Triton to decode, resize, and normalize

I'd imagine that with 1, the payloads to Triton would be medium-sized from resizing but still fairly large because they aren't encoded. With 2, the payload would be the smallest, but we have some excessive decoding, re-encoding, and decoding again. And with 3, we have some extra payload because the images are potentially a good bit larger than the resized size, but at least they are encoded.

banasraf commented 5 months ago

You could try the different options and compare them, because the cost of a specific setup might depend on your environment.

Option 3 would be the one we usually propose. Generally, encoding images (e.g. as JPEGs) reduces their size dramatically, so sending decoded data usually adds a lot of communication overhead. But if you resize the images down to dimensions much smaller than the original, they might not add that much overhead.

I would rather avoid decoding, encoding, and then decoding again on the server. Decoding is a costly operation that tends to dominate the cost of the whole preprocessing pipeline, so multiplying this work might be too expensive.

Skier23 commented 5 months ago

You could try the different options and compare them, because the cost of a specific setup might depend on your environment.

Option 3 would be the one we usually propose. Generally, encoding images (e.g. as JPEGs) reduces their size dramatically, so sending decoded data usually adds a lot of communication overhead. But if you resize the images down to dimensions much smaller than the original, they might not add that much overhead.

I would rather avoid decoding, encoding, and then decoding again on the server. Decoding is a costly operation that tends to dominate the cost of the whole preprocessing pipeline, so multiplying this work might be too expensive.

That makes sense. I'll start with trying option 3 and go from there. Thanks for the help!

Skier23 commented 5 months ago

That did end up working. I should have enough here to try out some different approaches. Thanks!