triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html

Batching does not improve performance with dali #178

Open hly0025 opened 1 year ago

hly0025 commented 1 year ago

Issue

Batching does not improve performance with dali.

Description

In summary, inference slows as we increase batching in our application.

We have an application that sends data to Triton for inferencing. As mentioned above, batching does not seem to improve performance with DALI. We are using an ensemble model that runs DALI for preprocessing and then does object detection with YOLO. Specifically, a batch size of 8 is significantly slower than a batch size of 1, and we have only seen this with the DALI portion of the application, which is much slower than the object-detection portion.

Using perf_analyzer with batch sizes 1 and 8 at a concurrency of 2 shows improved inferences/sec, as one might expect. However, we have not observed this in the application. Manual timing of the application shows that DALI takes up the majority of the inference time (object detection seems to be fine).

It is worth mentioning that we are testing by sending batches in the application as well as using dynamic batching in triton, as can be seen below in the configs.

Perf Analyzer/Application Infer Timing

We ran our application and timed the median average latency in milliseconds for preprocessing and object-detection inference at batch sizes 1 and 8. We also ran perf_analyzer at batch sizes 1 and 8 against Triton using the configs provided below.

Batch 1 object detection: avg request latency 4.623 ms, while timing object-detection inference in our application code gives a median average of 9.017 ms. With perf_analyzer at batch 1 and concurrency 2, we get a throughput of 91.9781 infer/sec.

Batch 8 object detection: avg request latency 10.749 ms, while timing object-detection inference in our application code gives a median average of 86.335 ms. With perf_analyzer at batch 8 and concurrency 2, we get a throughput of 170.247 infer/sec.

Additional information from perf-analyzer has been attached as a csv.
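
For reference, the batch-8, concurrency-2 run corresponds to a perf_analyzer invocation roughly like the one below (the model name, server address, and CSV output flag are assumptions, not our exact command):

perf_analyzer -m ensemble -u localhost:8001 -i grpc -b 8 --concurrency-range 2 -f ensemble-concur2-ceiling8-batch8.csv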

Config Information

Here is the configuration information:

ensemble - config.pbtxt

name: "ensemble"
platform: "ensemble"
max_batch_size: 8 
input [
  {
    name: "frame"
    data_type: TYPE_UINT8
    dims: [ 1080, 1920, 3 ]
  }
]
output [
  {
    name: "yolo_num_detections"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "yolo_detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
  },
  {
    name: "yolo_detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "yolo_detection_classes"
    data_type: TYPE_INT32
    dims: [ 100 ]
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "raw_images"
        value: "frame"
      }
      output_map [
        {
            key: "yolo_prep_output"
            value: "yolo_preprocessed_image"
        }
      ]
    },
    {
      model_name: "object_detection"
      model_version: -1
      input_map [
        {
            key: "images"
            value: "yolo_preprocessed_image"
        }
      ]
      output_map [
            {
                key: "num_detections"
                value: "yolo_num_detections"
            },
            {
                key: "detection_boxes"
                value: "yolo_detection_boxes"
            },
            {
                key: "detection_scores"
                value: "yolo_detection_scores"
            },
            {
                key: "detection_classes"
                value: "yolo_detection_classes"
            }
        ]
    }
  ]
}

object detection - config.pbtxt

name: "object_detection"
platform: "tensorrt_plan"
max_batch_size: 8 
input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 384, 640 ]
  }
]
output [
  {
    name: "num_detections"
    data_type: TYPE_INT32
    dims: [ 1 ]
  },
  {
    name: "detection_boxes"
    data_type: TYPE_FP32
    dims: [ 100, 4 ]
  },
  {
    name: "detection_scores"
    data_type: TYPE_FP32
    dims: [ 100 ]
  },
  {
    name: "detection_classes"
    data_type: TYPE_INT32
    dims: [ 100 ]
  }
]
instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 8 ]
}

pre-processing - config.pbtxt

name: "preprocessing"
backend: "dali"
max_batch_size: 8 
input [
    {
        name: "raw_images"
        data_type: TYPE_UINT8
        dims: [ 1080, 1920, 3 ]
    }
]

output [
    {
        name: "yolo_prep_output"
        data_type: TYPE_FP32
        dims: [ 3, 384, 640 ]
    }
]
dynamic_batching {
  preferred_batch_size: [ 8 ]
}

dali.py

import nvidia.dali as dali
import nvidia.dali.plugin.triton as triton

@triton.autoserialize
@dali.pipeline_def(batch_size=8, num_threads=4, device_id=0)
def pipe():
    images = dali.fn.external_source(device="gpu", name="raw_images")
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB
    )

    # YOLO PRE-PROCESSING
    # Resize to 640x360, then letterbox with 12 rows of value 114 on top and
    # bottom to reach the 384x640 input expected by the detector.
    images = dali.fn.resize(images, resize_x=640, resize_y=360)
    pad = dali.types.Constant(
        value=114,
        dtype=dali.types.DALIDataType.UINT8,
        shape=[12, 640, 3],
        layout="HWC",
        device="gpu"
    )
    yolo_images = dali.fn.cat(pad, images, pad, axis=0)
    # HWC -> CHW
    yolo_images = dali.fn.transpose(yolo_images, perm=[2, 0, 1])
    # normalize to [0, 1]
    yolo_images = yolo_images / 255

    return yolo_images

Questions

Perf-Analyzer CSV Output

ensemble-concur2-ceiling8-batch8.csv ensemble-concur2-ceiling8-batch1.csv

banasraf commented 1 year ago

Hello @hly0025. Just looking at the numbers from perf_analyzer, increasing the batch size does provide a performance improvement, right? It roughly doubles the throughput while the latency also only roughly doubles (against an 8 times bigger batch).

Can you tell us how the time measurement in your application is done, and do you have any idea why it could yield such different perf results?

hly0025 commented 1 year ago

Hello @banasraf

Time Measurements

Thanks for your reply. Here is how the measurement was done. In our application code (i.e. the client side), I timed preprocessing and object detection separately:

  start_timer = time.perf_counter()
  batch_results = self.triton_client.infer(
                self.triton_config.model_name,
                inputs=inputs,
                outputs=outputs,
                client_timeout=None,
                compression_algorithm=None,
            )
  end = time.perf_counter()
  self.write_to_csv("trt_infer_objectdetection", end - start_timer, tic_filename)

This writes to the file each time the function is called. I then took all the results in the CSV file and computed the median average time for preprocessing and for object detection when sending a batch of size 1 or size 8 from our application (client side).

This is how I obtained the median average of 9.017 ms for object-detection inference at batch size 1.
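
As a minimal sketch, computing that median from the timing CSV could look roughly like this (the file layout and column order are assumptions for illustration, not our exact post-processing script):

import csv
import statistics

def median_latency_ms(path, label):
    # Assumed CSV layout: each row is "<label>,<elapsed_seconds>"
    times_ms = []
    with open(path) as f:
        for row in csv.reader(f):
            if row and row[0] == label:
                times_ms.append(float(row[1]) * 1000.0)  # seconds -> milliseconds
    return statistics.median(times_ms)

# e.g. median_latency_ms("timings.csv", "trt_infer_objectdetection")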

Additional Information

I realize that, in the above, I only gave numbers for object detection. Here are some brief stats on the preprocessing:

Batch 1 pre-processor: avg request latency 2.374 ms, while timing pre-processor inference in our application code gives a median average of 21.94 ms.

Batch 8 pre-processor: avg request latency 10.749 ms, while timing pre-processor inference in our application code gives a median average of 238.028 ms.

We are using the tritonserver:22.08-py3 container and the DALI version supported by it, which is 1.16.0.

Summary

For me, the biggest mystery is the perf_analyzer result of 10.749 ms at batch size 8 versus the application (client side) sending a batch of 8 and being much slower at 238.028 ms. I would expect batching to make it quicker.

szalpal commented 1 year ago

@hly0025 ,

Thank you for the thorough analysis. I got somewhat confused by all the numbers you provided, so I've put them in a table. Could you please verify whether all these numbers and descriptions are correct according to your data?

Numbers

| batch_size | Application end-to-end Mean [ms] | Application end-to-end Median [ms] | Application preprocessing Mean [ms] | Application preprocessing Median [ms] | perf_analyzer Throughput [inf/s] | perf_analyzer p50 Latency [ms] |
| -- | -- | -- | -- | -- | -- | -- |
| 1 | 4.62 | 9.02 | 2.37 | 21.9 | 92 | 22 |
| 8 | 10.7 | 86.3 | 10.7 | 238 | 170 | 86.7 |

I'm especially confused about the `238 ms` and `9.02` measurements. It appears that, apart from these two, everything else makes perfect sense, and I'll explain it in the next paragraphs. For now I'll skip the `238 ms` and `9.02`, but please provide more details about them.

Analysis

First of all, let's note that the `Application` and `perf_analyzer` results **are consistent with each other**:

1. For `batch_size=1`, Application preprocessing `median=21.9` while perf_analyzer `median=22`,
2. For `batch_size=8`, Application preprocessing `median=86.3` while perf_analyzer `median=86.7`.

Secondly, we shall also note that the Application measurements are **latency** measurements. While perf_analyzer also provides throughput, it is not measured by the Application with the code snippet you've provided.

So the remaining question is, *why is there no perf improvement when using `batch_size=8` with DALI?* Actually, **there is a 2x improvement!** Assuming we are inferencing with `batch_size=1`, the median latency of a single sample is just what the number gives, i.e. `22 ms`. However, assuming we have `batch_size=8`, the approximate median latency of a single sample is the measured value **divided by the batch size**, therefore about `latency/batch_size = 86/8 = 10.8 ms`. As we can see, this number is about 2 times smaller than the single-sample latency when using `batch_size=1`.

Would this analysis be reasonable with regards to your environment and requirements? Please let me know if you have any questions. Also, in case something in my analysis looks incorrect, it would be great if you could clarify the two measurements I mentioned above.
hly0025 commented 1 year ago

@szalpal

Thanks for your reply and thorough remarks. I appreciate your patience in combing through my explanation to understand the issue. I agree that putting the numbers into a table is a good idea. Here, as requested, is my review of the numbers. Your analysis is very close, but an important distinction is that on the application side I timed the inference for object detection and pre-processing separately.

For now, I summarize the data again, and I hope this better conveys the information:

Timing Numbers

Perf-analyzer Results

Here are the results obtained using perf analyzer using the triton configs I shared above:

| batch_size | Throughput [inf/sec] | p50 latency [ms] | p95 latency [ms] | object-detection latency [ms] | preprocessing latency [ms] |
| -- | -- | -- | -- | -- | -- |
| 1 | 91.9781 | 21.98 | 29.972 | 4.623 | 2.374 |
| 8 | 170.247 | 86.67 | 130.034 | 7.179 | 10.749 |

Application Results

Here are the results obtained using the timing code via triton.infer. As stated above, I timed the object detection and pre-processing separately.

| Batch Size | Object Detection Median Avg [ms] | Preprocessing Median Avg [ms] |
| -- | -- | -- |
| 1 | 9.017 | 21.94 |
| 8 | 86.335 | 238.028 |

Summary

The real confusion for me is that the application does not perform as I would expect at batch size 8. I hope that separating the results obtained via perf_analyzer from those obtained from the application makes things clearer. In my mind, the pre-processing is taking a lot of time and I am not sure why.

To briefly recap, the application is guaranteed to always send a batch of size 8 from the client side to Triton. Given the configs, my understanding is that Triton should process this client-side batch of 8 as one batch and send the results back. However, at least where pre-processing is concerned, something seems to be off.
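
For context, a minimal sketch of how such a batched request can be built with the Triton gRPC client (the tensor name and dtype follow the ensemble config above; the server address and the placeholder data are assumptions):

import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient(url="localhost:8001")  # assumed address

# One request carrying a full batch of 8 HWC uint8 frames ("frame" input of the ensemble)
frames = np.zeros((8, 1080, 1920, 3), dtype=np.uint8)  # placeholder data

inputs = [grpcclient.InferInput("frame", list(frames.shape), "UINT8")]
inputs[0].set_data_from_numpy(frames)
outputs = [grpcclient.InferRequestedOutput("yolo_detection_boxes")]

results = client.infer("ensemble", inputs=inputs, outputs=outputs)
boxes = results.as_numpy("yolo_detection_boxes")  # shape (8, 100, 4)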

Thanks kindly again for your thorough remarks and response.

PS - I can time the ensemble (doing it all at once) if you like. However, my hope is that digging into the pre-processor and object detection separately helps with the diagnostics, so to speak.

szalpal commented 1 year ago

@hly0025 ,

Thank you for clarifying the numbers. To be frank, I'm more inclined to trust the perf_analyzer measurements, and they actually look promising (2 ms for batch_size=1 vs 10 ms for batch_size=8 is a nice gain).

Could we take some time to verify whether the Application measurements are reliable? perf_analyzer by default runs multiple iterations until the time measurements are stable enough. Could you tell how many inference iterations you ran when taking these measurements? Also, it is natural for the first few iterations to be slower because of the memory allocations that happen underneath. Are you doing a warmup before running the performance test? Could you also provide a bit more statistics? You've measured the median; is it possible to also measure the average and standard deviation? The more data you provide, the higher the chance we have of finding the root cause of the discrepancy between the results.
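
As a rough sketch of what such a measurement could look like on the client side (the warmup count, iteration count, and the run_infer callable are illustrative assumptions):

import time
import statistics

def benchmark(run_infer, warmup=20, iterations=200):
    # run_infer: zero-argument callable performing one triton_client.infer() call
    for _ in range(warmup):  # let allocations and caches settle first
        run_infer()

    latencies_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_infer()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "mean_ms": statistics.mean(latencies_ms),
        "stdev_ms": statistics.stdev(latencies_ms),
        "median_ms": statistics.median(latencies_ms),
    }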

hly0025 commented 1 year ago

@szalpal

Thank you for your reply, that makes sense.

Brief Recap

On the application side (client side), we set the batch size to 8. This ensures that we are sending data in batches of 8 to triton_client.infer. For iterations, I can count how many observations we have in our CSV timing file if you feel that would be useful. I'm not sure what you mean by a "warmup", but to address that issue, and also your excellent point about outliers, we timed it for approximately 15 minutes.

  start_timer = time.perf_counter()
  batch_results = self.triton_client.infer(
                self.triton_config.model_name,
                inputs=inputs,
                outputs=outputs,
                client_timeout=None,
                compression_algorithm=None,
            )
  end = time.perf_counter()
  self.write_to_csv("trt_infer_objectdetection", end - start_timer, tic_filename)

For statistics, please see the min, max, median, IQR, and 25th and 75th percentiles below.

| Method | Min | Max | Median | IQR | 25th Percentile | 75th Percentile |
| -- | -- | -- | -- | -- | -- | -- |
| Object Detection | 58.729 ms | 175.925 ms | 86.335 ms | 22.646 ms | 73.186 ms | 95.831 ms |
| Preprocessing | 159.745 ms | 897.239 ms | 238.028 ms | 69.040 ms | 214.081 ms | 283.122 ms |

Summary

I can provide the standard deviation and average if desired, but I hope the min, max, median, and IQR address the core of what you need to assess the application side of things more clearly. Admittedly, if you'll pardon the colloquial English expression, this is somewhat apples (application timing) to oranges (perf_analyzer). Nevertheless, I believe the preprocessing is still slower than I would anticipate based on perf_analyzer.
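
Spelling out the per-sample arithmetic behind that expectation, using the medians above (an illustrative calculation from the numbers already posted, not a new measurement):

application preprocessing, batch 8: 238.028 ms / 8 ≈ 29.8 ms per sample (vs. 21.94 ms per sample at batch 1)
perf_analyzer preprocessing, batch 8: 10.749 ms / 8 ≈ 1.34 ms per sample (vs. 2.374 ms per sample at batch 1)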

Thank you for your remarks and questions.

hly0025 commented 1 year ago

A gentle inquiry, @szalpal: is there a status update or any further thoughts on this? Thanks kindly in advance!

qihang720 commented 11 months ago

I also use DALI for my model preprocessing. No matter how the parameters are adjusted, DALI's throughput does not improve and tops out at about 750 infer/sec. However, if I use nvJPEGDecMultipleInstances for decoding, the decoding throughput can reach about 2100 images/sec.

I am using the COCO/val2017 dataset and running it on an A10.

dali pipe

import nvidia.dali as dali
import nvidia.dali.types as types


def preprocessing(images, device='gpu'):
    # JPEG decode with the "mixed" device: CPU parsing + GPU decoding (nvJPEG)
    images_ori = dali.fn.decoders.image(
        images, device="mixed", output_type=types.BGR)
    return images_ori


@dali.pipeline_def(batch_size=32, num_threads=32, device_id=0)
def pipe():
    # Encoded JPEG bytes arrive from Triton via the "encoded" input
    images = dali.fn.external_source(
        device="cpu", name="encoded", no_copy=True)
    return preprocessing(images)
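
For the DALI backend to load this pipeline, it also has to be serialized into the model repository (or decorated with the Triton autoserializer); a minimal sketch, assuming a model directory such as dali_preprocess_yolo/1/model.dali:

# Build the pipeline and write the serialized graph where the DALI backend expects it.
# The target path is an assumption based on the config's model name.
pipe().serialize(filename="dali_preprocess_yolo/1/model.dali")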

dali config.pbtxt

name: "dali_preprocess_yolo"
backend: "dali"
max_batch_size: 32
input [
  {
    name: "encoded"
    data_type: TYPE_UINT8
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]

output [
  {
    name: "original"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3]
  }
]

dynamic_batching {
  preferred_batch_size: [32]
  max_queue_delay_microseconds: 100000
}

parameters: [
  {
    key: "num_threads"
    value: { string_value: "32" }
  }
]

instance_group [
    {
      count: 4
      kind: KIND_GPU
      gpus: [ 0 ]
    }
]

perf_analyzer parameters:

perf_analyzer -i grpc -u $HTTP_ADDR -p$TIME_WINDOW -m bls_async_pre1 --input-data dataset.json --concurrency-range=64:192:64

result

*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 10000 msec
  Latency limit: 0 msec
  Concurrency limit: 192 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client: 
    Request count: 27480
    Throughput: 755.607 infer/sec
    Avg latency: 84710 usec (standard deviation 2205 usec)
    p50 latency: 30150 usec
    p90 latency: 40957 usec
    p95 latency: 589347 usec
    p99 latency: 624066 usec
    Avg gRPC time: 84695 usec ((un)marshal request/response 30 usec + response wait 84665 usec)
  Server: 
    Inference count: 27489
    Execution count: 866
    Successful request count: 27489
    Avg request latency: 103727 usec (overhead 24 usec + queue 8277 usec + compute input 97 usec + compute infer 8180 usec + compute output 87148 usec)

Request concurrency: 128
  Client: 
    Request count: 26264
    Throughput: 721.068 infer/sec
    Avg latency: 174870 usec (standard deviation 21386 usec)
    p50 latency: 51439 usec
    p90 latency: 677091 usec
    p95 latency: 698442 usec
    p99 latency: 783479 usec
    Avg gRPC time: 174866 usec ((un)marshal request/response 50 usec + response wait 174816 usec)
  Server: 
    Inference count: 26163
    Execution count: 825
    Successful request count: 26163
    Avg request latency: 198526 usec (overhead 29 usec + queue 30932 usec + compute input 123 usec + compute infer 13009 usec + compute output 154433 usec)

Request concurrency: 192
  Client: 
    Request count: 27668
    Throughput: 756.204 infer/sec
    Avg latency: 252569 usec (standard deviation 19340 usec)
    p50 latency: 78878 usec
    p90 latency: 701562 usec
    p95 latency: 720703 usec
    p99 latency: 1253034 usec
    Avg gRPC time: 252548 usec ((un)marshal request/response 52 usec + response wait 252496 usec)
  Server: 
    Inference count: 27680
    Execution count: 865
    Successful request count: 27680
    Avg request latency: 280895 usec (overhead 29 usec + queue 117216 usec + compute input 123 usec + compute infer 13554 usec + compute output 149972 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 755.607 infer/sec, latency 84710 usec
Concurrency: 128, throughput: 721.068 infer/sec, latency 174870 usec
Concurrency: 192, throughput: 756.204 infer/sec, latency 252569 usec

nvJPEGDecMultipleInstances parameters:

./nvJPEGDecMultipleInstances -i /mnt/share/2/dataset/coco/images/val2017/ -j 16 -b 16 -batched -t 10000 -w 100 -fmt unchanged

nvJPEGDecMultipleInstances result:

Total decoding time: 4.69204 (s)
Avg decoding time per image: 0.000469204 (s)
Avg images per sec: 2131.27
Avg decoding time per batch: 0.00750727 (s)
params.num_threads: 16
params.batch_size: 16

szalpal commented 11 months ago

@qihang720 ,

In the snippet you've provided (perf_analyzer parameters), I see you're benchmarking the bls_async_pre1 model, not the dali_preprocess_yolo model. Could you double-check whether the numbers you've provided are for the dali_preprocess_yolo model?
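
For comparison, pointing the same sweep at the DALI model would presumably look like this (reusing the parameters from your earlier command):

perf_analyzer -i grpc -u $HTTP_ADDR -p $TIME_WINDOW -m dali_preprocess_yolo --input-data dataset.json --concurrency-range=64:192:64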

qihang720 commented 11 months ago

I'm sorry for my delayed response.

I checked my command; previously I used bls_async_pre1 for the benchmark, which runs the DALI pipeline through the Python backend.

The next result uses the DALI backend. The DALI backend is faster than the Python backend, but GPU utilization never reaches its maximum.

Successfully read data for 1 stream/streams with 5000 step/steps.
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "time_windows" mode for stabilization
  Measurement window: 100000 msec
  Latency limit: 0 msec
  Concurrency limit: 192 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 64
  Client: 
    Request count: 314744
    Throughput: 873.201 infer/sec
    Avg latency: 73211 usec (standard deviation 5031 usec)
    p50 latency: 33205 usec
    p90 latency: 53482 usec
    p95 latency: 429073 usec
    p99 latency: 545507 usec
    Avg gRPC time: 73189 usec ((un)marshal request/response 41 usec + response wait 73148 usec)
  Server: 
    Inference count: 314713
    Execution count: 13860
    Successful request count: 314713
    Avg request latency: 83935 usec (overhead 27 usec + queue 5470 usec + compute input 111 usec + compute infer 10189 usec + compute output 68137 usec)

Request concurrency: 128
  Client: 
    Request count: 320336
    Throughput: 888.4 infer/sec
    Avg latency: 144054 usec (standard deviation 4045 usec)
    p50 latency: 53815 usec
    p90 latency: 502949 usec
    p95 latency: 530911 usec
    p99 latency: 642329 usec
    Avg gRPC time: 144028 usec ((un)marshal request/response 52 usec + response wait 143976 usec)
  Server: 
    Inference count: 320296
    Execution count: 11586
    Successful request count: 320296
    Avg request latency: 160229 usec (overhead 33 usec + queue 6215 usec + compute input 155 usec + compute infer 13166 usec + compute output 140659 usec)

Request concurrency: 192
  Client: 
    Request count: 305806
    Throughput: 848.006 infer/sec
    Avg latency: 226562 usec (standard deviation 6402 usec)
    p50 latency: 79282 usec
    p90 latency: 619427 usec
    p95 latency: 662947 usec
    p99 latency: 760955 usec
    Avg gRPC time: 226536 usec ((un)marshal request/response 60 usec + response wait 226476 usec)
  Server: 
    Inference count: 305762
    Execution count: 9670
    Successful request count: 305762
    Avg request latency: 251549 usec (overhead 38 usec + queue 17388 usec + compute input 199 usec + compute infer 16426 usec + compute output 217497 usec)

Inferences/Second vs. Client Average Batch Latency
Concurrency: 64, throughput: 873.201 infer/sec, latency 73211 usec
Concurrency: 128, throughput: 888.4 infer/sec, latency 144054 usec
Concurrency: 192, throughput: 848.006 infer/sec, latency 226562 usec
