triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

When using the crop_mirror_normalize function, output layout "CHW" is slower than "HWC" #216

Open qihang720 opened 12 months ago

qihang720 commented 12 months ago
  1. Output layout is "CHW", profiled with perf_analyzer:

    import nvidia.dali as dali
    import nvidia.dali.types as types
    from nvidia.dali.plugin.triton import autoserialize

    @autoserialize
    @dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
    def pipe():
        images = dali.fn.external_source(device="cpu", name="encoded")
        images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
        images = dali.fn.resize(images, resize_x=299, resize_y=299)
        images = dali.fn.crop_mirror_normalize(images,
                                               dtype=types.FLOAT,
                                               output_layout="CHW",
                                               crop=(299, 299),
                                               mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                               std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        return images

    (screenshot: perf_analyzer results for "CHW")

  2. Output layout is "HWC", same pipeline otherwise:

    import nvidia.dali as dali
    import nvidia.dali.types as types
    from nvidia.dali.plugin.triton import autoserialize

    @autoserialize
    @dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
    def pipe():
        images = dali.fn.external_source(device="cpu", name="encoded")
        images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
        images = dali.fn.resize(images, resize_x=299, resize_y=299)
        images = dali.fn.crop_mirror_normalize(images,
                                               dtype=types.FLOAT,
                                               output_layout="HWC",
                                               crop=(299, 299),
                                               mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                               std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        return images

    (screenshot: perf_analyzer results for "HWC")

Most of the time the model input layout is "NCHW". Is there any way to improve performance?
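(As an aside, the mean/std lists in the pipelines above are the standard ImageNet statistics, usually quoted for inputs scaled to [0, 1], multiplied by 255 so they apply directly to the decoder's uint8 range. A quick pure-Python sanity check of that equivalence, independent of DALI:)

```python
# ImageNet normalization stats as usually quoted for [0, 1] inputs.
mean_01 = [0.485, 0.456, 0.406]
std_01 = [0.229, 0.224, 0.225]

# crop_mirror_normalize computes (x - mean) / std on the raw [0, 255] pixels,
# so scaling both stats by 255 matches rescaling the pixels to [0, 1] first.
def normalize_255(x, c):
    return (x - mean_01[c] * 255) / (std_01[c] * 255)

def normalize_01(x, c):
    return (x / 255 - mean_01[c]) / std_01[c]

for c in range(3):
    for x in (0, 76, 255):
        assert abs(normalize_255(x, c) - normalize_01(x, c)) < 1e-12
print("equivalent")
```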

JanuszL commented 12 months ago

Hi @qihang720,

Can you provide more details about the environment you are using to run your tests? Can you reproduce similar numbers using DALI as a standalone library? We recently introduced a couple of optimizations to crop_mirror_normalize, so updating the Triton version to the latest is a good first step to confirm whether your use case has improved.

qihang720 commented 12 months ago

> Hi @qihang720,
>
> Can you provide more details about the environment you are using to run your tests? Can you reproduce similar numbers using DALI as a standalone library? We recently introduced a couple of optimizations to crop_mirror_normalize, so updating the Triton version to the latest is a good first step to confirm whether your use case has improved.

I used nvcr.io/nvidia/tritonserver:23.05-py3 as my working environment. DALI version: nvidia-dali-cuda110 1.29.0.

I'm not sure how to profile DALI alone, because every input image has a different encoded length. Triton can batch these dynamic shapes for me when I add the ragged_batches option.

I will test with the new version later.
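(For the standalone-profiling question: in Python you can pass a list of differently-sized encoded buffers to external_source via feed_input, so no padding or ragged batching is needed outside Triton. A rough timing sketch, assuming DALI is installed with a GPU available; the `images/*.jpg` path and iteration count are placeholders:)

```python
import time
from glob import glob

import numpy as np
from nvidia.dali import fn, pipeline_def, types

@pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe(layout):
    images = fn.external_source(device="cpu", name="encoded")
    images = fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=299, resize_y=299)
    return fn.crop_mirror_normalize(images,
                                    dtype=types.FLOAT,
                                    output_layout=layout,
                                    crop=(299, 299),
                                    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                    std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

# Encoded JPEGs of different sizes -- a plain Python list is fine here.
files = sorted(glob("images/*.jpg"))[:3]
batch = [np.fromfile(f, dtype=np.uint8) for f in files]

for layout in ("CHW", "HWC"):
    p = pipe(layout)
    p.build()
    p.feed_input("encoded", batch)
    p.run()  # warm-up iteration
    start = time.perf_counter()
    for _ in range(100):
        p.feed_input("encoded", batch)
        p.run()
    print(layout, "avg seconds per batch:", (time.perf_counter() - start) / 100)
```

This isolates the pipeline itself, so any CHW/HWC gap measured here is DALI's, not Triton's. (No automated test is attached since the sketch requires a GPU and local image files.)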

JanuszL commented 12 months ago

Hi @qihang720,

nvcr.io/nvidia/tritonserver:23.05-py3 uses DALI 1.25. In DALI 1.30 we made a couple of optimizations to the crop_mirror_normalize operator. Please stay tuned for Triton 23.10, which should include this DALI version. Also, the biggest gain from GPU processing shows up when you process a batch of data. Do you see similar results for bigger batches?

qihang720 commented 12 months ago

Hi @JanuszL,

Thanks for your advice; I will keep an eye on Triton 23.10.

As for batching: since every input is different, the base64-encoded length is also different, so how can I batch them together?

JanuszL commented 12 months ago

Hi @qihang720,

> As for batching: since every input is different, the base64-encoded length is also different, so how can I batch them together?

Please check if this part of our documentation answers your question.
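(For readers landing here: the relevant Triton setting is `allow_ragged_batch` on the model input, which lets the server batch requests whose inputs have different lengths. A minimal config.pbtxt sketch; the model name, `max_batch_size`, and output name are placeholders, and the output dims assume the "CHW" layout from the pipeline above:)

```
name: "dali_preprocess"
backend: "dali"
max_batch_size: 32
input [
  {
    name: "encoded"
    data_type: TYPE_UINT8
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
output [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 299, 299 ]
  }
]
```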