triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

When using the crop_mirror_normalize function, output layout "CHW" is slower than "HWC" #216

Open qihang720 opened 12 months ago

qihang720 commented 12 months ago
  1. Output layout is "CHW", profiled with perf_analyzer:

    import nvidia.dali as dali
    import nvidia.dali.types as types
    from nvidia.dali.plugin.triton import autoserialize

    @autoserialize
    @dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
    def pipe():
        images = dali.fn.external_source(device="cpu", name="encoded")
        images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
        images = dali.fn.resize(images, resize_x=299, resize_y=299)
        images = dali.fn.crop_mirror_normalize(images,
                                               dtype=types.FLOAT,
                                               output_layout="CHW",
                                               crop=(299, 299),
                                               mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                               std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        return images

    (screenshot: perf_analyzer results for "CHW")

  2. Output layout is "HWC", same pipeline otherwise:

    import nvidia.dali as dali
    import nvidia.dali.types as types
    from nvidia.dali.plugin.triton import autoserialize

    @autoserialize
    @dali.pipeline_def(batch_size=3, num_threads=1, device_id=0)
    def pipe():
        images = dali.fn.external_source(device="cpu", name="encoded")
        images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)
        images = dali.fn.resize(images, resize_x=299, resize_y=299)
        images = dali.fn.crop_mirror_normalize(images,
                                               dtype=types.FLOAT,
                                               output_layout="HWC",
                                               crop=(299, 299),
                                               mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                               std=[0.229 * 255, 0.224 * 255, 0.225 * 255])
        return images

    (screenshot: perf_analyzer results for "HWC")

Most of the time the model input layout is "NCHW". Is there any way to improve performance?
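(As an aside, the mean/std lists in the pipelines above are the standard ImageNet statistics, usually quoted for inputs scaled to [0, 1], multiplied by 255 so they apply directly to the decoder's uint8 range. A quick pure-Python sanity check of that equivalence, independent of DALI:)

```python
# ImageNet normalization stats as usually quoted for [0, 1] inputs.
mean_01 = [0.485, 0.456, 0.406]
std_01 = [0.229, 0.224, 0.225]

# crop_mirror_normalize computes (x - mean) / std on the raw [0, 255] pixels,
# so scaling both stats by 255 matches rescaling the pixels to [0, 1] first.
def normalize_255(x, c):
    return (x - mean_01[c] * 255) / (std_01[c] * 255)

def normalize_01(x, c):
    return (x / 255 - mean_01[c]) / std_01[c]

for c in range(3):
    for x in (0, 76, 255):
        assert abs(normalize_255(x, c) - normalize_01(x, c)) < 1e-12
print("equivalent")
```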

JanuszL commented 12 months ago

Hi @qihang720,

Can you provide more details about the environment you are using to run your tests? Can you reproduce similar numbers using DALI as a standalone library? We recently introduced a couple of optimizations to crop_mirror_normalize, so updating the Triton version to the latest is a good first step to confirm whether your use case has improved.

qihang720 commented 12 months ago

> Hi @qihang720,
>
> Can you provide more details about the environment you are using to run your tests? Can you reproduce similar numbers using DALI as a standalone library? We recently introduced a couple of optimizations to crop_mirror_normalize, so updating the Triton version to the latest is a good first step to confirm whether your use case has improved.

I used nvcr.io/nvidia/tritonserver:23.05-py3 as my working environment. DALI version: nvidia-dali-cuda110 1.29.0.

I'm not sure how to profile DALI alone, because every input image has a different encoded length. Triton can batch these dynamic shapes for me when I add the ragged_batches option.

I will test with the new version later.
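(For the standalone-profiling question: in Python you can pass a list of differently-sized encoded buffers to external_source via feed_input, so no padding or ragged batching is needed outside Triton. A rough timing sketch, assuming DALI is installed with a GPU available; the `images/*.jpg` path and iteration count are placeholders:)

```python
import time
from glob import glob

import numpy as np
from nvidia.dali import fn, pipeline_def, types

@pipeline_def(batch_size=3, num_threads=1, device_id=0)
def pipe(layout):
    images = fn.external_source(device="cpu", name="encoded")
    images = fn.decoders.image(images, device="mixed", output_type=types.RGB)
    images = fn.resize(images, resize_x=299, resize_y=299)
    return fn.crop_mirror_normalize(images,
                                    dtype=types.FLOAT,
                                    output_layout=layout,
                                    crop=(299, 299),
                                    mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                    std=[0.229 * 255, 0.224 * 255, 0.225 * 255])

# Encoded JPEGs of different sizes -- a plain Python list is fine here.
files = sorted(glob("images/*.jpg"))[:3]
batch = [np.fromfile(f, dtype=np.uint8) for f in files]

for layout in ("CHW", "HWC"):
    p = pipe(layout)
    p.build()
    p.feed_input("encoded", batch)
    p.run()  # warm-up iteration
    start = time.perf_counter()
    for _ in range(100):
        p.feed_input("encoded", batch)
        p.run()
    print(layout, "avg seconds per batch:", (time.perf_counter() - start) / 100)
```

This isolates the pipeline itself, so any CHW/HWC gap measured here is DALI's, not Triton's. (No automated test is attached since the sketch requires a GPU and local image files.)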

JanuszL commented 12 months ago

Hi @qihang720,

nvcr.io/nvidia/tritonserver:23.05-py3 uses DALI 1.25. In DALI 1.30 we made a couple of optimizations to the crop_mirror_normalize operator. Please stay tuned for Triton 23.10, which should include this DALI version. Also, the biggest gain from GPU processing shows up when you process a batch of data. Do you see similar results for bigger batches?

qihang720 commented 12 months ago

Hi @JanuszL,

Thanks for your advice; I will keep an eye on Triton 23.10.

As for batching: since every input is different, the base64-encoded length is also different, so how can I batch them together?

JanuszL commented 12 months ago

Hi @qihang720,

> As for batching: since every input is different, the base64-encoded length is also different, so how can I batch them together?

Please check if this part of our documentation answers your question.
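(For readers landing here: the relevant Triton setting is `allow_ragged_batch` on the model input, which lets the server batch requests whose inputs have different lengths. A minimal config.pbtxt sketch; the model name, `max_batch_size`, and output name are placeholders, and the output dims assume the "CHW" layout from the pipeline above:)

```
name: "dali_preprocess"
backend: "dali"
max_batch_size: 32
input [
  {
    name: "encoded"
    data_type: TYPE_UINT8
    dims: [ -1 ]
    allow_ragged_batch: true
  }
]
output [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 299, 299 ]
  }
]
```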