triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented with DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

dali backend device parameter setting question #155

Open frankxyy opened 2 years ago

frankxyy commented 2 years ago

The dali.py file content is as below:

import nvidia.dali as dali
from nvidia.dali.plugin.triton import autoserialize
import nvidia.dali.types as types

@autoserialize
@dali.pipeline_def(batch_size=1, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(device="cpu", name="INPUT_0")
    shape_list = dali.fn.external_source(device="cpu", name="INPUT_1")
    images = dali.fn.decoders.image(images, device="mixed", output_type=types.RGB)  # The output of the decoder is in HWC layout.
    images_converted = dali.fn.color_space_conversion(images, device="gpu", image_type=types.RGB, output_type=types.BGR)
    images = dali.fn.resize(images_converted, device="gpu",
                            resize_y=shape_list[0, 2]*shape_list[0, 0],
                            resize_x=shape_list[0, 3]*shape_list[0, 1])
    images = dali.fn.crop_mirror_normalize(images, device="gpu",
                                           dtype=types.FLOAT,
                                           output_layout="CHW",
                                           scale=1.0/255,
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229, 0.224, 0.225])

    return images, shape_list

A peculiar thing I found is that if I do not set the device parameter for the color_space_conversion, resize and crop_mirror_normalize operators, the latency jumps to 90 ms (compared to 40 ms when the device parameter is explicitly set to 'gpu'). I assumed that if the device parameter is not set, the operators would default to GPU placement, since the inputs of all three operators are already in GPU memory, but the measured results suggest my assumption may be wrong. I am wondering why this happens.
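
For reference, the slower variant is the same pipeline with the device argument simply dropped from the three operators after the mixed decoder; a sketch of the relevant lines of the pipeline body (not a separate file) is:

    # device argument omitted; DALI should infer GPU placement from the GPU input
    images_converted = dali.fn.color_space_conversion(images, image_type=types.RGB, output_type=types.BGR)
    images = dali.fn.resize(images_converted,
                            resize_y=shape_list[0, 2]*shape_list[0, 0],
                            resize_x=shape_list[0, 3]*shape_list[0, 1])
    images = dali.fn.crop_mirror_normalize(images,
                                           dtype=types.FLOAT,
                                           output_layout="CHW",
                                           scale=1.0/255,
                                           mean=[0.485 * 255, 0.456 * 255, 0.406 * 255],
                                           std=[0.229, 0.224, 0.225])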

banasraf commented 2 years ago

Hi @frankxyy. As you assume, adding the device='gpu' argument to those operators shouldn't change anything, because they receive GPU input and their placement is inferred to be the GPU. Can you tell me more about how you measured that latency? Did you use perf_analyzer or a custom script? What parameters did the measurements use?
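
For the custom-script case, a minimal sketch of an end-to-end latency measurement with the Triton Python HTTP client could look like the following; the model name dali_pipeline, the sample file, and the dtype/shape of INPUT_1 are placeholders here and would need to match the actual model configuration:

    import time
    import numpy as np
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Encoded image bytes for INPUT_0, batch of 1; placeholder values for INPUT_1.
    encoded = np.fromfile("sample.jpg", dtype=np.uint8).reshape(1, -1)
    shapes = np.array([[1.0, 1.0, 1.0, 1.0]], dtype=np.float32)

    inputs = [
        httpclient.InferInput("INPUT_0", list(encoded.shape), "UINT8"),
        httpclient.InferInput("INPUT_1", list(shapes.shape), "FP32"),
    ]
    inputs[0].set_data_from_numpy(encoded)
    inputs[1].set_data_from_numpy(shapes)

    client.infer("dali_pipeline", inputs)  # warm-up request

    # Time a number of sequential requests and report the mean end-to-end latency.
    n = 100
    start = time.perf_counter()
    for _ in range(n):
        client.infer("dali_pipeline", inputs)
    print("average end-to-end latency: %.1f ms" % ((time.perf_counter() - start) / n * 1000))

perf_analyzer would report latency percentiles and throughput directly, which is usually the easier option for this kind of comparison.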