triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

Multi-GPU configuration with DALI: Signal (11)s on Triton 21.12 during ensemble processing #116

Open natel9178 opened 2 years ago

natel9178 commented 2 years ago

Hi!

I'm trying to use DALI to preprocess images before sending them to YOLO. I have a two-GPU system that, after a few minutes of running DALI, segfaults and dies. The DALI step runs inside a Triton ensemble model that sends its output to a YOLO model compiled to TensorRT. The stack trace seems to indicate the problem is within DALI.

Repro:

The preprocessing code is the following:

import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe():
    # Frames arrive from Triton through the external source declared in config.pbtxt
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # BGR -> RGB, resize to fit within 640x384, pad with 114, HWC -> CHW, cast to float32
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB, device='gpu')
    images = dali.fn.resize(images, mode="not_larger",
                            resize_x=640, resize_y=384, device='gpu')
    images = dali.fn.crop(images, crop_w=640, crop_h=384, crop_pos_x=0, crop_pos_y=0,
                          fill_values=114, out_of_bounds_policy="pad", device='gpu')
    images = dali.fn.transpose(images, perm=[2, 0, 1], device='gpu')
    images = dali.fn.cast(images, dtype=dali.types.FLOAT, device='gpu')
    return images

pipe().serialize(filename="1/model.dali")
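
The serialized pipeline and the config below sit in the usual Triton model repository layout implied by the filename above and the model name in config.pbtxt (the repository root is whatever is passed to --model-repository):

model_repository/
└── preprocessbgr/
    ├── config.pbtxt
    └── 1/
        └── model.dali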

and config.pbtxt

name: "preprocessbgr"
backend: "dali"
max_batch_size: 64 
input [
{
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}
]

output [
{
    name: "OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 384, 640 ]
}
]
dynamic_batching { }
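
For context, the ensemble wiring is roughly the sketch below. Everything except preprocessbgr, DALI_INPUT_0, and OUTPUT_0 (i.e. the ensemble name, the YOLO model name, and its tensor names and dims) is a placeholder, not the exact config from my deployment:

name: "yolo_ensemble"          # placeholder ensemble name
platform: "ensemble"
max_batch_size: 64
input [
{
    name: "IMAGE"              # placeholder name for the raw BGR frame
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}
]
output [
{
    name: "DETECTIONS"         # placeholder for the YOLO output
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
}
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessbgr"
      model_version: -1
      input_map { key: "DALI_INPUT_0" value: "IMAGE" }
      output_map { key: "OUTPUT_0" value: "preprocessed_image" }
    },
    {
      model_name: "yolo"       # placeholder name for the TensorRT model
      model_version: -1
      input_map { key: "images" value: "preprocessed_image" }
      output_map { key: "output_0" value: "DETECTIONS" }
    }
  ]
}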

Images are sent to Triton at roughly 100 fps (batch size 1 with 32 concurrent requests), and after several rounds of processing the server throws the error below. It reproduces after a few minutes:

Signal (11) received.
 0# 0x00005572DF1FBBD9 in tritonserver
 1# 0x00007F6BFF552210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F6BF5602D40 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 3# 0x00007F6BF57383E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 4# 0x00007F6BF585EB02 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 5# 0x00007F6BF55BF2E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 6# 0x00007F6BF55BFAC4 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 7# 0x00007F6BF55C1BD5 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 8# 0x00007F6BF562FAAE in /usr/local/cuda/compat/lib.real/libcuda.so.1
 9# 0x00007F6BC80CF0D9 in /opt/tritonserver/backends/dali/libtriton_dali.so
10# 0x00007F6BC809EFED in /opt/tritonserver/backends/dali/libtriton_dali.so
11# 0x00007F6BC80F3B65 in /opt/tritonserver/backends/dali/libtriton_dali.so
12# 0x00007F6BC8098130 in /opt/tritonserver/backends/dali/libtriton_dali.so
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
14# 0x00007F6ABDAE526F in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Also, interestingly enough, running this with

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]

causes Signal (11) to show up faster (within seconds rather than minutes) when ensembling across the two GPUs.
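
For comparison, my understanding is that leaving instance_group out entirely makes Triton create one instance per visible GPU, which on this two-GPU box should expand to roughly the sketch below (I did not set this explicitly):

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0, 1 ]
    }
]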

Other things I've tried

I've tried a few other changes:

  1. Calling .gpu() on the input and setting the external source device to "cpu", which only makes the Signal (11) show up later (see the sketch after this list).
  2. Running a single GPU instance, which makes DALI play nicely and not crash.
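
The variant from point 1 looked roughly like this (a sketch from memory, not the exact file I ran; the elided operators are identical to the pipeline above):

import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe():
    # External source on the CPU this time, followed by an explicit copy to the GPU
    images = dali.fn.external_source(
        device="cpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    images = images.gpu()
    # ... color_space_conversion / resize / crop / transpose unchanged ...
    images = dali.fn.cast(images, dtype=dali.types.FLOAT, device='gpu')
    return images

pipe().serialize(filename="1/model.dali")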

Theories

  1. Perhaps the DALI backend is somehow confusing GPU tensors across the two GPUs?
  2. Since the issue takes a few minutes to occur without instance groups, cross-device scheduling (preprocessing on gpu:0 and then TensorRT on gpu:1) is likely what causes the segfaults.

Versions

NVIDIA Release 21.12 (build 30441439)

Any thoughts? I appreciate any help in advance.

szalpal commented 2 years ago

Hi @natel9178! Thanks for the thorough description of the problem. If I understand correctly, you are running 32 parallel models here. Additionally, I see these frames in the stack trace:

13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0

This would suggest a problem with the CPU threads used. Could you try reducing num_threads to 1 in the DALI pipeline definition and see if that helps? Generally, CPU operators in DALI use a thread-per-sample mapping, so if you have batch_size=1 there won't be any difference; only one thread will be used anyway.
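
Only the decorator argument needs to change; the operator chain stays exactly as in your pipeline (sketch below, body elided):

import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # ... rest of the operators exactly as before ...
    return images

pipe().serialize(filename="1/model.dali")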

If this does not help, could you provide us with a core dump or a repro that we can run on our side?