triton-inference-server / dali_backend

The Triton backend that allows running GPU-accelerated data pre-processing pipelines implemented in DALI's Python API.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
MIT License

Multi-GPU configuration with DALI: Signal (11)s on Triton 21.12 during ensemble processing #116

Open natel9178 opened 2 years ago

natel9178 commented 2 years ago

Hi!

I'm trying to use DALI to preprocess images before sending them to YOLO. I have a two-GPU system that, after a few minutes of running DALI, segfaults and dies. The DALI step runs inside a Triton ensemble model that sends its output to a YOLO model compiled to TensorRT. The stack trace seems to indicate the problem is within DALI.

Repro:

The preprocessing code is the following:

import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe():
    # Frames arrive from Triton through the external source declared in config.pbtxt
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # BGR -> RGB, resize to fit within 640x384, pad with 114, HWC -> CHW, cast to float32
    images = dali.fn.color_space_conversion(
        images, image_type=dali.types.BGR, output_type=dali.types.RGB, device='gpu')
    images = dali.fn.resize(images, mode="not_larger",
                            resize_x=640, resize_y=384, device='gpu')
    images = dali.fn.crop(images, crop_w=640, crop_h=384, crop_pos_x=0, crop_pos_y=0,
                          fill_values=114, out_of_bounds_policy="pad", device='gpu')
    images = dali.fn.transpose(images, perm=[2, 0, 1], device='gpu')
    images = dali.fn.cast(images, dtype=dali.types.FLOAT, device='gpu')
    return images

pipe().serialize(filename="1/model.dali")
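
The serialized pipeline and the config below sit in the usual Triton model repository layout implied by the filename above and the model name in config.pbtxt (the repository root is whatever is passed to --model-repository):

model_repository/
└── preprocessbgr/
    ├── config.pbtxt
    └── 1/
        └── model.dali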

and config.pbtxt

name: "preprocessbgr"
backend: "dali"
max_batch_size: 64 
input [
{
    name: "DALI_INPUT_0"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}
]

output [
{
    name: "OUTPUT_0"
    data_type: TYPE_FP32
    dims: [ 3, 384, 640 ]
}
]
dynamic_batching { }
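
For context, the ensemble wiring is roughly the sketch below. Everything except preprocessbgr, DALI_INPUT_0, and OUTPUT_0 (i.e. the ensemble name, the YOLO model name, and its tensor names and dims) is a placeholder, not the exact config from my deployment:

name: "yolo_ensemble"          # placeholder ensemble name
platform: "ensemble"
max_batch_size: 64
input [
{
    name: "IMAGE"              # placeholder name for the raw BGR frame
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
}
]
output [
{
    name: "DETECTIONS"         # placeholder for the YOLO output
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
}
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessbgr"
      model_version: -1
      input_map { key: "DALI_INPUT_0" value: "IMAGE" }
      output_map { key: "OUTPUT_0" value: "preprocessed_image" }
    },
    {
      model_name: "yolo"       # placeholder name for the TensorRT model
      model_version: -1
      input_map { key: "images" value: "preprocessed_image" }
      output_map { key: "output_0" value: "DETECTIONS" }
    }
  ]
}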

Images are sent to Triton at roughly 100 fps (batch size 1 with 32 concurrent requests), and after several rounds of processing the server throws the error below. It reproduces after a few minutes:

Signal (11) received.
 0# 0x00005572DF1FBBD9 in tritonserver
 1# 0x00007F6BFF552210 in /usr/lib/x86_64-linux-gnu/libc.so.6
 2# 0x00007F6BF5602D40 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 3# 0x00007F6BF57383E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 4# 0x00007F6BF585EB02 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 5# 0x00007F6BF55BF2E3 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 6# 0x00007F6BF55BFAC4 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 7# 0x00007F6BF55C1BD5 in /usr/local/cuda/compat/lib.real/libcuda.so.1
 8# 0x00007F6BF562FAAE in /usr/local/cuda/compat/lib.real/libcuda.so.1
 9# 0x00007F6BC80CF0D9 in /opt/tritonserver/backends/dali/libtriton_dali.so
10# 0x00007F6BC809EFED in /opt/tritonserver/backends/dali/libtriton_dali.so
11# 0x00007F6BC80F3B65 in /opt/tritonserver/backends/dali/libtriton_dali.so
12# 0x00007F6BC8098130 in /opt/tritonserver/backends/dali/libtriton_dali.so
13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
14# 0x00007F6ABDAE526F in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0
16# clone in /usr/lib/x86_64-linux-gnu/libc.so.6

Also, interestingly enough, running this with

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]

causes Signal (11) to show up faster (within seconds rather than minutes) when ensembling across the two GPUs.
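
For comparison, my understanding is that leaving instance_group out entirely makes Triton create one instance per visible GPU, which on this two-GPU box should expand to roughly the sketch below (I did not set this explicitly):

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0, 1 ]
    }
]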

Other things I've tried

I've tried a few other changes:

  1. Calling .gpu() on the input and setting the external source device to "cpu", which only makes the Signal (11) show up later (see the sketch after this list).
  2. Running a single GPU instance, which makes DALI play nicely and not crash.
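
The variant from point 1 looked roughly like this (a sketch from memory, not the exact file I ran; the elided operators are identical to the pipeline above):

import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=4, device_id=0)
def pipe():
    # External source on the CPU this time, followed by an explicit copy to the GPU
    images = dali.fn.external_source(
        device="cpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    images = images.gpu()
    # ... color_space_conversion / resize / crop / transpose unchanged ...
    images = dali.fn.cast(images, dtype=dali.types.FLOAT, device='gpu')
    return images

pipe().serialize(filename="1/model.dali")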

Theories

  1. Perhaps the DALI backend is somehow confusing GPU tensors across the two GPUs?
  2. Since the issue takes a few minutes to occur without instance groups, cross-device scheduling (preprocessing on gpu:0 and then TensorRT on gpu:1) is likely what causes the segfaults.

Versions

NVIDIA Release 21.12 (build 30441439)

Any thoughts? I appreciate any help in advance.

szalpal commented 2 years ago

Hi @natel9178! Thanks for the thorough description of the problem. If I understand correctly, you are running 32 parallel models here. Additionally, I see these frames in the stack trace:

13# dali::ThreadPool::ThreadMain(int, int, bool) in /opt/tritonserver/backends/dali/dali/libdali.so
15# 0x00007F6BFFDBE609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0

This would suggest a problem with the CPU threads used. Could you try reducing num_threads to 1 in the DALI pipeline definition and see if that helps? Generally, CPU operators in DALI use a thread-per-sample mapping, so if you have batch_size=1 there won't be any difference; only one thread will be used anyway.
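
Only the decorator argument needs to change; the operator chain stays exactly as in your pipeline (sketch below, body elided):

import nvidia.dali as dali

@dali.pipeline_def(batch_size=64, num_threads=1, device_id=0)
def pipe():
    images = dali.fn.external_source(
        device="gpu", name="DALI_INPUT_0", dtype=dali.types.UINT8)
    # ... rest of the operators exactly as before ...
    return images

pipe().serialize(filename="1/model.dali")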

If this does not help, could you provide us with a core dump or a repro that we can run on our side?