triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton Server Crash with Signal (11) #6720

Open AbelDR opened 8 months ago

AbelDR commented 8 months ago

Description: Triton Server crashed after some period of time running inferences using Python backend models. The Python backend models run TensorRT models via the mmdeploy Python API.

Triton Information: Triton version = 2.40.0, Triton container version = 23.11

Are you using the Triton container or did you build it yourself? Built it ourselves with compose.py:

python3 compose.py --backend python --container-version 23.11

To Reproduce: We are running two Python models with BLS; the first pre-processes the image and the second runs an OCR pipeline (detection + recognition). The Golang client sends synchronous gRPC requests containing an input image along with JSON-structured data.

Preprocessing model pbtxt:

backend: "python"
parameters: {
    key: "EXECUTION_ENV_PATH"
    value: {string_value: "/ksim/anaconda3/envs/cvcuda"}
}
max_batch_size: 0
input [
    {
        name: "IMAGE_IN"
        data_type: TYPE_UINT8
        format: FORMAT_NHWC
        dims: [ -1, -1, 3 ]
    },
    {
        name: "RULES"
        data_type: TYPE_UINT8
        dims: [ -1 ]
    }

]
output [
    {
        name: "OUTPUT0"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]
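
For context, the RULES input carries the JSON document as raw UTF-8 bytes, so the Python model decodes it before use. A minimal sketch of that step (helper names are illustrative, not our actual model code):

import json
import numpy as np
import triton_python_backend_utils as pb_utils

def parse_rules(request):
    # RULES is a 1-D UINT8 tensor holding the raw JSON bytes sent by the client.
    rules_tensor = pb_utils.get_input_tensor_by_name(request, "RULES")
    rules_json = rules_tensor.as_numpy().tobytes().decode("utf-8")
    return json.loads(rules_json)

def make_string_output(values):
    # TYPE_STRING outputs are numpy object arrays of str/bytes values.
    return pb_utils.Tensor("OUTPUT0", np.array(values, dtype=np.object_))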

OCR Pipeline model pbtxt:

backend: "python"
parameters: {
    key: "EXECUTION_ENV_PATH"
    value: {string_value: "/ksim/anaconda3/envs/cvcuda"}
}
max_batch_size: 0
input [
    {
        name: "INPUT0"
        data_type: TYPE_UINT8
        format: FORMAT_NHWC
        dims: [ -1, -1, 3 ]
    },
    {
        name: "RULES"
        data_type: TYPE_UINT8
        dims: [ -1 ]
    }
]
output [
    {
        name: "OUTPUT0"
        data_type: TYPE_UINT8
        dims: [ 1 ]
    },
    {
        name: "OUTPUT1"
        data_type: TYPE_STRING
        dims: [ -1 ]
    }
]
instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
]
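
Both models issue BLS calls from their execute() methods. The async pattern looks roughly like the sketch below; the downstream model name text_det_trt and its tensor names are placeholders, not the real pipeline:

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    async def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "INPUT0")

            # Async BLS call to a downstream TensorRT model (placeholder name).
            infer_request = pb_utils.InferenceRequest(
                model_name="text_det_trt",
                requested_output_names=["DET_BOXES"],
                inputs=[image],
            )
            infer_response = await infer_request.async_exec()
            if infer_response.has_error():
                raise pb_utils.TritonModelException(infer_response.error().message())

            boxes = pb_utils.get_output_tensor_by_name(infer_response, "DET_BOXES").as_numpy()

            # The real pipeline runs recognition on each detected box; here we
            # just fill the declared outputs with placeholder values.
            status = pb_utils.Tensor("OUTPUT0", np.array([1], dtype=np.uint8))
            texts = pb_utils.Tensor("OUTPUT1", np.array(["text"] * len(boxes), dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[status, texts]))
        return responses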

Crash logs

This stress test ran from 01:00 am to 06:53 am, with approximately 250K inferences in total. We have hit the same problem with different numbers of inferences and run durations. This problem did not occur with Triton 22.12; we updated Triton to improve robustness and reliability.

triton-staging  | Signal (11) received.
triton-staging  | Signal (11) received.
triton-staging  |  0# 0x000055A0FF52C67D in tritonserver
triton-staging  |  1# 0x00007FCC5718D520 in /usr/lib/x86_64-linux-gnu/libc.so.6
triton-staging  |  2# 0x00007FCC572EBB30 in /usr/lib/x86_64-linux-gnu/libc.so.6
triton-staging  |  3# 0x00007FCC43BBF4CE in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  |  4# 0x00007FCC43BCE0E8 in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  |  5# 0x00007FCC43BD50D0 in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  |  6# 0x00007FCC43B4E429 in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  |  7# 0x00007FCC43B51C08 in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  |  8# 0x00007FCC43B58406 in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  |  9# TRITONBACKEND_ModelInstanceExecute in /opt/tritonserver/backends/python/libtriton_python.so
triton-staging  | 10# 0x00007FCC57B82D04 in /opt/tritonserver/bin/../lib/libtritonserver.so
triton-staging  | 11# 0x00007FCC57B8306B in /opt/tritonserver/bin/../lib/libtritonserver.so
triton-staging  | 12# 0x00007FCC57C93D2D in /opt/tritonserver/bin/../lib/libtritonserver.so
triton-staging  | 13# 0x00007FCC57B87494 in /opt/tritonserver/bin/../lib/libtritonserver.so
triton-staging  | 14# 0x00007FCC5744F253 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6
triton-staging  | 15# 0x00007FCC571DFAC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
triton-staging  | 16# 0x00007FCC57271660 in /usr/lib/x86_64-linux-gnu/libc.so.6

(Screenshots attached: issue01, issue02, issue03)

Expected behavior: We expect a robust inference server that does not crash during long runs.

PauloFavero commented 8 months ago

I am also experiencing this issue after upgrading Triton. Our team uses Triton in a production environment, and this is having a huge impact on our product.

tanmayv25 commented 8 months ago

We have made a couple of fixes in the r23.12 version of the Python backend concerning the BLS pipeline. See here: https://github.com/triton-inference-server/python_backend/commits/r23.12/ Can you verify whether the issue is reproducible with the 23.12 release?
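
If you built the image with compose.py as above, rebuilding against 23.12 should only require bumping the container version:

python3 compose.py --backend python --container-version 23.12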

AbelDR commented 7 months ago

We have made a couple of fixes in the r23.12 version of the Python backend concerning the BLS pipeline. See here: https://github.com/triton-inference-server/python_backend/commits/r23.12/ Can you verify whether the issue is reproducible with the 23.12 release?

We tested the new r23.12 release and the same problem occurred. I ran some tests avoiding async BLS by converting the models to synchronous BLS calls, and they work without problems; however, with this approach we now have an inference-latency problem.
In the past (Triton r21.09) we used async BLS models without problems, but now we have to use the latest stable Triton version.
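
For reference, what avoided the crash is essentially replacing the awaited BLS call with its blocking counterpart (sketch only, not our exact code):

# Async BLS (crashes after long runs in our setup):
infer_response = await infer_request.async_exec()

# Sync BLS (stable so far, but execute() now blocks on every call, which hurts latency):
infer_response = infer_request.exec()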

Thanks in advance!

tanmayv25 commented 7 months ago

Can you share a simple reproducer model repository and client that we can use?

AbelDR commented 7 months ago

Yes, sure! Monitoring docker stats, Triton crashes even without any memory overflow. We tested the Triton server + client on a Core i7 (12th gen) + RTX 3060 + 16 GB RAM.
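
Memory was watched during the runs with plain docker stats, e.g.:

docker stats tritonserver --no-stream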

Triton Server

Triton server Dockerfile

ARG BUILD_PROD
ARG BUILD_ENV=${BUILD_PROD:+prod}

ARG TRITON_VERSION=2.19.0
ARG TRITON_CONTAINER_VERSION=22.12
ARG CV_VERSION=4.8.1
ARG CUDATOOLKIT_VERSION=11.8.0
ARG CUDATOOLKIT_VERSION

FROM nvcr.io/nvidia/tritonserver:23.12-py3 as prod

ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX 8.6"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
ENV FORCE_CUDA="1"
ENV DEBIAN_FRONTEND=noninteractive
# ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
# ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
ENV PATH /opt/tritonserver/bin:${PATH}
ENV TF_ADJUST_HUE_FUSED=1
ENV TF_ADJUST_SATURATION_FUSED=1
ENV TF_ENABLE_WINOGRAD_NONFUSED=1
ENV TF_AUTOTUNE_THRESHOLD=2
ENV DCGM_VERSION 2.2.9
# Create a user that can be used to run triton as
# non-root. Make sure that this user to given ID 1000. All server
# artifacts copied below are assign to this user.
ENV TRITON_SERVER_USER=triton-server
ENV pplcv_DIR=/root/workspace/pplcv/lib/cmake/ppl
ENV ONNXRUNTIME_DIR=/root/workspace/onnxruntime
ENV LD_LIBRARY_PATH=/root/workspace/onnxruntime/lib:$LD_LIBRARY_PATH
ENV TENSORRT_DIR=/root/workspace/tensorrt
ENV LD_LIBRARY_PATH=/root/workspace/tensorrt/lib:$LD_LIBRARY_PATH

ARG CUDA=11.8
ARG TORCH_VERSION="2.0.0+cu118"
ARG TORCHVISION_VERSION="0.15.0+cu118"
ARG ONNXRUNTIME_VERSION=1.15.1
ARG PPLCV_VERSION=0.7.0
ARG MMCV_VERSION="2.0.1"
ARG MMENGINE_VERSION="0.9.1"
ARG MIM_VERSION="0.3.9"
ARG PYTHONNOUSERSITE=True

RUN userdel tensorrt-server > /dev/null 2>&1 || true &&     if ! id -u $TRITON_SERVER_USER > /dev/null 2>&1 ; then         useradd $TRITON_SERVER_USER;     fi &&     [ `id -u $TRITON_SERVER_USER` -eq 1000 ] &&     [ `id -g $TRITON_SERVER_USER` -eq 1000 ]

WORKDIR /root

# install packages
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub &&\
    apt-get update &&\
    apt-get install -y \
    rapidjson-dev \
    libopencv-dev \
    pkg-config \
    net-tools \
    libsm6 \
    libxext6 \
    libhdf5-dev \
    libgl1-mesa-dev \
    libxrender-dev \
    libzbar-dev \
    libzbar0 \
    libgstreamer1.0-dev \
    libgstreamer-plugins-base1.0-dev \
    libtiff-dev \
    libtbb-dev

WORKDIR /opt/workspace

RUN ln /usr/bin/python3 /usr/bin/python

# Extra defensive wiring for CUDA Compat lib
RUN ln -sf ${_CUDA_COMPAT_PATH}/lib.real ${_CUDA_COMPAT_PATH}/lib  && echo ${_CUDA_COMPAT_PATH}/lib > /etc/ld.so.conf.d/00-cuda-compat.conf  && ldconfig  && rm -f ${_CUDA_COMPAT_PATH}/lib 

USER root

WORKDIR /models

docker-compose.yml

version: "3.0"
services:
  tritonserver:
    container_name: tritonserver
    build:
      context: .
      dockerfile: docker/tritonserver/Dockerfile
      target: prod
    privileged: true
    deploy:
      resources:
        limits:
          memory: 8G
        reservations:
          devices:
            - driver: nvidia
              capabilities: [ gpu ]
    environment:
      - PYTHONUNBUFFERED=no_buffer
      - PYTHONDONTWRITEBYTECODE=1
      - log-verbose=4
    healthcheck:
      test: curl --fail  triton:8000/v2/health/ready || exit 1
      interval: 5s
      timeout: 5s
      retries: 3
      start_period: 5s
    ipc: host
    network_mode: host
    expose:
      - 8000
      - 8001
      - 8002
    ports:
      - 127.0.0.1:8000:8000
      - 127.0.0.1:8002:8002
      - 127.0.0.1:8003:8003
    ulimits:
      stack: 67108864
      memlock: -1

    volumes:
      - ../temporales_triton/modelstemp/issue:/models

    command: bash -c "tritonserver --log-verbose=2 --log-error=True --model-repository=/models/"

Models: Google Drive link

Client

Client Dockerfile

FROM  nvcr.io/nvidia/tritonserver:23.12-py3-sdk

WORKDIR /app

RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub

RUN apt update && apt install -y libb64-dev ffmpeg

COPY requirements_client.txt requirements_client.txt

RUN pip3 install --upgrade pip && pip3 install numpy opencv-python

Clients docker-compose.yml


version: "3.0"
services:
  tritonclient_1a:
      container_name: tritonclient_1a
      build:
        context: .
        dockerfile: docker/tritonclient/Dockerfile
      privileged: true
      shm_size: '2gb'
      deploy:
        resources:
          limits:
            memory: 512M
          reservations:
            devices:
              - driver: nvidia
                capabilities: [ gpu ]
      ulimits:
        stack: 67108864
        memlock: -1
      environment:
        - PYTHONUNBUFFERED=no_buffer
        - PYTHONDONTWRITEBYTECODE=1   
        - DISPLAY=$DISPLAY   
      network_mode: host
      # ipc: host
      # pid: host
      expose:
      - 8000
      - 8001
      - 8002

      volumes:
        - ../testefinal/images:/dataset
        - ./src:/app

      command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam traseira

  tritonclient_2a:
      container_name: tritonclient_2a
      build:
        context: .
        dockerfile: docker/tritonclient/Dockerfile
      privileged: true
      shm_size: '2gb'
      deploy:
        resources:
          limits:
            memory: 512M
          reservations:
            devices:
              - driver: nvidia
                capabilities: [ gpu ]
      ulimits:
        stack: 67108864
        memlock: -1
      environment:
        - PYTHONUNBUFFERED=no_buffer
        - PYTHONDONTWRITEBYTECODE=1   
        - DISPLAY=$DISPLAY   
      network_mode: host
      ipc: host
      pid: host
      expose:
      - 8000
      - 8001
      - 8002

      volumes:
        - ../testefinal/images:/dataset
        - ./src:/app

      command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam frontal

  tritonclient_3a:
      container_name: tritonclient_3a
      build:
        context: .
        dockerfile: docker/tritonclient/Dockerfile
      privileged: true
      shm_size: '2gb'
      deploy:
        resources:
          limits:
            memory: 512M
          reservations:
            devices:
              - driver: nvidia
                capabilities: [ gpu ]
      ulimits:
        stack: 67108864
        memlock: -1
      environment:
        - PYTHONUNBUFFERED=no_buffer
        - PYTHONDONTWRITEBYTECODE=1   
        - DISPLAY=$DISPLAY   
      network_mode: host
      ipc: host
      pid: host
      expose:
      - 8000
      - 8001
      - 8002

      volumes:
        - ../testefinal/images:/dataset
        - ./src:/app

      command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam base

  tritonclient_1b:
      container_name: tritonclient_1b
      build:
        context: .
        dockerfile: docker/tritonclient/Dockerfile
      privileged: true
      shm_size: '2gb'
      deploy:
        resources:
          limits:
            memory: 512M
          reservations:
            devices:
              - driver: nvidia
                capabilities: [ gpu ]
      ulimits:
        stack: 67108864
        memlock: -1
      environment:
        - PYTHONUNBUFFERED=no_buffer
        - PYTHONDONTWRITEBYTECODE=1   
        - DISPLAY=$DISPLAY   
      network_mode: host
      # ipc: host
      # pid: host
      expose:
      - 8000
      - 8001
      - 8002

      volumes:
        - ../testefinal/images:/dataset
        - ./src:/app

      command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam traseira

  tritonclient_2b:
      container_name: tritonclient_2b
      build:
        context: .
        dockerfile: docker/tritonclient/Dockerfile
      privileged: true
      shm_size: '2gb'
      deploy:
        resources:
          limits:
            memory: 512M
          reservations:
            devices:
              - driver: nvidia
                capabilities: [ gpu ]
      ulimits:
        stack: 67108864
        memlock: -1
      environment:
        - PYTHONUNBUFFERED=no_buffer
        - PYTHONDONTWRITEBYTECODE=1   
        - DISPLAY=$DISPLAY   
      network_mode: host
      ipc: host
      pid: host
      expose:
      - 8000
      - 8001
      - 8002

      volumes:
        - ../testefinal/images:/dataset
        - ./src:/app

      command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam frontal

Python client script:

import os
import argparse
import numpy as np
import sys
from builtins import range
import json
import tritonclient.grpc as grpcclient
from tritonclient import utils
from functools import partial
# import tritonclient.utils.cuda_shared_memory as shm
# import tritonclient.utils.shared_memory as shm
from tritonclient.utils import InferenceServerException
import cv2
import time
from glob import glob
from ctypes import *

if __name__ == "__main__":

    parser = argparse.ArgumentParser(description="Description of your program")
    parser.add_argument("--cam", type=str, required=True)
    args = parser.parse_args()

    ip3080 = "10.0.0.75:8001"
    iplocal = "0.0.0.0:8001"

    TRITON_GRPC_ADDR = iplocal

    try:
        triton_client = grpcclient.InferenceServerClient(url=TRITON_GRPC_ADDR, verbose=False)
    except Exception as e:
        print("channel creation failed: " + str(e))
        sys.exit(1)

    # triton_client.unregister_system_shared_memory()

    model_name = "cam_" + str(args.cam)
    model_version = "latest"

    input_data = {}

    ########ocr######
    ocr = {}
    roi_ocr = {}    
    roi_ocr ['x'] =0.61
    roi_ocr ['y'] =0.68
    roi_ocr ['width'] =0.739
    roi_ocr ['height'] =0.232
    ocr['roi']= roi_ocr

    ########barcode######
    barcode = {}
    roi_barcode = {}    
    roi_barcode ['x'] =0.389
    roi_barcode ['y'] =0.316
    roi_barcode ['width'] =0.38
    roi_barcode ['height'] =0.63
    barcode['roi']= roi_barcode

    input_data['factory']= 'V'
    input_data['ocr']= ocr
    input_data['barcode']= barcode

    input_data_json = np.array([ord(i) for i in json.dumps(input_data)],dtype=np.uint8)

    # img = cv2.imread('/dataset/base.jpg')
    file_list = glob('/dataset/*.jpg')

    file_list.sort()

    file_list=['/dataset/lenafake.jpg']

    img = cv2.imread(file_list[0])
    contador = 0

    while True:

        for imgpath in file_list:
            contador = contador + 1 
            # print('counter:',contador)

            # img = cv2.imread(imgpath)            

            t0 = time.time()

            inputs = []
            outputs = []
            input0_data = img
            inputs.append(grpcclient.InferInput("IMAGE_IN", list(img.shape), "UINT8"))   
            inputs.append(grpcclient.InferInput("RULES", [input_data_json.shape[0]], "UINT8"))

            inputs[0].set_data_from_numpy(img)
            inputs[1].set_data_from_numpy(input_data_json)

            outputs.append(grpcclient.InferRequestedOutput("OUTPUT0"))

            def callback(user_data, result, error):
                if error:
                    user_data.append(error)
                else:
                    user_data.append(result)

            user_data = []

            # Inference call
            triton_client.async_infer(
                model_name=model_name,
                inputs=inputs,
                callback=partial(callback, user_data),
                outputs=outputs,
                client_timeout=0.1,
            )
            # print('len',len(user_data) )
            time_out = 0.25
            time.sleep(time_out)

            if len(user_data) == 1:
                # Check for the errors
                if type(user_data[0]) == InferenceServerException:
                    print(user_data[0])
                    # sys.exit(1)

                # output0_data = user_data[0].as_numpy("OUTPUT0")

                print('intime')

            else:
                print('delay')

            t1 = time.time()
            print('fps',t1-t0,int(1/(t1-t0)))

Test image: lenafake.jpg
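
For anyone adapting the reproducer: the same request can also be issued with the blocking gRPC call instead of async_infer plus a sleep, e.g. (sketch, same tensor names as above):

results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
output0_data = results.as_numpy("OUTPUT0")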

dyastremsky commented 6 months ago

Thank you for the reproducer. Tanmay created a ticket to track this bug earlier.

Ref: 6021

sboudouk commented 2 months ago

Hello, any progress on this? It has been a major issue in our production environment lately.

RamonPessoa commented 1 week ago

Any progress on this?