AbelDR opened 11 months ago
I'm also experiencing this issue after upgrading Triton. Our team uses Triton in a production environment, and it's having a huge impact on our product.
We have made a couple of fixes in r23.12 version in python backend concerning BLS pipeline. See here: https://github.com/triton-inference-server/python_backend/commits/r23.12/ Can you verify if the issue is reproducible with 23.12 release?
We tested the new version r23.12 and the same problem occurred.
I ran some tests avoiding async BLS models by converting them to sync BLS models, and they work without problems. However, done this way we now have an inference-latency problem.
In the past (Triton r21.09) we used async BLS models without problems, but now we have to use the latest stable Triton version.
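For reference, the conversion we tested looks roughly like the sketch below: a minimal sync-BLS model.py where the model and tensor names ("ocr_model", "IMAGE_IN", "OUTPUT0") are placeholders, not our real pipeline. The async variant replaces exec() with await bls_request.async_exec() inside an async def execute, which is the path where we see the crashes:

import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            image = pb_utils.get_input_tensor_by_name(request, "IMAGE_IN")

            # BLS call into a downstream model ("ocr_model" is a placeholder).
            bls_request = pb_utils.InferenceRequest(
                model_name="ocr_model",
                requested_output_names=["OUTPUT0"],
                inputs=[image],
            )

            # Sync BLS: blocks until the downstream model responds.
            # The async variant would instead be:
            #     bls_response = await bls_request.async_exec()
            # inside an `async def execute(self, requests)`.
            bls_response = bls_request.exec()

            if bls_response.has_error():
                raise pb_utils.TritonModelException(bls_response.error().message())

            out = pb_utils.get_output_tensor_by_name(bls_response, "OUTPUT0")
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses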
Thanks in advance!
Can you share the simple reproducer model repository and client that we can use?
Yes, sure!
Monitoring docker stats, Triton crashes even without any memory overflow.
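The memory monitoring was done with docker stats; a hypothetical polling sketch of what we watched (the container name is taken from the compose file below, and this only shows how we monitored, it is not part of the reproducer):

import subprocess
import time

# Poll the Triton container's memory usage once per second.
# "tritonserver" is the container_name from the compose file below.
while True:
    result = subprocess.run(
        ["docker", "stats", "--no-stream", "--format", "{{.MemUsage}}", "tritonserver"],
        capture_output=True,
        text=True,
    )
    print(result.stdout.strip())
    time.sleep(1)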
We tested the Triton server + client on:
Core i7 12th gen + RTX 3060 + 16 GB RAM
Triton server Dockerfile
ARG BUILD_PROD
ARG BUILD_ENV=${BUILD_PROD:+prod}
ARG TRITON_VERSION=2.19.0
ARG TRITON_CONTAINER_VERSION=22.12
ARG CV_VERSION=4.8.1
ARG CUDATOOLKIT_VERSION=11.8.0
FROM nvcr.io/nvidia/tritonserver:23.12-py3 as prod
ENV TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0+PTX 8.6"
ENV TORCH_NVCC_FLAGS="-Xfatbin -compress-all"
ENV CMAKE_PREFIX_PATH="$(dirname $(which conda))/../"
ENV FORCE_CUDA="1"
ENV DEBIAN_FRONTEND=noninteractive
# ENV TRITON_SERVER_VERSION ${TRITON_VERSION}
# ENV NVIDIA_TRITON_SERVER_VERSION ${TRITON_CONTAINER_VERSION}
ENV PATH /opt/tritonserver/bin:${PATH}
ENV TF_ADJUST_HUE_FUSED=1
ENV TF_ADJUST_SATURATION_FUSED=1
ENV TF_ENABLE_WINOGRAD_NONFUSED=1
ENV TF_AUTOTUNE_THRESHOLD=2
ENV DCGM_VERSION 2.2.9
# Create a user that can be used to run triton as
# non-root. Make sure that this user is given ID 1000. All server
# artifacts copied below are assigned to this user.
ENV TRITON_SERVER_USER=triton-server
ENV pplcv_DIR=/root/workspace/pplcv/lib/cmake/ppl
ENV ONNXRUNTIME_DIR=/root/workspace/onnxruntime
ENV LD_LIBRARY_PATH=/root/workspace/onnxruntime/lib:$LD_LIBRARY_PATH
ENV TENSORRT_DIR=/root/workspace/tensorrt
ENV LD_LIBRARY_PATH=/root/workspace/tensorrt/lib:$LD_LIBRARY_PATH
ARG CUDA=11.8
ARG TORCH_VERSION="2.0.0+cu118"
ARG TORCHVISION_VERSION="0.15.0+cu118"
ARG ONNXRUNTIME_VERSION=1.15.1
ARG PPLCV_VERSION=0.7.0
ARG MMCV_VERSION="2.0.1"
ARG MMENGINE_VERSION="0.9.1"
ARG MIM_VERSION="0.3.9"
ARG PYTHONNOUSERSITE=True
RUN userdel tensorrt-server > /dev/null 2>&1 || true && if ! id -u $TRITON_SERVER_USER > /dev/null 2>&1 ; then useradd $TRITON_SERVER_USER; fi && [ `id -u $TRITON_SERVER_USER` -eq 1000 ] && [ `id -g $TRITON_SERVER_USER` -eq 1000 ]
WORKDIR /root
# install packages
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub &&\
apt-get update &&\
apt-get install -y \
rapidjson-dev \
libopencv-dev \
pkg-config \
net-tools \
libsm6 \
libxext6 \
libhdf5-dev \
libgl1-mesa-dev \
libxrender-dev \
libzbar-dev \
libzbar0 \
libgstreamer1.0-dev \
libgstreamer-plugins-base1.0-dev \
libtiff-dev \
libtbb-dev
WORKDIR /opt/workspace
RUN ln /usr/bin/python3 /usr/bin/python
# Extra defensive wiring for CUDA Compat lib
RUN ln -sf ${_CUDA_COMPAT_PATH}/lib.real ${_CUDA_COMPAT_PATH}/lib && echo ${_CUDA_COMPAT_PATH}/lib > /etc/ld.so.conf.d/00-cuda-compat.conf && ldconfig && rm -f ${_CUDA_COMPAT_PATH}/lib
USER root
WORKDIR /models
docker-compose.yml
version: "3.0"
services:
tritonserver:
container_name: tritonserver
build:
context: .
dockerfile: docker/tritonserver/Dockerfile
target: prod
privileged: true
deploy:
resources:
limits:
memory: 8G
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
environment:
- PYTHONUNBUFFERED=no_buffer
- PYTHONDONTWRITEBYTECODE=1
- log-verbose=4
healthcheck:
test: curl --fail triton:8000/v2/health/ready || exit 1
interval: 5s
timeout: 5s
retries: 3
start_period: 5s
ipc: host
network_mode: host
expose:
- 8000
- 8001
- 8002
ports:
- 127.0.0.1:8000:8000
- 127.0.0.1:8002:8002
- 127.0.0.1:8003:8003
ulimits:
stack: 67108864
memlock: -1
volumes:
- ../temporales_triton/modelstemp/issue:/models
command: bash -c "tritonserver --log-verbose=2 --log-error=True --model-repository=/models/"
Models: Google Drive link
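For completeness, readiness can also be checked from Python with the same tritonclient gRPC API used by the client script below (the address is an assumption matching the host-network compose setup above):

import sys

import tritonclient.grpc as grpcclient

# Address assumes the host-network compose setup above (gRPC on port 8001).
client = grpcclient.InferenceServerClient(url="0.0.0.0:8001")
if not (client.is_server_live() and client.is_server_ready()):
    sys.exit("Triton is not live/ready yet")
print("Triton is live and ready")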
Client Dockerfile
FROM nvcr.io/nvidia/tritonserver:23.12-py3-sdk
WORKDIR /app
RUN apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/3bf863cc.pub
RUN apt update && apt install -y libb64-dev ffmpeg
COPY requirements_client.txt requirements_client.txt
RUN pip3 install --upgrade pip && pip3 install numpy opencv-python
Clients docker-compose.yml
version: "3.0"
services:
tritonclient_1a:
container_name: tritonclient_1a
build:
context: .
dockerfile: docker/tritonclient/Dockerfile
privileged: true
shm_size: '2gb'
deploy:
resources:
limits:
memory: 512M
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
ulimits:
stack: 67108864
memlock: -1
environment:
- PYTHONUNBUFFERED=no_buffer
- PYTHONDONTWRITEBYTECODE=1
- DISPLAY=$DISPLAY
network_mode: host
# ipc: host
# pid: host
expose:
- 8000
- 8001
- 8002
volumes:
- ../testefinal/images:/dataset
- ./src:/app
command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam traseira
tritonclient_2a:
container_name: tritonclient_2a
build:
context: .
dockerfile: docker/tritonclient/Dockerfile
privileged: true
shm_size: '2gb'
deploy:
resources:
limits:
memory: 512M
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
ulimits:
stack: 67108864
memlock: -1
environment:
- PYTHONUNBUFFERED=no_buffer
- PYTHONDONTWRITEBYTECODE=1
- DISPLAY=$DISPLAY
network_mode: host
ipc: host
pid: host
expose:
- 8000
- 8001
- 8002
volumes:
- ../testefinal/images:/dataset
- ./src:/app
command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam frontal
tritonclient_3a:
container_name: tritonclient_3a
build:
context: .
dockerfile: docker/tritonclient/Dockerfile
privileged: true
shm_size: '2gb'
deploy:
resources:
limits:
memory: 512M
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
ulimits:
stack: 67108864
memlock: -1
environment:
- PYTHONUNBUFFERED=no_buffer
- PYTHONDONTWRITEBYTECODE=1
- DISPLAY=$DISPLAY
network_mode: host
ipc: host
pid: host
expose:
- 8000
- 8001
- 8002
volumes:
- ../testefinal/images:/dataset
- ./src:/app
command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam base
tritonclient_1b:
container_name: tritonclient_1b
build:
context: .
dockerfile: docker/tritonclient/Dockerfile
privileged: true
shm_size: '2gb'
deploy:
resources:
limits:
memory: 512M
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
ulimits:
stack: 67108864
memlock: -1
environment:
- PYTHONUNBUFFERED=no_buffer
- PYTHONDONTWRITEBYTECODE=1
- DISPLAY=$DISPLAY
network_mode: host
# ipc: host
# pid: host
expose:
- 8000
- 8001
- 8002
volumes:
- ../testefinal/images:/dataset
- ./src:/app
command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam traseira
tritonclient_2b:
container_name: tritonclient_2b
build:
context: .
dockerfile: docker/tritonclient/Dockerfile
privileged: true
shm_size: '2gb'
deploy:
resources:
limits:
memory: 512M
reservations:
devices:
- driver: nvidia
capabilities: [ gpu ]
ulimits:
stack: 67108864
memlock: -1
environment:
- PYTHONUNBUFFERED=no_buffer
- PYTHONDONTWRITEBYTECODE=1
- DISPLAY=$DISPLAY
network_mode: host
ipc: host
pid: host
expose:
- 8000
- 8001
- 8002
volumes:
- ../testefinal/images:/dataset
- ./src:/app
command: python test_triton_client_ocr_grpc_async_finaltest_traseira.py --cam frontal
Python client script:
import argparse
import json
import sys
import time
from functools import partial
from glob import glob

import cv2
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import InferenceServerException

# import tritonclient.utils.cuda_shared_memory as shm
# import tritonclient.utils.shared_memory as shm


def callback(user_data, result, error):
    # The gRPC client invokes this from its own thread; store either the
    # result or the error for the main loop to inspect.
    if error:
        user_data.append(error)
    else:
        user_data.append(result)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Triton gRPC async stress-test client")
    parser.add_argument("--cam", type=str, required=True)
    args = parser.parse_args()

    ip3080 = "10.0.0.75:8001"
    iplocal = "0.0.0.0:8001"
    TRITON_GRPC_ADDR = iplocal

    try:
        triton_client = grpcclient.InferenceServerClient(url=TRITON_GRPC_ADDR, verbose=False)
    except Exception as e:
        print("channel creation failed: " + str(e))
        sys.exit(1)

    # triton_client.unregister_system_shared_memory()
    model_name = "cam_" + str(args.cam)
    model_version = "latest"

    # Build the JSON "RULES" input: ROIs for the OCR and barcode stages.
    input_data = {
        "factory": "V",
        "ocr": {"roi": {"x": 0.61, "y": 0.68, "width": 0.739, "height": 0.232}},
        "barcode": {"roi": {"x": 0.389, "y": 0.316, "width": 0.38, "height": 0.63}},
    }
    # Serialize the JSON string into a UINT8 tensor.
    input_data_json = np.array([ord(i) for i in json.dumps(input_data)], dtype=np.uint8)

    # img = cv2.imread('/dataset/base.jpg')
    file_list = glob("/dataset/*.jpg")
    file_list.sort()
    file_list = ["/dataset/lenafake.jpg"]
    img = cv2.imread(file_list[0])

    contador = 0
    while True:
        for imgpath in file_list:
            contador += 1
            # print('counter:', contador)
            # img = cv2.imread(imgpath)
            t0 = time.time()

            inputs = []
            outputs = []
            inputs.append(grpcclient.InferInput("IMAGE_IN", list(img.shape), "UINT8"))
            inputs.append(grpcclient.InferInput("RULES", [input_data_json.shape[0]], "UINT8"))
            inputs[0].set_data_from_numpy(img)
            inputs[1].set_data_from_numpy(input_data_json)
            outputs.append(grpcclient.InferRequestedOutput("OUTPUT0"))

            user_data = []
            # Asynchronous inference call; the callback appends to user_data.
            triton_client.async_infer(
                model_name=model_name,
                inputs=inputs,
                callback=partial(callback, user_data),
                outputs=outputs,
                client_timeout=0.1,
            )

            time_out = 0.25
            time.sleep(time_out)
            if len(user_data) == 1:
                # Check for errors (a client_timeout also arrives as an exception).
                if type(user_data[0]) == InferenceServerException:
                    print(user_data[0])
                    # sys.exit(1)
                # output0_data = user_data[0].as_numpy("OUTPUT0")
                print("intime")
            else:
                print("delay")

            t1 = time.time()
            print("fps", t1 - t0, int(1 / (t1 - t0)))
Test image: lenafake.jpg
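As an aside on the script above: the fixed time.sleep both caps throughput and decides the intime/delay outcome. A variant that waits on the callback directly would look roughly like this (same tritonclient async_infer API; infer_once is a hypothetical helper, not part of our codebase):

import threading

from tritonclient.utils import InferenceServerException


def infer_once(triton_client, model_name, inputs, outputs, wait_s=0.25):
    # Hypothetical helper: issue one async_infer and wait on the callback
    # with an event instead of a fixed sleep.
    done = threading.Event()
    holder = {}

    def callback(result, error):
        holder["result"], holder["error"] = result, error
        done.set()

    triton_client.async_infer(
        model_name=model_name,
        inputs=inputs,
        callback=callback,
        outputs=outputs,
    )
    if not done.wait(timeout=wait_s):
        return None  # still pending: the 'delay' case
    if isinstance(holder.get("error"), InferenceServerException):
        print(holder["error"])
        return None
    return holder["result"].as_numpy("OUTPUT0")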
Thank you for the reproducer. Tanmay created a ticket to track this bug earlier.
Ref: 6021
Hello, any progress on this? It has been a major issue in our production lately.
any progress on this?
Getting this error on tritonserver:24.07-py3 in the async BLS layer:
Signal (11) received.
0# 0x000055CED3DDD80D in tritonserver
1# 0x00007F58C1914520 in /usr/lib/x86_64-linux-gnu/libc.so.6
2# 0x00007F58B0519D64 in /opt/tritonserver/backends/python/libtriton_python.so
3# 0x00007F58B04D746C in /opt/tritonserver/backends/python/libtriton_python.so
4# 0x00007F58B04E6D0C in /opt/tritonserver/backends/python/libtriton_python.so
5# 0x00007F58B04E790F in /opt/tritonserver/backends/python/libtriton_python.so
6# 0x00007F58B04EADDD in /opt/tritonserver/backends/python/libtriton_python.so
7# 0x00007F58C196BEE8 in /usr/lib/x86_64-linux-gnu/libc.so.6
8# 0x00007F58B04D3ACB in /opt/tritonserver/backends/python/libtriton_python.so
9# 0x00007F58B0501342 in /opt/tritonserver/backends/python/libtriton_python.so
10# 0x00007F58B04F296C in /opt/tritonserver/backends/python/libtriton_python.so
11# 0x00007F58B04F2EDD in /opt/tritonserver/backends/python/libtriton_python.so
12# 0x00007F58B04E82A4 in /opt/tritonserver/backends/python/libtriton_python.so
13# 0x00007F58C1966AC3 in /usr/lib/x86_64-linux-gnu/libc.so.6
14# clone in /usr/lib/x86_64-linux-gnu/libc.so.6
Description: Triton Server crashed after some period of time running inferences using Python backend models. The Python backend models run TensorRT models with the mmdeploy Python API.
Triton Information: Triton version = 2.40.0, Triton container version = 23.11.
Are you using the Triton container or did you build it yourself? Built with compose.py:
python3 compose.py --backend python --container-version 23.11
To Reproduce: We are running two Python models with BLS; the first pre-processes the image and the other runs an OCR pipeline (detection + recognition). The Golang client uses synchronous gRPC requests to send an input image along with JSON-structured data.
Preprocessing model pbtxt:
OCR Pipeline model pbtxt:
Crash logs
This stress test ran from 01:00 am to 06:53 am with a total of approximately 250K inferences (roughly 12 inferences per second). We hit the same problem with different numbers of inferences and time spans. In the past, this problem didn't happen using Triton 22.12. We updated Triton to improve robustness and reliability.
Expected behavior: We expect a robust inference server that doesn't crash during long runs.