triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Intermittent Error with Python BLS Backend Model #5701

Closed: sungyeon-neubla closed this issue 1 year ago

sungyeon-neubla commented 1 year ago

Description

I am experiencing an intermittent error with Python BLS backend models where Triton complains that it failed to increase the shared memory pool size.

Here is the exact error message from Triton:

Failed to increase the shared memory pool size for key 'triton_python_backend_shm_region_7' to 1073741824 bytes. If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Error: Connection timed out

Triton Information

What version of Triton are you using?

I am using the nvcr.io/nvidia/tritonserver:22.07-py3 Docker image.

Are you using the Triton container or did you build it yourself?

I am running the Triton Inference Server with Docker; the image is built from the following Dockerfile:

FROM nvcr.io/nvidia/tritonserver:22.07-py3

# 2022.05.13: CUDA Linux Repo GPG Key Rotation
COPY cuda-keyring_1.0-1_all.deb /cuda-keyring_1.0-1_all.deb
RUN apt-key del 7fa2af80
RUN dpkg -i /cuda-keyring_1.0-1_all.deb
RUN rm -f /etc/apt/sources.list.d/cuda*.list && \
    rm -f /etc/apt/sources.list.d/nvidia-ml.list

RUN apt-get update
RUN apt-get install -y ffmpeg libsm6 libxext6

# Install Python packages
RUN pip3 install --upgrade pip

COPY ./requirements.txt /requirements.txt
RUN pip3 install -r /requirements.txt && rm -f /requirements.txt

# Install PyTorch
RUN pip3 uninstall torch torchvision torchaudio -q -y
RUN pip3 install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 \
    -f https://download.pytorch.org/whl/cu113/torch_stable.html

# Install mmdetection
RUN MMCV_WITH_OPS=1 FORCE_CUDA=1 pip3 install mmcv-full==1.6.2 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10/index.html
RUN pip3 install openmim
RUN pip3 install mmdet

# Install mmpose
RUN pip3 install mmpose

Triton Inference Server is brought up with the following command line:

docker run --gpus '"device=2"' --rm --net=host --shm-size=16g -v /<my-home>/triton/models:/models test-triton-server:latest tritonserver --model-repository=/models

To Reproduce

Make an inference request to the Python BLS model.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

The Python BLS model does the following (a minimal sketch of this flow is shown after the list):

  1. Accepts frames of video input
  2. Runs the Object Tracking Python backend model
  3. Pre-processes the tracking result for the 2D Pose Estimation model (ResNet ONNX model on Triton)
  4. Runs the 2D Pose Estimation model (ResNet ONNX model on Triton) to get 2D kpts for the people detected by tracking
  5. Post-processes the 2D kpts and pre-processes them for the 3D Pose Estimation model (mhformer ONNX model on Triton)
  6. Runs the 3D Pose Estimation model (mhformer ONNX model on Triton)
  7. Returns the 3D kpts and bounding boxes of the detected people for each frame
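
For illustration, here is a minimal sketch of what one hop in this chain looks like with the Python backend BLS API; the composing model name "pose_estimation_2d" and the tensor names are placeholders, not my actual ones. Note that every input and output tensor of such a BLS call is exchanged through the Python backend's shared memory pool, which is the pool the error message refers to.

import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Incoming video frames from the client.
            frames = pb_utils.get_input_tensor_by_name(request, "input").as_numpy()

            # BLS call to a composing model (placeholder name and tensors);
            # the tensors travel through the backend's shared memory pool.
            bls_request = pb_utils.InferenceRequest(
                model_name="pose_estimation_2d",
                requested_output_names=["keypoints_2d"],
                inputs=[pb_utils.Tensor("input", frames)])
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(
                    bls_response.error().message())
            keypoints = pb_utils.get_output_tensor_by_name(
                bls_response, "keypoints_2d").as_numpy()

            # ... post-process and chain into the 3D model the same way ...
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("keypoints", keypoints.astype(np.float32))]))
        return responses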

And here is the model configuration:

name: "bls_3d_pose_estimation"
backend: "python"
max_batch_size: 256
input [
  {
    name: "input"
    data_type: TYPE_UINT8
    format: FORMAT_NHWC
    dims: [ -1, -1, 3 ]
  }
]

output [
  {
    name: "ids"
    data_type: TYPE_INT64
    dims: [-1]
  },
  {
    name: "keypoints"
    data_type: TYPE_FP32
    dims: [ -1, -1, -1 ]
  },
  {
    name: "frame_idx"
    data_type: TYPE_INT64
    dims: [-1, -1]
  }
]

version_policy: { all { }}
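
For reference, a client request against this configuration would look roughly like the following; the server URL and the frame count and resolution are made up for illustration:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# A batch of 16 RGB frames; the 720x1280 resolution is hypothetical.
frames = np.zeros((16, 720, 1280, 3), dtype=np.uint8)

inp = httpclient.InferInput("input", list(frames.shape), "UINT8")
inp.set_data_from_numpy(frames)

result = client.infer("bls_3d_pose_estimation", inputs=[inp])
print(result.as_numpy("ids").shape,
      result.as_numpy("keypoints").shape,
      result.as_numpy("frame_idx").shape)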

Expected behavior

--shm-size=16g should be enough, and the error should not occur.

oandreeva-nv commented 1 year ago

@sungyeon-neubla Thank you for reporting this issue!

I see you are running the 22.07 version of the container. Is it possible to try a newer version?

Alternatively, is it possible to use the --ipc=host flag in your docker run command? If so, could you please try it and see if that helps?
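
For reference, that would amount to running the container like this (same image, device, and paths as in your report):

docker run --gpus '"device=2"' --rm --net=host --ipc=host --shm-size=16g \
    -v /<my-home>/triton/models:/models \
    test-triton-server:latest tritonserver --model-repository=/models

With --ipc=host the container shares the host's /dev/shm, so the shared memory available to the Python backend is no longer capped by --shm-size.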

tanmayv25 commented 1 year ago

Closing due to lack of activity. @sungyeon-neubla, please share answers to the questions above if you are still running into this problem.