triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Error while loading shared libraries: libdcgm.so.2 #4965

Closed: AhmetTelceken closed this issue 1 year ago

AhmetTelceken commented 2 years ago

Description: I successfully installed Triton Inference Server on my local computer. I tried to run the server, but I got an error saying libdcgm.so.2 cannot be opened. I am sure that I successfully installed NVIDIA DCGM as well. Do you have an idea why I am getting this error?

/opt/tritonserver2/bin/tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
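A quick way to list every shared library the binary fails to resolve (standard ldd usage, with the binary path taken from the error above):

$ ldd /opt/tritonserver2/bin/tritonserver | grep "not found"
    libdcgm.so.2 => not found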


Triton Information: Triton 2.17.0 (container 21.12), Ubuntu 20.04, CUDA 11.5.0

Are you using the Triton container or did you build it yourself? - Built it myself.

Expected behavior: The server starts correctly.

rmccorm4 commented 2 years ago

Hi @AhmetTelceken ,

If you have already installed DCGM correctly, then you should make sure the required library (libdcgm.so.2 in the example above) is findable by Triton (e.g., the folder it lives in is included in LD_LIBRARY_PATH).

Example of finding libdcgm.so.2:

# May be faster to first check "find /usr -name libdcgm*"
$ find / -name libdcgm*
...
/usr/lib/x86_64-linux-gnu/libdcgm.so.2

Example of adding its directory (from above) to the searchable path:

# Assuming libdcgm.so.2 lives in /usr/lib/x86_64-linux-gnu like above
$ export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu/:${LD_LIBRARY_PATH}"
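
An alternative sketch: instead of LD_LIBRARY_PATH, register the directory with the dynamic linker cache (the dcgm.conf file name below is hypothetical, and the directory is assumed from the example above):

# Assuming libdcgm.so.2 lives in /usr/lib/x86_64-linux-gnu like above
$ echo "/usr/lib/x86_64-linux-gnu" | sudo tee /etc/ld.so.conf.d/dcgm.conf
$ sudo ldconfig
# Verify the loader can now resolve it
$ ldconfig -p | grep libdcgm
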
jbkyang-nvi commented 1 year ago

Closing issue due to lack of activity. Please re-open the issue if you would like to follow up.

ApoorveK commented 1 year ago

@jbkyang-nvi actually this issue still persists if you build images from the code repository, but not if the Docker image is pulled from NGC.

jbkyang-nvi commented 1 year ago

@jbkyang-nvi actually this issue still persists if you build images from the code repository, but not if the Docker image is pulled from NGC.

@ApoorveK Can you provide exact steps to reproduce this issue? What machine/OS are you using? Did you run build.py? Are you running Triton in a Docker container? How did you build DCGM? Can you find libdcgm.so.* in the container/machine where you are running Triton?

ApoorveK commented 1 year ago

@jbkyang-nvi hello, sorry for the late reply. For exact replication: when building from the repository for container version 22.12, after installing TF version 2.1.0 and transformers version 2.11.0, starting the Triton Inference Server fails with the error that the libdcgm.so.2.* files are missing. The issue went away when I used the older Triton container 22.10 with the same TF and transformers versions.

0xDing commented 1 year ago

@jbkyang-nvi hello, sorry for the late reply. For exact replication: when building from the repository for container version 22.12, after installing TF version 2.1.0 and transformers version 2.11.0, starting the Triton Inference Server fails with the error that the libdcgm.so.2.* files are missing. The issue went away when I used the older Triton container 22.10 with the same TF and transformers versions.

@jbkyang-nvi same issue when building with the Dockerfile below:

FROM nvcr.io/nvidia/tritonserver:22.12-pyt-python-py3
WORKDIR /app
ENV TORCH_NVCC_FLAGS "-D__CUDA_NO_HALF_OPERATORS__"
ENV TORCH_CUDA_ARCH_LIST "8.0 8.6"
ENV SAFETENSORS_FAST_GPU 1
RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y \
        bash \
        build-essential \
        libaio-dev \
        libaio1 \
        libsndfile-dev \
        libcupti-dev \
        libjpeg-dev \
        libpng-dev \
        libwebp-dev
# Note: the deepspeed requirement must be quoted so the shell does not treat ">=" as a redirect
RUN pip3 install --upgrade pip \
    && pip3 install --upgrade Pillow \
    && pip3 install triton==1.0.0 py-cpuinfo \
    && DS_BUILD_OPS=1 pip3 install "deepspeed>=0.7.6"
COPY requirements.txt .
RUN pip3 install -r requirements.txt
RUN apt-get remove -y --purge build-essential \
    && apt-get autoremove --assume-yes \
    && rm -rf /var/lib/apt/lists/* /var/cache/apt

~~ln -s /usr/lib/x86_64-linux-gnu/libdcgm.so.3 /usr/lib/x86_64-linux-gnu/libdcgm.so.2~~

apt-get install -y datacenter-gpu-manager=1:2.4.7 could fix it.
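
A minimal Dockerfile sketch of that pin, assuming the 2.x package is still available from the apt repositories configured in the image (the version string is taken from the line above):

# --allow-downgrades is only needed if a newer DCGM is already installed
RUN apt-get update \
    && apt-get install -y --allow-downgrades datacenter-gpu-manager=1:2.4.7 \
    && rm -rf /var/lib/apt/lists/*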

ApoorveK commented 1 year ago

@0xDing I believe this is one of the main libraries needed to add GPU support to the Triton image, so even though this could solve the issue, ideally it should be fixed in newer versions of the Triton image itself.

ApoorveK commented 1 year ago

Also @jbkyang-nvi @0xDing, please suggest an alternative channel for communication so that I can reach out to you all (either mail, a Slack channel, or something else), as I have some other use-case-specific queries (related to Triton Inference Server and Model Analyzer) that I believe need to be addressed.

jbkyang-nvi commented 1 year ago

@ApoorveK Github is usually the best place to ask questions. If you have proprietary models/questions you want to ask, I would suggest looking into NVIDIA AI Enterprise.

ApoorveK commented 1 year ago

@jbkyang-nvi I understand. The reason behind the above suggestion is that Triton Inference Server, being open-source, needs a proper IRC channel for the community to interact with each other, rather than everyone filing redundant individual requests to AI Enterprise that could have been solved by other community members. The issue mentioned above by @0xDing points the same way, so I guess the only help needed here is wider dependency support for custom Triton images for varied use cases.

Chasun-fhd commented 1 year ago

@jbkyang-nvi Hi, I have the same problem with Triton Server 23.01. When I use the official image locally, it works fine, but when I build my own image based on tritonserver:23.01-py3 and execute the command tritonserver --model-repository=xxx, the error shows: tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory

But the .so lib exists at /usr/lib/x86_64-linux-gnu/libdcgm.so.2.

I also tried to export the env var: LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/libdcgm.so.2:/usr/local/cuda-11.8/lib64:/opt/tritonserver/backends/onnxruntime:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64

It still doesn't work. Hoping for help.

Attaching my Dockerfile:

FROM nvcr.tencentcloudcr.com/nvidia/tritonserver:23.01-py3
MAINTAINER fenghaidong@hotmail.com
LABEL description="Tritonserver 23.01 image" by="jerry"

ENV DEBIAN_FRONTEND=noninteractive

# update apt packages
RUN apt-get -y update && apt-get -y upgrade
# basic tools (use "." instead of "source": RUN executes under /bin/sh)
RUN apt-get -y install locate curl vim unzip wget \
    && echo "export LANG=C.UTF-8" | tee -a /etc/profile && . /etc/profile \
    && echo "Asia/Shanghai" > /etc/timezone

#install aws
ADD ./.aws /root/.aws
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
    && unzip awscliv2.zip \
    && ./aws/install

# install jdk
ADD jdk-8u351-linux-x64.tar.gz /usr/local/jdk

ENV JAVA_HOME /usr/local/jdk/jdk1.8.0_351
ENV JRE_HOME /usr/local/jdk/jdk1.8.0_351/jre
ENV CLASS_PATH $JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib:$CLASSPATH
ENV PATH $JAVA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu/libdcgm.so.2:${LD_LIBRARY_PATH}

COPY id_rsa /root/.ssh/id_rsa
COPY known_hosts /root/.ssh/known_hosts

WORKDIR /inference-service

Running ldd shows it doesn't find the .so lib:

root@f644af90d314:/inference-service# ldd /opt/tritonserver/bin/tritonserver |grep libdcgm
    libdcgm.so.2 => not found
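
A hedged diagnostic sketch for a case like this: check which DCGM libraries are actually on disk and whether the loader cache still lists the .so.2 name (paths taken from the comments above):

$ ls -l /usr/lib/x86_64-linux-gnu/libdcgm.so.*
$ ldconfig -p | grep libdcgm
# If only libdcgm.so.3 shows up, an apt-get upgrade has likely replaced the 2.x package
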
paulannetts commented 1 year ago

We're seeing something similar. It seems to be related to an apt-get upgrade: if I remove this line, the image works.

This is important, as we want to apply OS-level security patches, for instance for CVE-2023-0286.

jbkyang-nvi commented 1 year ago

Thanks for reporting. I have filed a bug report with the team.

jbkyang-nvi commented 1 year ago

@paulannetts do you mean if you don't apt-get upgrade then libdcgm.so somehow gets found?

paulannetts commented 1 year ago

@paulannetts do you mean if you don't apt-get upgrade then libdcgm.so somehow gets found?

Yes, for us that was the only difference in our Dockerfile.

jbkyang-nvi commented 1 year ago

@Chasun-fhd why do you have this line?

ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu/libdcgm.so.2:${LD_LIBRARY_PATH}
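
For reference, LD_LIBRARY_PATH takes a colon-separated list of directories, not file paths, so if that entry is meant to help it would need to name the directory instead (a sketch, not a confirmed fix for this case):

ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}
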
jbkyang-nvi commented 1 year ago

@paulannetts can you share your dockerfile/commands you run on top of the Triton container?

jbkyang-nvi commented 1 year ago

Closing issue due to lack of activity. Please re-open the issue if you would like to follow up with this issue.

jbkyang-nvi commented 1 year ago

@jbkyang-nvi I understand. The reason behind the above suggestion is that Triton Inference Server, being open-source, needs a proper IRC channel for the community to interact with each other, rather than everyone filing redundant individual requests to AI Enterprise that could have been solved by other community members. The issue mentioned above by @0xDing points the same way, so I guess the only help needed here is wider dependency support for custom Triton images for varied use cases.

We have opened up https://github.com/triton-inference-server/server/discussions to facilitate discussions between users. Hopefully this will help in the future.

wq9 commented 7 months ago

This solved it for me: apt-mark hold datacenter-gpu-manager

Without the above, apt-get upgrade replaces /usr/lib/x86_64-linux-gnu/libdcgm.so.2 with .../libdcgm.so.3.
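
A sketch of applying that hold in a derived image before upgrading, so OS security patches still land without DCGM being bumped (the base tag is taken from the comments above):

FROM nvcr.io/nvidia/tritonserver:23.01-py3
# Pin DCGM so apt-get upgrade cannot swap libdcgm.so.2 for libdcgm.so.3
RUN apt-mark hold datacenter-gpu-manager \
    && apt-get update \
    && apt-get upgrade -y \
    && rm -rf /var/lib/apt/lists/*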