Closed AhmetTelceken closed 1 year ago
Hi @AhmetTelceken ,
If you have already installed DCGM correctly, then you should make sure the required library libdcgm.so.2 in the example above is findable by Triton (ex: the folder it lives in is included in LD_LIBRARY_PATH).
Example of finding libdcgm.so.2:
# May be faster to first check "find /usr -name libdcgm*"
$ find / -name libdcgm*
...
/usr/lib/x86_64-linux-gnu/libdcgm.so.2
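Another quick check, assuming ldconfig is available in the image, is to ask the dynamic loader cache directly:
# Lists any libdcgm entries the loader already knows about
$ ldconfig -p | grep libdcgm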
Example of adding its directory (from above) to the searchable path:
# Assuming libdcgm.so.2 lives in /usr/lib/x86_64-linux-gnu like above
$ export LD_LIBRARY_PATH="/usr/lib/x86_64-linux-gnu/:${LD_LIBRARY_PATH}"
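To confirm the export worked, one option (a sketch, assuming tritonserver is installed under /opt/tritonserver as in the error messages later in this thread) is to check that the loader now resolves the library:
# Should show a resolved path for libdcgm.so.2 instead of "not found"
$ ldd /opt/tritonserver/bin/tritonserver | grep libdcgm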
Closing issue due to lack of activity. Please re-open the issue if you would like to follow up.
@jbkyang-nvi actually this issue still persists if you built the image from the code repository, but not if the Docker image is pulled from NGC.
@ApoorveK Can you provide exact steps to reproduce this issue? What machine/OS are you using? Did you run build.py? Are you running Triton in a docker container? How did you build DCGM? Can you find libdcgm.so.* in the container/machine you are running Triton?
@jbkyang-nvi hello, sorry for the late reply. For exact replication: when building from the repository for container version 22.12, after installing TF version 2.1.0 and transformers version 2.11.0, starting the Triton Inference Server gives the error that the libdcgm.so.2.* files are missing. The issue went away when I used the lower Triton container version 22.10 with the same TF and transformers versions.
@jbkyang-nvi same issue with the following Dockerfile:
FROM nvcr.io/nvidia/tritonserver:22.12-pyt-python-py3
WORKDIR /app
ENV TORCH_NVCC_FLAGS "-D__CUDA_NO_HALF_OPERATORS__"
ENV TORCH_CUDA_ARCH_LIST "8.0 8.6"
ENV SAFETENSORS_FAST_GPU 1
RUN apt-get update \
    && apt-get upgrade -y \
    && apt-get install -y \
        bash \
        build-essential \
        libaio-dev \
        libaio1 \
        libsndfile-dev \
        libcupti-dev \
        libjpeg-dev \
        libpng-dev \
        libwebp-dev
RUN pip3 install --upgrade pip \
    && pip3 install --upgrade Pillow \
    && pip3 install triton==1.0.0 py-cpuinfo \
    && DS_BUILD_OPS=1 pip3 install deepspeed>=0.7.6
COPY requirements.txt .
RUN pip3 install -r requirements.txt
RUN apt-get remove -y --purge build-essential \
    && apt-get autoremove --assume-yes \
    && rm -rf /var/lib/apt/lists/* /var/cache/apt
~~ln -s /usr/lib/x86_64-linux-gnu/libdcgm.so.3 /usr/lib/x86_64-linux-gnu/libdcgm.so.2~~
apt-get install -y datacenter-gpu-manager=1:2.4.7
could fix it
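For reference, a quick way to check which DCGM package and library files are actually present in the image (standard Debian/Ubuntu tooling, nothing Triton-specific):
$ dpkg -l | grep datacenter-gpu-manager
$ ls -l /usr/lib/x86_64-linux-gnu/libdcgm.so*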
@0xDing I believe this is one of the main libraries needed to add GPU support to the Triton image, so even though this works around the issue, ideally it should be fixed in the newer Triton image itself.
Also @jbkyang-nvi @0xDing please suggest an alternative channel for communication so that I can reach out to you (mail, a Slack channel, or something else), as I have some other use-case-specific queries (related to Triton Inference Server and Model Analyzer) that I believe need to be addressed.
@ApoorveK Github is usually the best place to ask questions. If you have proprietary models/questions you want to ask, I would suggest looking into Nvidia AI Enterprise.
@jbkyang-nvi I understand. The reason behind the above suggestion is that Triton Inference Server, being open source, needs a proper IRC channel for the community to interact with each other, rather than filing redundant individual requests to AI Enterprise that could have been solved by other community members. The same goes for the issue mentioned by @0xDing above, so I guess the only help needed here is wider dependency support in the custom Triton image for varied use cases.
@jbkyang-nvi Hi, I have the same problem with Triton Server 23.01. When I use the official image locally, it works fine, but when I build my own image based on tritonserver:23.01-py3 and execute the command tritonserver --model-repository=xxx,
the error shows:
tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
But the shared library exists at /usr/lib/x86_64-linux-gnu/libdcgm.so.2.
I also tried to export the env:
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu/libdcgm.so.2:/usr/local/cuda-11.8/lib64:/opt/tritonserver/backends/onnxruntime:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
It still doesn't work. Hoping for help ~
Attaching my Dockerfile:
FROM nvcr.tencentcloudcr.com/nvidia/tritonserver:23.01-py3
MAINTAINER fenghaidong@hotmail.com
LABEL description="Tritonserver 23.01 image" by="jerry"
ENV DEBIAN_FRONTEND=noninteractive
#update apt-get
RUN apt-get -y update && apt-get -y upgrade
#basic
RUN apt-get -y install locate curl vim unzip wget \
    && echo "export LANG=C.UTF-8" | tee -a /etc/profile && source /etc/profile \
    && echo "Asia/Shanghai" > /etc/timezone
#install aws
ADD ./.aws /root/.aws
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" \
&& unzip awscliv2.zip \
&& ./aws/install
# install jdk
ADD jdk-8u351-linux-x64.tar.gz /usr/local/jdk
ENV JAVA_HOME /usr/local/jdk/jdk1.8.0_351
ENV JRE_HOME /usr/local/jdk/jdk1.8.0_351/jre
ENV CLASS_PATH $JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:$JRE_HOME/lib:$CLASSPATH
ENV PATH $JAVA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu/libdcgm.so.2:${LD_LIBRARY_PATH}
COPY id_rsa /root/.ssh/id_rsa
COPY known_hosts /root/.ssh/known_hosts
WORKDIR /inference-service
This ldd command shows it didn't find the shared library:
root@f644af90d314:/inference-service# ldd /opt/tritonserver/bin/tritonserver |grep libdcgm
libdcgm.so.2 => not found
We're seeing something similar - it seems to be related to apt-get upgrade - if I remove that line, the image works.
This is important as we want to apply OS-level security patches, for instance for CVE-2023-0286.
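One way to see up front whether an upgrade would touch the DCGM package (assuming an Ubuntu-based image):
$ apt list --upgradable 2>/dev/null | grep -i datacenter-gpu-manager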
Thanks for reporting. I have filed a bug report with the team.
@paulannetts do you mean if you don't apt-get upgrade then libdcgm.so somehow gets found?
Yes, for us that was the only difference in our Dockerfile.
@Chasun-fhd why do you have this line?
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu/libdcgm.so.2:${LD_LIBRARY_PATH}
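Note that LD_LIBRARY_PATH is a colon-separated list of directories, not individual .so files, so pointing it at libdcgm.so.2 itself has no effect. A sketch of the directory form, matching the earlier export example:
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}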
@paulannetts can you share your dockerfile/commands you run on top of the Triton container?
Closing issue due to lack of activity. Please re-open the issue if you would like to follow up with this issue.
We have opened up https://github.com/triton-inference-server/server/discussions in hopes that this would help facilitate discussions between users. Hopefully this will help in the future.
This solved it for me: apt-mark hold datacenter-gpu-manager
Without the above, apt-get upgrade makes /usr/lib/x86_64-linux-gnu/libdcgm.so.2 become .../libdcgm.so.3
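A minimal sketch of where that hold could go in a Dockerfile, assuming an official Triton base image like the ones above, so a later upgrade does not replace libdcgm.so.2:
FROM nvcr.io/nvidia/tritonserver:23.01-py3
# Pin DCGM before upgrading so libdcgm.so.2 is not swapped for libdcgm.so.3
RUN apt-mark hold datacenter-gpu-manager \
    && apt-get update \
    && apt-get upgrade -y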
Description
I successfully installed Triton Inference Server on my local computer. I tried to run the server but I got the error that libdcgm.so.2 cannot be opened. I am sure that I successfully installed NVIDIA DCGM as well. Do you have an idea why I am getting the error?
/opt/tritonserver2/bin/tritonserver: error while loading shared libraries: libdcgm.so.2: cannot open shared object file: No such file or directory
Triton Information
Triton 2.17.0 (container 21.12), Ubuntu 20.04, CUDA 11.5.0
Are you using the Triton container or did you build it yourself? - Built it myself.
Expected behavior
Starting correctly.