ksokolov-vaisto opened 10 months ago
I had a similar problem when building a Python-backend-only image. To reproduce:

* Clone the Triton [server repo](https://github.com/triton-inference-server/server/tree/main) on branch `main`, then `cd server`.
* Build a Python-only Triton server Docker image: `sudo python3 compose.py --backend python --repoagent checksum`
* Run the Triton server: `sudo docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /path/triton/python_backend/models:/models tritonserver:latest tritonserver --model-repository=/models`
Server crashed (failed to start) with the following error:
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 24.01 (build 80100513)
Triton Server Version 2.42.0
Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
WARNING: The NVIDIA Driver was not detected. GPU functionality will not be available.
Use the NVIDIA Container Toolkit to start this container with GPU support; see
https://docs.nvidia.com/datacenter/cloud-native/ .
tritonserver: error while loading shared libraries: libboost_filesystem.so.1.80.0: cannot open shared object file: No such file or directory
If I use a prebuilt image, it works fine:
sudo docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /path/triton/python_backend/models:/models nvcr.io/nvidia/tritonserver:24.01-py3 tritonserver --model-repository=/models
Please advise.
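For reference, here is a quick way to see what the freshly built image actually resolves (a minimal sketch; the `/opt/tritonserver` path and overriding the entrypoint with bash are assumptions based on the standard container layout):

```
# Inspect the built image's tritonserver binary for unresolved shared libraries
# and list whichever Boost libraries are present (paths are illustrative).
sudo docker run --rm --entrypoint /bin/bash tritonserver:latest -c \
  "ldd /opt/tritonserver/bin/tritonserver | grep -i 'boost\|not found'; ls /usr/lib/*/libboost* 2>/dev/null"
```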
If you are doing the build anyway, you can fix that error by building the 1.80 version of libboost before building Triton. That's what I did to build tritonserver2.35.0-jetpack5.1.2-update-1 after the corresponding binary did not run. However, I was just hoping to get a working binary :)
Have you tried installing Boost 1.80.0 onto your device? There are instructions here.
We do something similar on our Jetson devices for development and testing. A couple of releases ago, we started releasing Docker containers to make things easier for users who can use them, but the tar files require users to be responsible for their own environment setup.
@dyastremsky do you have a Dockerfile that you can share (for L4T R35.4.1) or even what base to build off? Would nvcr.io/nvidia/l4t-base:35.4.1 work?
We do not test building off a public image, but you may have success building off the l4t-base image (or l4t-ml, or one of the framework images depending on the backend you need). I have been able to build off the l4t-ml image in the past.
We don't officially support building Triton in a Jetson Docker container yet. Our official route is building on Jetson directly.
The prior tarball (the no longer accessible https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2.tgz) worked correctly against the base system boost libraries on Jetson Linux / JetPack 5, but the replacement one (https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2-update-1.tgz) does not, because of the upgraded boost dependency.
Effectively, this is a breaking change. If the tarball is intended for use on Jetson Linux / Jetpack 5 devices, it should be built against the system version of the required packages.
Agreed with @acmorrow - I was using the previous tarball which is no longer accessible, and no custom boost work was required. This is a breaking change, and now I have to screw around with manually installing libboost version 1.80.0.
@blthayer - The situation is actually more puzzling than I realized. The old libtritonserver.so (I dug it out of an existing container image that I thankfully had not purged) didn't have any dynamic boost dependency at all:
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libnuma.so.1]
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.11.0]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-aarch64.so.1]
But the new one definitely does:
# readelf -aW tritonserver/lib/libtritonserver.so| grep NEEDED
0x0000000000000001 (NEEDED) Shared library: [libboost_filesystem.so.1.80.0]
0x0000000000000001 (NEEDED) Shared library: [libdl.so.2]
0x0000000000000001 (NEEDED) Shared library: [libnuma.so.1]
0x0000000000000001 (NEEDED) Shared library: [libcudart.so.11.0]
0x0000000000000001 (NEEDED) Shared library: [libpthread.so.0]
0x0000000000000001 (NEEDED) Shared library: [libstdc++.so.6]
0x0000000000000001 (NEEDED) Shared library: [libgcc_s.so.1]
0x0000000000000001 (NEEDED) Shared library: [libc.so.6]
0x0000000000000001 (NEEDED) Shared library: [ld-linux-aarch64.so.1]
Furthermore, in the old one, there are boost symbols, but they are all defined and LOCAL:
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep boost | grep -c LOCAL
41
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep boost | grep -vc LOCAL
0
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep boost | grep -c UND
0
But in the new one there are non-local undefined symbols:
# readelf -aW tritonserver/lib/libtritonserver.so| grep boost | grep -c LOCAL
70
# readelf -aW tritonserver/lib/libtritonserver.so| grep boost | grep -vc LOCAL
10
# readelf -aW tritonserver/lib/libtritonserver.so| grep boost | grep UND
42: 0000000000000000 0 FUNC GLOBAL DEFAULT UND _ZN5boost10filesystem6detail9canonicalERKNS0_4pathES4_PNS_6system10error_codeE
120: 0000000000000000 0 FUNC GLOBAL DEFAULT UND _ZN5boost10filesystem6detail16weakly_canonicalERKNS0_4pathES4_PNS_6system10error_codeE
138: 0000000000000000 0 FUNC GLOBAL DEFAULT UND _ZN5boost10filesystem6detail12current_pathEPNS_6system10error_codeE
20229: 0000000000000000 0 FUNC GLOBAL DEFAULT UND _ZN5boost10filesystem6detail9canonicalERKNS0_4pathES4_PNS_6system10error_codeE
20383: 0000000000000000 0 FUNC GLOBAL DEFAULT UND _ZN5boost10filesystem6detail16weakly_canonicalERKNS0_4pathES4_PNS_6system10error_codeE
20415: 0000000000000000 0 FUNC GLOBAL DEFAULT UND _ZN5boost10filesystem6detail12current_pathEPNS_6system10error_codeE
The undefined non-local symbols are exactly the boost filesystem symbols that are to be satisfied by the DT_NEEDED entry for boost_filesystem.
This looks like the intended encapsulation of boost was broken. Maybe this is simply a bad build of libtritonserver.so rather than an intentionally introduced new dependency?
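For anyone who wants to reproduce the comparison, here is a small sketch (the two library paths are just the ones from my shells above and may differ on your system):

```
# Compare boost symbol visibility between the old and new builds: the old
# library should export zero dynamic boost symbols, the new one should not.
for lib in /opt/tritonserver/lib/libtritonserver.so tritonserver/lib/libtritonserver.so; do
  echo "== $lib =="
  echo "  dynamic boost symbols:   $(nm --dynamic "$lib" 2>/dev/null | grep -c boost)"
  echo "  undefined boost symbols: $(readelf -sW "$lib" | grep boost | grep -c UND)"
done
```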
Thanks @acmorrow for the additional details! Hopefully a project maintainer will weigh in, but seeing as this issue was opened on January 29th and it's now April 18th I wouldn't hold my breath...
@blthayer - Yes, not holding my breath either. However, maybe pointing out that this looks like a mistake rather than an intentional breaking change will get the relevant project maintainers to take a closer look. We will see!
Thank you all for highlighting the issue. We had not realized that this breaking change occurred in the patched version. There was discussion to confirm what happened, which is why it took a few days to provide an update. We're looking into it now and hope to provide a fix soon.
Ref: DLIS-6529
@dyastremsky - That's great news. I'm looking forward to the update, and I'm happy we were able to get this brought to the maintainers' attention and that it will be acted on.
I'm curious: will the upcoming fix be a respin of Release 2.35.0 corresponding to NGC container 23.06 (i.e., a -update-2), or a Jetpack build of a newer Triton Server release like Release 2.44.0 corresponding to NGC container 24.03? The 2.35/23.06 release is nearly a year out of date at this point.
The aim is to do a respin. Is there something wrong with the newer Jetson releases (the tar files and/or Docker containers)? We do have new releases; they were renamed to iGPU to better encompass the whole ecosystem, which includes Jetson, and the new name may be confusing. If there is something else off, let me know, as I may be out of the loop.
Here are the latest release notes. You can access this tar file or use this container (nvcr.io/nvidia/tritonserver:24.03-py3-igpu).
@dyastremsky - I had seen the iGPU releases but did not understand that they were for Jetson as well. Perhaps the release notes should state that more prominently. A respin of the broken 2.35 Jetson tarball is definitely the right thing then, and much appreciated.
Perfect, thanks for that feedback! I'll communicate it back to the team.
Hi @dyastremsky - any updates on this?
I don't understand the Jetson packaging. Jetpack 6 is not yet in general availability, but it seems that every igpu release (starting with version 2.40.0) is built against CUDA 12 instead of CUDA 11 (Jetpack 5). Can you add some clarity around packaging for Jetson? Are there CUDA 11 variants available? Is it standard practice to publish formal releases even when the SDK they're built against isn't yet in general availability? I was hoping to be able to use TIS releases rather than having to set up all the infrastructure to build from source myself.

At any rate, this broken 2.35 Jetson release is starting to block me/my company, which uses Jetson devices extensively. I'd rather not have to add code to my CI pipeline to mess around with boost for this one-off problem, and I would be very appreciative if a fixed 2.35.0 version could be published sooner rather than later.
I have been working to get this prioritized and reached out again. If it is of interest, there is additional support and guaranteed stability offered in the NVIDIA AI Enterprise program (NVAIE). There is also NVIDIA Inception, a free program for start-ups.
We're reprioritizing work to address this, but I do not have an estimate at this time on when it will be done. Based on this conversation, we also have a ticket to document some upstream versions for Jetson like we do for Triton.
As far as the question of CUDA and Jetpack support, let me tag @nv-kmcgill53 who would know more about the Jetpack and CUDA support. To answer your question, our standard practice is to follow our upstream versions (e.g. our PyTorch in the iGPU container would match the NVIDIA iGPU PyTorch in the same release, like 24.04). Everything should be generally available, as far as I understand.
CC: @nvda-mesharma
@dyastremsky - thank you for your prompt response! Looking forward to hearing more as time progresses.
On CUDA/Jetpack: JetPack 6 is still in developer preview. Jetpack 6 is the first Jetpack release that ships with CUDA 12. Jetpack 5 utilizes CUDA 11.
Previously (ending with 2.35.0), this project released TIS tarballs for Jetson that corresponded to a Jetpack version. Now, with this more generic igpu release, it seems the tie to Jetpack has been broken, because, as I said, there's no GA version of Jetpack with CUDA 12, yet all the igpu releases are built against CUDA 12. So in practical terms, if you're using Jetpack, you're stuck using 2.35.0 or earlier, and as discussed in this thread, the re-release of 2.35.0 is busted :frowning_face:

On a Jetpack 5 device with a TIS release > 2.35.0:

<path to TIS>/bin/tritonserver: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
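For anyone else hitting this, a quick way to confirm what a device actually ships (a rough sketch; exact paths can vary by Jetpack install):

```
# Jetpack 5 ships CUDA 11.x, Jetpack 6 ships CUDA 12.x. Check what's installed:
cat /etc/nv_tegra_release                 # L4T / Jetpack release string
ls /usr/local/cuda/lib64/libcudart.so.*   # the CUDA runtime the device provides
```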
@dyastremsky - I agree with @blthayer; the current situation is a bit of a dilemma. There is no usable version of Triton Server for Jetpack 5 right now, since the old one was pulled but the new one doesn't work, and the container releases are all JP 6 targeted, which hasn't been released.
As it stands, I'm unable to recreate a previously released piece of JP 5 targeted software which depended on (the withdrawn) https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2.tgz, and I cannot update the build process to use https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2-update-1.tgz since that doesn't work, nor can I use the iGPU containers since they appear to require the unreleased JP 6. Even if the containers did work, I still wouldn't be able to make another JP 5 targeted release, and I would need to update my entire build and release process to JP 6. That's something I will do eventually, but that will be a new release, etc.
Is there any update on when the respin of v2.35.0/tritonserver2.35.0-jetpack5.1.2.tgz might come? Could the old one be restored in the interim? Yes, it has a security vulnerability, but not all use cases are susceptible. For instance, based on what I've read, I do not believe that use of the embeddable library would be.
And, actually, now that I think about it, the situation is pretty bad from a security perspective. Since the broken re-release was made to cure a vulnerability, no one who is affected by the vulnerability can re-release their affected software until the respin is complete, so the vulnerability will persist.
Hello,
First off, I apologize that this issue went unnoticed for so long. Thank you for bringing it to our attention. I have this as my top priority for this week and hope to get a re-spun patch out as soon as possible that will strip out the additional boost dependency.
Moving forward, we plan on restructuring our support for igpu builds such that we can mitigate such oversights in the future. I will inform this thread if anything changes.
@acmorrow - for what it's worth, I ended up downgrading to 2.34.0 since I didn't need any of the new features in 2.35.0, and that appears to be working fine for me.
We're still investigating convenient packaging steps for a re-spin; however, from what I've found, I believe libboost_filesystem.so.1.80.0 is the only missing item. Does executing the following steps solve the issue for 2.35.0?
wget -O /tmp/boost.tar.gz \
  https://archives.boost.io/release/1.80.0/source/boost_1_80_0.tar.gz \
  && (cd /tmp && tar xzf boost.tar.gz) \
  && cd /tmp/boost_1_80_0 && ./bootstrap.sh \
  && ./b2 --with-filesystem \
  && cp /tmp/boost_1_80_0/stage/lib/libboost_filesystem.so.1.80.0 /usr/lib
rm -rf /tmp/boost_1_80_0/  # cleanup
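As a sanity check afterwards (illustrative; `<path to TIS>` is wherever the tarball was extracted), you can confirm the loader resolves the new library before launching the server:

```
sudo ldconfig                                      # refresh the linker cache after copying to /usr/lib
ldconfig -p | grep libboost_filesystem.so.1.80.0   # should now list the library
ldd <path to TIS>/bin/tritonserver | grep boost    # should resolve, not report "not found"
```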
@fpetrini15 - I doubt that is actually safe to do. There is no guarantee that a libboost_filesystem.so.1.80.0 compiled that way will be ABI compatible with the version of boost filesystem that the Triton server was built against.
Hi folks,
I've updated the 23.06 release page with the new asset: tritonserver2.35.0-jetpack5.1.2-update-2.tgz. It proved too cumbersome to remove the boost dependency, so this new asset contains the same changes as the first update but packages the dependent boost filesystem shared object in a folder called boost_filesystem. This shared object must be added to LD_LIBRARY_PATH for proper operation.
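For example (illustrative paths, assuming the tarball is extracted to /opt/tritonserver):

```
# Point the loader at the bundled Boost shared object before starting the server.
export LD_LIBRARY_PATH=/opt/tritonserver/boost_filesystem:/opt/tritonserver/lib:$LD_LIBRARY_PATH
/opt/tritonserver/bin/tritonserver --model-repository=/models
```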
@acmorrow you bring up a valid point, so I ensured the version of boost filesystem that was used to compile tritonserver is the one that is shipped with the asset.
Please let me know if this solves your issues.
@fpetrini15 - I hate to be the bearer of bad news again, but I don't think the offered solution, with the dynamic libboost_filesystem.so library delivered alongside libtritonserver.so, is going to work, because it now means that an application cannot use the system (or any other) version of boost_filesystem and libtritonserver.so at the same time. The prior working version of libtritonserver.so did make use of boost symbols, but they were built into libtritonserver.so and were not global/dynamic:
$ nm libtritonserver.so | grep boost | wc -l
41
$ nm --extern-only libtritonserver.so | grep boost | wc -l
0
$ nm --dynamic libtritonserver.so | grep boost | wc -l
0
If libtritonserver.so needs to make use of boost_filesystem now, then it should be interned into libtritonserver.so just like the existing boost usages (which appear mostly to be boost::intrusive and boost::interprocess).
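To illustrate what I mean, here is a sketch of the general linker technique only, not the actual Triton build (libexample.so and example.o are placeholders): linking a PIC-built static libboost_filesystem.a and excluding its symbols keeps them LOCAL, so no runtime libboost_filesystem.so is needed.

```
# Build boost_filesystem as a PIC static archive, then link it into the shared
# library while hiding its symbols (GNU ld); the boost symbols stay LOCAL.
./b2 --with-filesystem link=static cxxflags=-fPIC
g++ -shared -fPIC -o libexample.so example.o \
    stage/lib/libboost_filesystem.a \
    -Wl,--exclude-libs,libboost_filesystem.a
```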
Note, though, that there are other symbol resolution issues with libtritonserver.so:
Hi @fpetrini15 - I was wondering if you had any updates on this issue, per my notes above. My expectation was that the replacement libtritonserver.so would use a privately embedded instance of boost::filesystem, as the previous release did for other boost usages. If additional details on why the dynamic boost dependency is a problem are required, I'm happy to provide them.
@blthayer - Yes, ultimately I did the same downgrade and am now using 2.34 instead.
I have installed libboost and run the Docker container with volumes, but version 36.2.0 gives:

UNAVAILABLE: Invalid argument: instance group yolov10x_0 of model yolov10x specifies invalid or unsupported gpu id 0. GPUs with at least the minimum required CUDA compute compatibility of 5.300000 are

The downgraded version 35.4.1 gives me no error, except for models using a custom backend, which is not available right now in dustynv/tritonserver:r35.4.1.
Description
Trying to run tritonserver2.35.0-jetpack5.1.2-update-1.tgz on an L4T R35.4.1 system with Jetpack 5.1.2 results in:

error while loading shared libraries: libboost_system.so.1.80.0: cannot open shared object file: No such file or directory

It looks like the highest available libboost on 5.1.2 is 1.71.0.

Triton Information
tritonserver2.35.0-jetpack5.1.2-update-1.tgz

To Reproduce
Follow the install steps from jetson.md on a device with Jetpack 5.1.2.
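For reference (an assumption on my part, based on Jetpack 5.1.2 being Ubuntu 20.04 based), the distro-provided Boost version can be checked with:

```
apt-cache policy libboost-filesystem-dev   # shows the packaged version (1.71.x on Ubuntu 20.04)
dpkg -l | grep libboost-filesystem         # shows what is actually installed
```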