triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

libboost_filesystem.so.1.80.0 on jetpack 5.1.2 #6844

Open. ksokolov-vaisto opened this issue 7 months ago.

ksokolov-vaisto commented 7 months ago

Description: Trying to run tritonserver2.35.0-jetpack5.1.2-update-1.tgz on an L4T R35.4.1 system with JetPack 5.1.2 results in:

```
error while loading shared libraries: libboost_filesystem.so.1.80.0: cannot open shared object file: No such file or directory
```

It looks like the highest libboost version available on JetPack 5.1.2 is 1.71.0:

```
ls /usr/lib/aarch64-linux-gnu/ | grep libboost_file
libboost_filesystem.so.1.71.0
```
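For completeness, here is a quick way to list every shared-library dependency the shipped binary fails to resolve (a sketch; the extraction path is a placeholder):

```
# Show all shared libraries the tritonserver binary cannot resolve.
ldd ./tritonserver/bin/tritonserver | grep "not found"
```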

Triton Information: tritonserver2.35.0-jetpack5.1.2-update-1.tgz

To Reproduce: Follow the install steps from jetson.md on a device with JetPack 5.1.2.

gangchen03 commented 7 months ago

I had a similar problem when building a Python-backend-only image. To reproduce:

* Clone the Triton [server repo](https://github.com/triton-inference-server/server/tree/main)
  on branch `main`, then `cd server`.

* Build a Python-only Triton server Docker image:
  ```
  sudo python3 compose.py --backend python --repoagent checksum
  ```

* Run the Triton server:
  ```
  sudo docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /path/triton/python_backend/models:/models tritonserver:latest tritonserver --model-repository=/models
  ```

The server crashed (failed to start) with the following error:

```
=============================
== Triton Inference Server ==
=============================

NVIDIA Release 24.01 (build 80100513)
Triton Server Version 2.42.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

WARNING: The NVIDIA Driver was not detected.  GPU functionality will not be available.
   Use the NVIDIA Container Toolkit to start this container with GPU support; see
   https://docs.nvidia.com/datacenter/cloud-native/ .

tritonserver: error while loading shared libraries: libboost_filesystem.so.1.80.0: cannot open shared object file: No such file or directory
```

If I use the prebuilt image, it works fine:

```
sudo docker run --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /path/triton/python_backend/models:/models nvcr.io/nvidia/tritonserver:24.01-py3 tritonserver --model-repository=/models
```

Please advise.

ksokolov-vaisto commented 7 months ago

> I had a similar problem when building a Python-backend-only image. [...] Please advise.

If you are doing the build anyway, you can fix that error by building the 1.80 version of libboost before building Triton. That's what I did to build tritonserver2.35.0-jetpack5.1.2-update-1 after the corresponding binary did not run. However, I was just hoping to be able to get a working binary :)

dyastremsky commented 6 months ago

Have you tried installing Boost 1.80.0 onto your device? There are instructions here.

We do something similar on our Jetson devices for development and testing. A couple of releases ago we started publishing Docker containers to make setup easier for users who can use them, but the tar files require users to be responsible for their own environment setup.

geometrikal commented 6 months ago

@dyastremsky do you have a Dockerfile you can share (for L4T R35.4.1), or even a recommendation for which base image to build off? Would `nvcr.io/nvidia/l4t-base:35.4.1` work?

dyastremsky commented 6 months ago

We do not test building off a public image, but you may have success building off the l4t-base image (or l4t-ml, or one of the framework images depending on the backend you need). I have been able to build off the l4t-ml image in the past.

We don't officially support building Triton in a Jetson Docker container yet. Our official route is building on Jetson directly.
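That said, if you want to experiment, an untested starting point might look like the sketch below. The base image tag, package list, and build.py flags are my assumptions, not a supported recipe:

```
# Untested sketch, not an official recipe: building Triton inside an L4T
# container. Adjust the base tag, packages, and flags for your JetPack
# version and the backends you need.
FROM nvcr.io/nvidia/l4t-ml:r35.2.1-py3

RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential git rapidjson-dev libb64-dev libre2-dev \
        libssl-dev libarchive-dev zlib1g-dev \
    && rm -rf /var/lib/apt/lists/*
# Note: build.py may also need a newer CMake than the distro provides.

RUN git clone -b main https://github.com/triton-inference-server/server.git /server
WORKDIR /server

# --no-container-build compiles directly inside this image instead of
# spawning a nested Docker build.
RUN python3 build.py --no-container-build --build-dir=/tmp/tritonbuild \
        --enable-logging --enable-stats --enable-gpu \
        --endpoint=http --endpoint=grpc --backend=python
```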

acmorrow commented 4 months ago

The prior tarball (the no longer accessible https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2.tgz) worked correctly against the base system boost libraries on Jetson Linux / JetPack 5, but the replacement one (https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2-update-1.tgz) does not, because of the upgraded boost dependency.

Effectively, this is a breaking change. If the tarball is intended for use on Jetson Linux / Jetpack 5 devices, it should be built against the system version of the required packages.

blthayer commented 4 months ago

Agreed with @acmorrow - I was using the previous tarball which is no longer accessible, and no custom boost work was required. This is a breaking change, and now I have to screw around with manually installing libboost version 1.80.0.

acmorrow commented 4 months ago

@blthayer - The situation is actually more puzzling than I realized. The old libtritonserver.so (I dug it out of an existing container image that I thankfully had not purged) didn't have any dynamic boost dependency at all:

```
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libnuma.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.11.0]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-aarch64.so.1]
```

But the new one definitely does:

```
# readelf -aW tritonserver/lib/libtritonserver.so | grep NEEDED
 0x0000000000000001 (NEEDED)             Shared library: [libboost_filesystem.so.1.80.0]
 0x0000000000000001 (NEEDED)             Shared library: [libdl.so.2]
 0x0000000000000001 (NEEDED)             Shared library: [libnuma.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libcudart.so.11.0]
 0x0000000000000001 (NEEDED)             Shared library: [libpthread.so.0]
 0x0000000000000001 (NEEDED)             Shared library: [libstdc++.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [libgcc_s.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-aarch64.so.1]
```

Furthermore, in the old one, there are boost symbols, but they are all defined and LOCAL:

```
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep boost | grep -c LOCAL
41
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep boost | grep -vc LOCAL
0
# readelf -aW /opt/tritonserver/lib/libtritonserver.so | grep boost | grep -c UND
0
```

But in the new one there are non-local undefined symbols:

```
# readelf -aW tritonserver/lib/libtritonserver.so | grep boost | grep -c LOCAL
70
# readelf -aW tritonserver/lib/libtritonserver.so | grep boost | grep -vc LOCAL
10
# readelf -aW tritonserver/lib/libtritonserver.so | grep boost | grep UND
    42: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZN5boost10filesystem6detail9canonicalERKNS0_4pathES4_PNS_6system10error_codeE
   120: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZN5boost10filesystem6detail16weakly_canonicalERKNS0_4pathES4_PNS_6system10error_codeE
   138: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZN5boost10filesystem6detail12current_pathEPNS_6system10error_codeE
 20229: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZN5boost10filesystem6detail9canonicalERKNS0_4pathES4_PNS_6system10error_codeE
 20383: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZN5boost10filesystem6detail16weakly_canonicalERKNS0_4pathES4_PNS_6system10error_codeE
 20415: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND _ZN5boost10filesystem6detail12current_pathEPNS_6system10error_codeE
```

The undefined non-local symbols are exactly the boost file system symbols that are to be satisfied by the DT_NEEDED entry for boost_filesystem.

This looks like the intended encapsulation of boost was broken. Maybe this is simply a bad build of libtritonserver.so rather than an intentional new dependency?
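For anyone who wants to see the failure exactly as the loader sees it, glibc can trace library resolution at startup (a sketch; the binary and model-repository paths are placeholders):

```
# Trace shared-library resolution; the failing boost lookup shows up in the output.
LD_DEBUG=libs ./tritonserver/bin/tritonserver --model-repository=/models 2>&1 | grep -i boost
```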

blthayer commented 4 months ago

Thanks @acmorrow for the additional details! Hopefully a project maintainer will weigh in, but seeing as this issue was opened on January 29th and it's now April 18th I wouldn't hold my breath...

acmorrow commented 4 months ago

@blthayer - Yes, not holding my breath either. However, maybe pointing out that this looks like a mistake rather than an intentional breaking change will get the relevant project maintainers to take a closer look. We will see!

dyastremsky commented 4 months ago

Thank you all for highlighting the issue. We had not realized that this breaking change occurred in the patched version. There was discussion to confirm what happened, which is why it took a few days to provide an update. We're looking into it now and hope to provide a fix soon.

Ref: DLIS-6529

acmorrow commented 4 months ago

@dyastremsky - That's great news. I'm looking forward to the update, and I'm happy we were able to get this brought to the maintainers' attention and that it will be acted on.

I'm curious: will the upcoming fix be a respin of Release 2.35.0 corresponding to NGC container 23.06 (i.e. an -update-2), or a JetPack build of a newer Triton Server release, like Release 2.44.0 corresponding to NGC container 24.03? The 2.35/23.06 release is nearly a year out of date at this point.

dyastremsky commented 4 months ago

The aim is to do a respin. Is there something wrong with the newer Jetson releases (the tar files and/or Docker containers)? We do have new releases; they were renamed to iGPU to better encompass the whole ecosystem, which includes Jetson, and that renaming may be confusing. If there is something else off, let me know, as I may be out of the loop.

Here are the latest release notes. You can access this tar file or use this container (`nvcr.io/nvidia/tritonserver:24.03-py3-igpu`).

acmorrow commented 4 months ago

@dyastremsky - I had seen the iGPU releases but did not understand that they were for Jetson as well. Perhaps the release notes should state that more prominently. A respin of the broken 2.35 Jetson tarball is definitely the right thing then, and much appreciated.

dyastremsky commented 4 months ago

Perfect, thanks for that feedback! I'll communicate it back to the team.

blthayer commented 4 months ago

Hi @dyastremsky - any updates on this?

I don't understand the Jetson packaging. JetPack 6 is not yet in general availability, yet every iGPU release (starting with version 2.40.0) is built against CUDA 12 instead of CUDA 11 (JetPack 5). Can you add some clarity around packaging for Jetson? Are there CUDA 11 variants available? Is it standard practice to publish formal releases even when the SDK they're built against isn't yet in general availability? I was hoping to be able to use TIS releases rather than having to set up all the infrastructure to build from source myself.

At any rate, this broken 2.35 Jetson release is starting to block me and my company, which uses Jetson devices extensively. I'd rather not have to add code to my CI pipeline to mess around with boost for this one-off problem, and I would be very appreciative if a fixed 2.35.0 version could be published sooner rather than later.

dyastremsky commented 4 months ago

I have been working to get this prioritized and reached out again. If it is of interest, there is additional support and guaranteed stability offered in the NVIDIA AI Enterprise program (NVAIE). There is also NVIDIA Inception, a free program for start-ups.

We're reprioritizing work to address this, but I do not have an estimate at this time on when it will be done. Based on this conversation, we also have a ticket to document some upstreams for Jetson like we do for Triton.

As far as the question of CUDA and Jetpack support, let me tag @nv-kmcgill53 who would know more about the Jetpack and CUDA support. To answer your question, our standard practice is to follow our upstream versions (e.g. our PyTorch in the iGPU container would match the NVIDIA iGPU PyTorch in the same release, like 24.04). Everything should be generally available, as far as I understand.

CC: @nvda-mesharma

blthayer commented 4 months ago

@dyastremsky - thank you for your prompt response! Looking forward to hearing more as time progresses.

On CUDA/JetPack: JetPack 6 is still in developer preview, and it is the first JetPack release that ships with CUDA 12; JetPack 5 uses CUDA 11.

Previously (ending with 2.35.0), this project released TIS tarballs for Jetson that corresponded to a JetPack version. Now, with the more generic iGPU releases, the tie to JetPack seems to have been broken: as I said, there is no GA version of JetPack with CUDA 12, yet all the iGPU releases are built against CUDA 12. So in practical terms, if you're using JetPack, you're stuck on 2.35.0 or earlier, and as discussed in this thread, the re-release of 2.35.0 is busted :frowning_face:

blthayer commented 4 months ago

On a JetPack 5 device with a TIS release > 2.35.0:

```
<path to TIS>/bin/tritonserver: error while loading shared libraries: libcudart.so.12: cannot open shared object file: No such file or directory
```
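A quick way to confirm which CUDA runtime the device actually provides (a sketch):

```
# List the CUDA runtime libraries known to the dynamic linker.
ldconfig -p | grep libcudart
# On JetPack 5 this reports libcudart.so.11.0, not the libcudart.so.12
# that the igpu releases link against.
```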

acmorrow commented 4 months ago

@dyastremsky - I agree with @blthayer; the current situation is a bit of a dilemma. There is no usable version of Triton Server for JetPack 5 right now: the old tarball was pulled, the new one doesn't work, and the container releases all target JP 6, which hasn't been released.

As it stands, I'm unable to recreate a previously released piece of JP 5 targeted software which depended on (the withdrawn) https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2.tgz, and I cannot update the build process to use https://github.com/triton-inference-server/server/releases/download/v2.35.0/tritonserver2.35.0-jetpack5.1.2-update-1.tgz since that doesn't work, nor can I use the iGPU containers since they appear to require the unreleased JP 6. Even if the containers did work, I still wouldn't be able to make another JP 5 targeted release, and I would need to update my entire build and release process to JP 6. That's something I will do eventually, but that will be a new release, etc.

Is there any update on when the respin of v2.35.0/tritonserver2.35.0-jetpack5.1.2.tgz might come? Could the old one be restored in the interim? Yes, it has a security vulnerability, but not all use cases are susceptible; for instance, based on what I've read, I do not believe that use of the embeddable library would be.

And, actually, now that I think about it, the situation is pretty bad from a security perspective: since the broken re-release was made to cure a vulnerability, no one affected by the vulnerability can re-release their affected software until the respin is complete, so the vulnerability persists.

fpetrini15 commented 3 months ago

Hello,

First off, I apologize that this issue went unnoticed for so long. Thank you for bringing it to our attention. I have this as my top priority for this week and hope to get a re-spun patch out as soon as possible that strips out the additional boost dependency.

Moving forward, we plan on restructuring our support for iGPU builds so that we can mitigate such oversights in the future. I will update this thread if anything changes.

blthayer commented 3 months ago

@acmorrow - for what it's worth, I ended up downgrading to 2.34.0 since I didn't need any of the new features in 2.35.0, and that appears to be working fine for me.

fpetrini15 commented 3 months ago

We're still investigating convenient packaging steps for a re-spin; however, from what I've found, I believe libboost_filesystem.so.1.80.0 is the only missing item. Does executing the following solve the issue for 2.35.0?

```
wget -O /tmp/boost.tar.gz \
      https://archives.boost.io/release/1.80.0/source/boost_1_80_0.tar.gz \
  && (cd /tmp && tar xzf boost.tar.gz) \
  && cd /tmp/boost_1_80_0 \
  && ./bootstrap.sh \
  && ./b2 --with-filesystem \
  && cp /tmp/boost_1_80_0/stage/lib/libboost_filesystem.so.1.80.0 /usr/lib
rm -rf /tmp/boost_1_80_0/  # cleanup
```

acmorrow commented 3 months ago

@fpetrini15 - I doubt that is actually safe to do. There is no guarantee that a libboost_filesystem.so.1.80.0 compiled that way will be ABI-compatible with the version of Boost.Filesystem that Triton Server was built against.
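Symbol presence is at least checkable, though matching symbols alone does not prove full ABI compatibility (a sketch; the paths are placeholders):

```
# Collect the undefined boost symbols libtritonserver.so needs...
nm --dynamic --undefined-only tritonserver/lib/libtritonserver.so \
    | grep boost | awk '{print $NF}' | sort -u > needed.txt
# ...and the symbols the locally built library exports.
nm --dynamic --defined-only /usr/lib/libboost_filesystem.so.1.80.0 \
    | awk '{print $NF}' | sort -u > provided.txt
# Any output here is a needed symbol the local build does not provide.
comm -23 needed.txt provided.txt
```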

fpetrini15 commented 3 months ago

Hi folks,

I've updated the 23.06 release page with the new asset: tritonserver2.35.0-jetpack5.1.2-update-2.tgz. It proved too cumbersome to remove the boost dependency, so this new asset contains the same changes as the first update but packages the dependent Boost.Filesystem shared object in a folder called `boost_filesystem`. This shared object must be added to `LD_LIBRARY_PATH` for proper operation.
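For reference, running it would look something like this (a sketch; the extraction layout is an assumption on my part):

```
tar xzf tritonserver2.35.0-jetpack5.1.2-update-2.tgz
# Make both Triton's own libraries and the bundled Boost.Filesystem
# visible to the dynamic loader.
export LD_LIBRARY_PATH=$PWD/tritonserver/lib:$PWD/tritonserver/boost_filesystem:$LD_LIBRARY_PATH
./tritonserver/bin/tritonserver --model-repository=/models
```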

@acmorrow you bring up a valid point, so I ensured that the version of Boost.Filesystem used to compile tritonserver is the one shipped with the asset.

Please let me know if this solves your issues.

acmorrow commented 3 months ago

@fpetrini15 - I hate to be the bearer of bad news again, but I don't think the offered solution, with the dynamic libboost_filesystem.so delivered alongside libtritonserver.so, is going to work: it means an application cannot use the system (or any other) version of boost_filesystem and libtritonserver.so at the same time. The prior working version of libtritonserver.so did make use of boost symbols, but they were built into libtritonserver.so and were not global/dynamic:

```
$ nm libtritonserver.so | grep boost | wc -l
41

$ nm --extern-only libtritonserver.so | grep boost | wc -l
0

$ nm --dynamic libtritonserver.so | grep boost | wc -l
0
```

If libtritonserver.so needs to make use of boost_filesystem now, then it should be interned into libtritonserver.so just like the existing boost usages (which appear mostly to be boost::intrusive and boost::interprocess).
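For what it's worth, one conventional way to get that effect at link time (a sketch, not the project's actual build setup; the object list and library path are placeholders) is to link the static archive and keep its symbols out of the dynamic symbol table:

```
# Link Boost.Filesystem statically; --exclude-libs,ALL keeps symbols pulled
# from static archives out of the dynamic symbol table, so the resulting
# library has no boost DT_NEEDED entry and exports no boost symbols.
g++ -shared -o libtritonserver.so *.o \
    -Wl,--exclude-libs,ALL \
    /path/to/boost_1_80_0/stage/lib/libboost_filesystem.a
```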

Note, though, that there are other symbol-resolution issues with libtritonserver.so as well.

acmorrow commented 2 months ago

Hi @fpetrini15 - I was wondering if you had any updates on this issue, per my notes above. My expectation was that the replacement libtritonserver.so would use a privately embedded instance of boost::filesystem, as the previous release did for other boost usages. If additional details on why the dynamic boost dependency is a problem are required, I'm happy to provide them.

acmorrow commented 1 month ago

@blthayer - Yes, ultimately I did the same downgrade and am now using 2.34 instead.