triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

unable to load shared library: libnvinfer_plugin_tensorrt_llm.so.9 using nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 #545

Open · jlewi opened this issue 1 month ago

jlewi commented 1 month ago

System Info

CPU architecture: x86; GPU: A100 (40GB)

Reproduction

  1. Follow the guide at https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md
  2. Use the image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
  3. Try to start the server; it fails with the following error:
+ '[' 1 -eq 0 ']'
+ command=serve
+ export DATADIR=/data
+ DATADIR=/data
+ export TRTDIR=/data/git_TensorRT-LLM
+ TRTDIR=/data/git_TensorRT-LLM
+ export MIXTRALDIR=/data/git_mixtral-8x7B-v0.1
+ MIXTRALDIR=/data/git_mixtral-8x7B-v0.1
+ export OUTPUTDIR=/data/tllm_checkpoint_mixtral_2gpu
+ OUTPUTDIR=/data/tllm_checkpoint_mixtral_2gpu
+ LLAMA_UNIFIED_CKPT_PATH=/data/ckpt/llama/7b/
+ LLAMA_ENGINE_PATH=/data/engines/llama/7b/
+ HF_LLAMA_MODEL=/data/git_Llama-2-7b-hf
+ case $command in
+ echo 'Starting Triton server...'
Starting Triton server...
+ export TRTLLM_ORCHESTRATOR=1
+ TRTLLM_ORCHESTRATOR=1
+ export LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs
+ LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs
+ tritonserver --model-repository=/data/models/llama_ifb
W0723 23:46:59.440299 2084 pinned_memory_manager.cc:271] "Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version"
I0723 23:46:59.440364 2084 cuda_memory_manager.cc:117] "CUDA memory pool disabled"
E0723 23:46:59.440454 2084 server.cc:243] "CudaDriverHelper has not been initialized."
I0723 23:46:59.445778 2084 model_lifecycle.cc:472] "loading: postprocessing:1"
I0723 23:46:59.445846 2084 model_lifecycle.cc:472] "loading: preprocessing:1"
I0723 23:46:59.445986 2084 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
I0723 23:46:59.446031 2084 model_lifecycle.cc:472] "loading: tensorrt_llm_bls:1"
E0723 23:46:59.448720 2084 model_lifecycle.cc:641] "failed to load 'tensorrt_llm' version 1: Not found: unable to load shared library: libnvinfer_plugin_tensorrt_llm.so.9: cannot open shared object file: No such file or directory"
I0723 23:46:59.448763 2084 model_lifecycle.cc:776] "failed to load 'tensorrt_llm'"
I0723 23:47:01.013657 2084 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)"
I0723 23:47:01.228214 2084 model_lifecycle.cc:838] "successfully loaded 'tensorrt_llm_bls'"
I0723 23:47:02.833369 2084 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
I0723 23:47:02.834550 2084 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
[TensorRT-LLM][WARNING] Don't setup 'skip_special_tokens' correctly (set value is ${skip_special_tokens}). Set it as True by default.
[TensorRT-LLM][WARNING] Don't setup 'add_special_tokens' correctly (set value is ${add_special_tokens}). Set it as True by default.
I0723 23:47:04.829608 2084 model_lifecycle.cc:838] "successfully loaded 'postprocessing'"
I0723 23:47:04.845120 2084 model_lifecycle.cc:838] "successfully loaded 'preprocessing'"
E0723 23:47:04.845236 2084 model_repository_manager.cc:614] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Not found: unable to load shared library: libnvinfer_plugin_tensorrt_llm.so.9: cannot open shared object file: No such file or directory;"
I0723 23:47:04.845354 2084 server.cc:606] 
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0723 23:47:04.845398 2084 server.cc:633] 
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------+
| Backend | Path                                                  | Config                                                                                           |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------+
| python  | /opt/tritonserver/backends/python/libtriton_python.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min- |
|         |                                                       | compute-capability":"6.000000","default-max-batch-size":"4"}}                                    |
+---------+-------------------------------------------------------+--------------------------------------------------------------------------------------------------+

I0723 23:47:04.845501 2084 server.cc:676] 
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                |
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                                 |
| preprocessing    | 1       | READY                                                                                                                                 |
| tensorrt_llm     | 1       | UNAVAILABLE: Not found: unable to load shared library: libnvinfer_plugin_tensorrt_llm.so.9: cannot open shared object file: No such f |
|                  |         | ile or directory                                                                                                                      |
| tensorrt_llm_bls | 1       | READY                                                                                                                                 |
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------+

Error: Failed to initialize NVML
W0723 23:47:04.846718 2084 metrics.cc:798] "DCGM unable to start: DCGM initialization error"
I0723 23:47:04.846881 2084 metrics.cc:770] "Collecting CPU metrics"
I0723 23:47:04.847003 2084 tritonserver.cc:2557] 
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                            |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                           |
| server_version                   | 2.46.0                                                                                                                           |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_m |
|                                  | emory cuda_shared_memory binary_tensor_data parameters statistics trace logging                                                  |
| model_repository_path[0]         | /data/models/llama_ifb                                                                                                           |
| model_control_mode               | MODE_NONE                                                                                                                        |
| strict_model_config              | 0                                                                                                                                |
| model_config_name                |                                                                                                                                  |
| rate_limit                       | OFF                                                                                                                              |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                        |
| min_supported_compute_capability | 6.0                                                                                                                              |
| strict_readiness                 | 1                                                                                                                                |
| exit_timeout                     | 30                                                                                                                               |
| cache_enabled                    | 0                                                                                                                                |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------+

I0723 23:47:04.847113 2084 server.cc:307] "Waiting for in-flight requests to complete."
I0723 23:47:04.847134 2084 server.cc:323] "Timeout 30: Found 0 model versions that have in-flight inferences"
I0723 23:47:04.847833 2084 server.cc:338] "All models are stopped, unloading models"
I0723 23:47:04.847854 2084 server.cc:347] "Timeout 30: Found 3 live models and 0 in-flight non-inference requests"
I0723 23:47:05.847986 2084 server.cc:347] "Timeout 29: Found 3 live models and 0 in-flight non-inference requests"
Cleaning up...
Cleaning up...
Cleaning up...
I0723 23:47:05.877044 2084 model_lifecycle.cc:623] "successfully unloaded 'tensorrt_llm_bls' version 1"
I0723 23:47:06.213953 2084 model_lifecycle.cc:623] "successfully unloaded 'postprocessing' version 1"
I0723 23:47:06.215902 2084 model_lifecycle.cc:623] "successfully unloaded 'preprocessing' version 1"
I0723 23:47:06.848121 2084 server.cc:347] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
error: creating server: Internal - failed to load all models
command terminated with exit code 1
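
For reference, the failing startup reduces to roughly the following invocation (a sketch only: the deployment actually runs on GKE, so the docker flags and the /data mount are assumptions based on the paths in the log above):

    # Assumed stand-alone reproduction of the failing server start.
    docker run --rm --gpus all -v /data:/data \
      nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 \
      tritonserver --model-repository=/data/models/llama_ifb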

Expected behavior

I expected the server to start.

Actual behavior

The server fails to start with the following error:

UNAVAILABLE: Not found: unable to load shared library: libnvinfer_plugin_tensorrt_llm.so.9: cannot open shared object file

Additional notes

The unversioned library libnvinfer_plugin_tensorrt_llm.so is present in the container:

 find / -name "libnvinfer_plugin_tensorrt_llm*"

find: ‘/proc/580/task/580/net’: Invalid argument
find: ‘/proc/580/net’: Invalid argument
find: ‘/proc/688/task/688/net’: Invalid argument
/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so
find: ‘/proc/688/net’: Invalid argument
command terminated with exit code 1

I have set:

export LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs
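
Note that LD_LIBRARY_PATH only adds directories to the loader's search path; the loader still looks for the exact file name being requested, here libnvinfer_plugin_tensorrt_llm.so.9. Since the pip-installed directory only ships the unversioned libnvinfer_plugin_tensorrt_llm.so, setting the path alone cannot satisfy the lookup. A quick sketch to confirm the SONAME embedded in the shipped library (path taken from the find output above):

    # If this prints libnvinfer_plugin_tensorrt_llm.so.9, a symlink or an
    # ldconfig-generated alias with that exact name is required.
    readelf -d /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so | grep SONAME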
jlewi commented 1 month ago

I was able to work around this by doing the following:

  1. Create a symbolic link:

    ln -s /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so /usr/lib/libnvinfer_plugin_tensorrt_llm.so.9
  2. Set LD_LIBRARY_PATH as follows:

    export LD_LIBRARY_PATH=/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs:/usr/local/nvidia/lib64:/opt/tritonserver/backends/tensorrtllm:/opt/tritonserver/lib

If I didn't set LD_LIBRARY_PATH, I got errors about several other libraries that could not be found.

I'm running on GKE. I believe libcuda.so.1 is provided by the driver and gets installed on the host, which might explain why it ends up in a location that the Triton server image doesn't know about and therefore requires explicit configuration. I'm not sure about the others.
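
To see exactly which dependencies remain unresolved (and therefore which directories LD_LIBRARY_PATH still needs), something like the following should help; the backend library name here is an assumption based on Triton's usual libtriton_<backend>.so naming:

    # List dependencies of the TensorRT-LLM backend that the loader
    # cannot currently resolve (the .so name is assumed).
    ldd /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | grep 'not found'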

amirbilu commented 1 month ago

This seems to be solved by adding the following to the Dockerfile:

RUN ldconfig
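
Presumably ldconfig rebuilds the loader cache and creates the versioned SONAME symlinks (e.g. libnvinfer_plugin_tensorrt_llm.so.9 -> libnvinfer_plugin_tensorrt_llm.so) for the library directories it knows about. A minimal sketch of a derived image; the ld.so.conf.d entry is an assumption and only needed if the base image does not already register the tensorrt_llm libs directory with the loader:

    FROM nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3
    # Register the pip-installed TensorRT-LLM libs dir with the dynamic
    # loader (assumed to be necessary), then rebuild the cache and the
    # SONAME symlinks so libnvinfer_plugin_tensorrt_llm.so.9 resolves.
    RUN echo "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs" > /etc/ld.so.conf.d/tensorrt_llm.conf \
     && ldconfig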