triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Error when trying to deploy tuned LoRA LLM models to production #426

Closed: frankh077 closed this issue 5 months ago

frankh077 commented 5 months ago

TensorRT-LLM: v0.9.0.dev2024040900
tensorrt_llm: 24.03-vllm-python-py3

I need to deploy LoRA-tuned LLM models with Triton Server and its tensorrt_llm backend. I built the engines with TensorRT-LLM as follows:

python TensorRT-LLM/examples/llama/convert_checkpoint.py --model_dir Llama-2-7b-hf \
                     --output_dir ./tllm_checkpoint_1gpu_lora_rank \
                     --dtype float16 \
                     --tp_size 1

trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_lora_rank \
        --output_dir /tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/ \
        --gemm_plugin float16 \
        --lora_plugin float16 \
        --max_batch_size 1 \
        --max_input_len 512 \
        --max_output_len 50 \
        --lora_dir Japanese-Alpaca-LoRA-7b-v0 \
        --max_lora_rank 8 \
        --lora_target_modules attn_q attn_k attn_v

which leaves me with the following artifacts:

rank0.engine
config.json
lora/0/adapter_config.json
lora/0/adapter_model.bin

To run inference, I use the Python runner:

mpirun -n 1 python /app/tensorrt_llm/examples/run.py --engine_dir "/tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/" \
              --max_output_len 50 \
              --tokenizer_dir /workspace/Llama-2-7b-hf \
              --input_text "エクアドルの首都はどこですか?" \
              --lora_task_uids -1 0 \
              --no_add_special_tokens \
              --use_py_session

From here I want to know how to deploy these artifacts through Triton Server with the tensorrt_llm backend.
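For context, I set up the Triton model repository roughly as sketched below, following the all_models/inflight_batcher_llm example in this repo (the repo checkout path, batch size, and the exact fill_template.py parameter list are my assumptions and may differ between versions; see the backend README for the full set):

# Copy the inflight batching example model repository
cp -r tensorrtllm_backend/all_models/inflight_batcher_llm llama_ifb

# Fill in the tokenizer for preprocessing and point the tensorrt_llm model at the engine built above
python3 tensorrtllm_backend/tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt \
    tokenizer_dir:/workspace/Llama-2-7b-hf,triton_max_batch_size:1,preprocessing_instance_count:1
python3 tensorrtllm_backend/tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt \
    triton_max_batch_size:1,decoupled_mode:False,max_beam_width:1,batching_strategy:inflight_fused_batching,engine_dir:/tmp/llama_7b_with_lora_qkv/trt_engines/fp16/1-gpu/

# Launch Triton on the filled-in repository
python3 tensorrtllm_backend/scripts/launch_triton_server.py --world_size 1 --model_repo llama_ifb/

Launching the server then fails with this error: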

[TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.
[TensorRT-LLM][ERROR] 2: [engine.cpp::deserializeEngine::1148] Error Code 2: Internal Error (Assertion engine->deserialize(start, size, allocator, runtime) failed. )
E0416 14:45:07.646107 43329 backend_model.cc:691] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7f9fc02614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f9fc02850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7f9fc02850a0]
3       0x7f9fc214f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7f9fc2125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7f9fc211ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fa088466b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fa088466b62]
7       0x7fa0884673f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fa0884673f2]
8       0x7fa088459fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fa09b532296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fa09b532296]
10      0x7fa09b5334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fa09b5334d6]
11      0x7fa09b516045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fa09b516045]
12      0x7fa09b516686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fa09b516686]
13      0x7fa09b522efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fa09b522efd]
14      0x7fa09ab86ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fa09ab86ee8]
15      0x7fa09b50cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fa09b50cf0b]
16      0x7fa09b51dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fa09b51dc65]
17      0x7fa09b52231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fa09b52231e]
18      0x7fa09b6140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fa09b6140c8]
19      0x7fa09b6179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fa09b6179ac]
20      0x7fa09b76b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fa09b76b6c2]
21      0x7fa09adf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa09adf2253]
22      0x7fa09ab81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa09ab81ac3]
23      0x7fa09ac12a04 clone + 68
E0416 14:45:07.646215 43329 model_lifecycle.cc:638] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/tmp/tritonbuild/tensorrtllm/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:72)
1       0x7f9fc02614ba tensorrt_llm::common::throwRuntimeError(char const*, int, std::string const&) + 102
2       0x7f9fc02850a0 /opt/tritonserver/backends/tensorrtllm/libtensorrt_llm.so(+0x79c0a0) [0x7f9fc02850a0]
3       0x7f9fc214f742 tensorrt_llm::batch_manager::TrtGptModelInflightBatching::TrtGptModelInflightBatching(int, std::shared_ptr<nvinfer1::ILogger>, tensorrt_llm::runtime::GptModelConfig const&, tensorrt_llm::runtime::WorldConfig const&, std::vector<unsigned char, std::allocator<unsigned char> > const&, bool, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1138
4       0x7f9fc2125977 tensorrt_llm::batch_manager::TrtGptModelFactory::create(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&) + 1687
5       0x7f9fc211ce00 tensorrt_llm::batch_manager::GptManager::GptManager(std::filesystem::path const&, tensorrt_llm::batch_manager::TrtGptModelType, int, tensorrt_llm::batch_manager::batch_scheduler::SchedulerPolicy, std::function<std::list<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest>, std::allocator<std::shared_ptr<tensorrt_llm::batch_manager::InferenceRequest> > > (int)>, std::function<void (unsigned long, std::list<tensorrt_llm::batch_manager::NamedTensor, std::allocator<tensorrt_llm::batch_manager::NamedTensor> > const&, bool, std::string const&)>, std::function<std::unordered_set<unsigned long, std::hash<unsigned long>, std::equal_to<unsigned long>, std::allocator<unsigned long> > ()>, std::function<void (std::string const&)>, tensorrt_llm::batch_manager::TrtGptModelOptionalParams const&, std::optional<unsigned long>, std::optional<int>, bool) + 336
6       0x7fa088466b62 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x18b62) [0x7fa088466b62]
7       0x7fa0884673f2 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x193f2) [0x7fa0884673f2]
8       0x7fa088459fd5 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7fa09b532296 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ad296) [0x7fa09b532296]
10      0x7fa09b5334d6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1ae4d6) [0x7fa09b5334d6]
11      0x7fa09b516045 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191045) [0x7fa09b516045]
12      0x7fa09b516686 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x191686) [0x7fa09b516686]
13      0x7fa09b522efd /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19defd) [0x7fa09b522efd]
14      0x7fa09ab86ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7fa09ab86ee8]
15      0x7fa09b50cf0b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x187f0b) [0x7fa09b50cf0b]
16      0x7fa09b51dc65 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x198c65) [0x7fa09b51dc65]
17      0x7fa09b52231e /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19d31e) [0x7fa09b52231e]
18      0x7fa09b6140c8 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x28f0c8) [0x7fa09b6140c8]
19      0x7fa09b6179ac /opt/tritonserver/bin/../lib/libtritonserver.so(+0x2929ac) [0x7fa09b6179ac]
20      0x7fa09b76b6c2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3e66c2) [0x7fa09b76b6c2]
21      0x7fa09adf2253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7fa09adf2253]
22      0x7fa09ab81ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fa09ab81ac3]
23      0x7fa09ac12a04 clone + 68
I0416 14:45:07.646249 43329 model_lifecycle.cc:773] failed to load 'tensorrt_llm'

I have tried to work around this error:

[TensorRT-LLM][ERROR] 6: The engine plan file is not compatible with this version of TensorRT, expecting library version 9.2.0.5 got 9.3.0.1, please rebuild.

by rebuilding with older versions of TensorRT-LLM (v0.8.0 and v0.7.1), but their configuration files and generated artifacts (the lora folder) do not record the LoRA setup the way v0.9.0 does:

        "lora_config": {
            "lora_dir": [
                "lora/0"
            ],
            "lora_ckpt_source": "hf",
            "max_lora_rank": 8,
            "lora_target_modules": [
                "attn_q",
                "attn_k",
                "attn_v"
            ],
            "trtllm_modules_to_hf_modules": {
                "attn_q": "q_proj",
                "attn_k": "k_proj",
                "attn_v": "v_proj",
                "attn_dense": "o_proj",
                "mlp_h_to_4h": "gate_proj",
                "mlp_4h_to_h": "down_proj",
                "mlp_gate": "up_proj"
            }
        },
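To check whether this is purely a library-version mismatch between where I build the engine and where I serve it, I compare the installed versions in both environments with a quick check like this (a minimal sketch):

# Run in both the engine-build environment and the Triton serving container;
# the TensorRT version used to build the engine has to match the one the backend runtime expects.
python3 -c "import tensorrt; print('TensorRT:', tensorrt.__version__)"
python3 -c "import tensorrt_llm; print('TensorRT-LLM:', tensorrt_llm.__version__)"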

I'm new to deploying LoRA-tuned LLM models. Am I doing something wrong? Has anyone managed to deploy these models with Triton Server and make them available through an endpoint?

byshiue commented 5 months ago

This is not related to LoRA; it is the same issue as https://github.com/triton-inference-server/tensorrtllm_backend/issues/424. Closing this one.