triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Model 'tensorrt_llm' loading failed with error: key 'use_context_fmha_for_generation' not found #7362

Closed: jasonngap1 closed this issue 1 week ago

jasonngap1 commented 1 month ago

Description: Unable to run the Triton Inference Server with TensorRT-LLM for Llama3-ChatQA-1.5-8B.

Triton Information: v2.46.0

Are you using the Triton container or did you build it yourself? Using the Triton container image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3.

To Reproduce

git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
python convert_checkpoint.py --model_dir ./Llama3-ChatQA-1.5-8B \
                             --output_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
                             --dtype float16 \
                             --weight_only_precision int8

trtllm-build --checkpoint_dir ./Llama3-ChatQA-1.5-8B-TensorRT/ \
            --output_dir ./Llama3-ChatQA-1.5-8B-compiled/ \
            --gpt_attention_plugin float16 \
            --gemm_plugin float16 \
            --max_input_len 128000

docker run --gpus=1 --rm --net=host -v .:/models nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3 tritonserver --model-repository=/models/inflight-batch-llm
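
Before launching the server, it can be worth confirming which TensorRT-LLM version produced the engine, since the backend parses the engine's config.json at startup and fails on missing keys when versions differ (see the error below). A minimal sketch in Python, assuming trtllm-build wrote its usual config.json into the output directory from the step above:

import json

# trtllm-build writes a config.json next to the engine files; the
# tensorrtllm backend reads it at startup and fails on unexpected or
# missing keys when built with a different TensorRT-LLM version.
with open("./Llama3-ChatQA-1.5-8B-compiled/config.json") as f:
    cfg = json.load(f)

print("engine builder version:", cfg.get("version", "<not recorded>"))
print("top-level keys:", sorted(cfg.keys()))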

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble, include the model configuration file for that as well). For the preprocessing model:

# Copyright 2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#  * Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
#  * Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#  * Neither the name of NVIDIA CORPORATION nor the names of its
#    contributors may be used to endorse or promote products derived
#    from this software without specific prior written permission.
#
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE COPYRIGHT OWNER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
# OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

name: "preprocessing"
backend: "python"
max_batch_size: 8
input [
    {
        name: "QUERY"
        data_type: TYPE_STRING
        dims: [ -1 ]
    },
    {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_UINT32
        dims: [ -1 ]
    },
    {
        name: "BAD_WORDS_DICT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "STOP_WORDS_DICT"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "EMBEDDING_BIAS_WORDS"
        data_type: TYPE_STRING
        dims: [ -1 ]
        optional: true
    },
    {
        name: "EMBEDDING_BIAS_WEIGHTS"
        data_type: TYPE_FP32
        dims: [ -1 ]
        optional: true
    }
]
output [
    {
        name: "INPUT_ID"
        data_type: TYPE_INT32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_INPUT_LEN"
        data_type: TYPE_INT32
        dims: [ 1 ]
    },
    {
        name: "BAD_WORDS_IDS"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
    },
    {
        name: "STOP_WORDS_IDS"
        data_type: TYPE_INT32
        dims: [ 2, -1 ]
    },
    {
        name: "EMBEDDING_BIAS"
        data_type: TYPE_FP32
        dims: [ -1 ]
    },
    {
        name: "REQUEST_OUTPUT_LEN"
        data_type: TYPE_UINT32
        dims: [ -1 ]
    }
]

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/models/Llama3-ChatQA-1.5-8B"
  }
}

parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]
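
The tokenizer_dir and tokenizer_type values above reach the Python backend through its initialize() hook. A minimal sketch of how a model.py reads them (the actual model.py shipped with tensorrtllm_backend additionally loads the tokenizer and performs the encoding):

import json

class TritonPythonModel:
    # Triton's Python backend instantiates this class and calls
    # initialize() with the config.pbtxt content serialized as JSON.
    def initialize(self, args):
        model_config = json.loads(args["model_config"])
        params = model_config["parameters"]
        self.tokenizer_dir = params["tokenizer_dir"]["string_value"]
        self.tokenizer_type = params["tokenizer_type"]["string_value"]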

For the postprocessing model:

# (Same BSD 3-Clause license header as in the preprocessing config above.)

name: "postprocessing"
backend: "python"
max_batch_size: 8
input [
  {
    name: "TOKENS_BATCH"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "SEQUENCE_LENGTH"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "CUM_LOG_PROBS"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "OUTPUT_LOG_PROBS"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
output [
  {
    name: "OUTPUT"
    data_type: TYPE_STRING
    dims: [ -1 ]
  }
]

parameters {
  key: "tokenizer_dir"
  value: {
    string_value: "/models/Llama3-ChatQA-1.5-8B"
  }
}

parameters {
  key: "tokenizer_type"
  value: {
    string_value: "auto"
  }
}

instance_group [
    {
        count: 1
        kind: KIND_CPU
    }
]
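
Conceptually, the preprocessing and postprocessing models are thin wrappers around the Hugging Face tokenizer named by tokenizer_dir; together they perform this round trip (a minimal sketch using transformers, to which tokenizer_type "auto" corresponds):

from transformers import AutoTokenizer

# tokenizer_type "auto" maps to AutoTokenizer; tokenizer_dir is the
# mounted model directory from the configs above.
tokenizer = AutoTokenizer.from_pretrained("/models/Llama3-ChatQA-1.5-8B")

# preprocessing: QUERY (string) -> INPUT_ID / REQUEST_INPUT_LEN
input_ids = tokenizer.encode("What is retrieval-augmented generation?")
request_input_len = len(input_ids)

# postprocessing: TOKENS_BATCH (token IDs) -> OUTPUT (string)
output = tokenizer.decode(input_ids, skip_special_tokens=True)
print(request_input_len, output)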

For the tensorrt_llm model:

# (Same BSD 3-Clause license header as in the preprocessing config above.)

name: "tensorrt_llm"
backend: "tensorrtllm"
max_batch_size: 8

model_transaction_policy {
  decoupled: false
}

input [
  {
    name: "input_ids"
    data_type: TYPE_INT32
    dims: [ -1 ]
    allow_ragged_batch: true
  },
  {
    name: "input_lengths"
    data_type: TYPE_INT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
  },
  {
    name: "request_output_len"
    data_type: TYPE_UINT32
    dims: [ 1 ]
  },
  {
    name: "end_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "bad_words_list"
    data_type: TYPE_INT32
    dims: [ 2, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "embedding_bias"
    data_type: TYPE_FP32
    dims: [ -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "beam_width"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "temperature"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_k"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "runtime_top_p"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "len_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "repetition_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "min_length"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "presence_penalty"
    data_type: TYPE_FP32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "random_seed"
    data_type: TYPE_UINT64
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "return_log_probs"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "stop"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "streaming"
    data_type: TYPE_BOOL
    dims: [ 1 ]
    optional: true
  },
  {
    name: "prompt_embedding_table"
    data_type: TYPE_FP16
    dims: [ -1, -1 ]
    optional: true
    allow_ragged_batch: true
  },
  {
    name: "prompt_vocab_size"
    data_type: TYPE_UINT32
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }
]
output [
  {
    name: "output_ids"
    data_type: TYPE_INT32
    dims: [ -1, -1 ]
  },
  {
    name: "sequence_length"
    data_type: TYPE_INT32
    dims: [ -1 ]
  },
  {
    name: "cum_log_probs"
    data_type: TYPE_FP32
    dims: [ -1 ]
  },
  {
    name: "output_log_probs"
    data_type: TYPE_FP32
    dims: [ -1, -1 ]
  }
]
instance_group [
  {
    count: 1
kind: KIND_CPU
  }
]
parameters: {
  key: "max_beam_width"
  value: {
    string_value: "${max_beam_width}"
  }
}
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: {
    string_value: "no"
  }
}
parameters: {
  key: "gpt_model_type"
  value: {
    string_value: "inflight_fused_batching"
  }
}
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/models/inflight-batch-llm/tensorrt_llm/1"
  }
}
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: {
    string_value: ""
  }
}
parameters: {
  key: "max_kv_cache_length"
  value: {
    string_value: "${max_kv_cache_length}"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.25"
  }
}
parameters: {
  key: "max_num_sequences"
  value: {
    string_value: "${max_num_sequences}"
  }
}
parameters: {
  key: "enable_trt_overlap"
  value: {
    string_value: "${enable_trt_overlap}"
  }
}
parameters: {
  key: "exclude_input_in_output"
  value: {
    string_value: "true"
  }
}
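
Note that several parameters above still contain unsubstituted ${...} placeholders (max_beam_width, max_kv_cache_length, max_num_sequences, enable_trt_overlap), which is why the startup log below warns that they "are not specified". The tensorrtllm_backend repository ships tools/fill_template.py for filling them in; a minimal equivalent sketch, with illustrative values only:

from string import Template

# Illustrative values only; pick limits that match the built engine.
values = {
    "max_beam_width": "1",
    "max_kv_cache_length": "4096",
    "max_num_sequences": "8",
    "enable_trt_overlap": "false",
}

path = "/models/inflight-batch-llm/tensorrt_llm/config.pbtxt"
with open(path) as f:
    filled = Template(f.read()).safe_substitute(values)  # fills ${...} keys
with open(path, "w") as f:
    f.write(filled)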

Expected behavior: I would expect the Triton endpoints to be loaded. Instead, I got an error; here are my logs:

triton-models-1  | =============================
triton-models-1  | == Triton Inference Server ==
triton-models-1  | =============================
triton-models-1  | 
triton-models-1  | NVIDIA Release 24.05 (build 95110614)
triton-models-1  | Triton Server Version 2.46.0
triton-models-1  | 
triton-models-1  | Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | Copyright (c) 2014-2024 Facebook Inc.
triton-models-1  | Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
triton-models-1  | Copyright (c) 2012-2014 Deepmind Technologies    (Koray Kavukcuoglu)
triton-models-1  | Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
triton-models-1  | Copyright (c) 2011-2013 NYU                      (Clement Farabet)
triton-models-1  | Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
triton-models-1  | Copyright (c) 2006      Idiap Research Institute (Samy Bengio)
triton-models-1  | Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
triton-models-1  | Copyright (c) 2015      Google Inc.
triton-models-1  | Copyright (c) 2015      Yangqing Jia
triton-models-1  | Copyright (c) 2013-2016 The Caffe contributors
triton-models-1  | All rights reserved.
triton-models-1  | 
triton-models-1  | Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.
triton-models-1  | 
triton-models-1  | This container image and its contents are governed by the NVIDIA Deep Learning Container License.
triton-models-1  | By pulling and using the container, you accept the terms and conditions of this license:
triton-models-1  | https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
triton-models-1  | 
triton-models-1  | I0618 08:11:13.059145 1 pinned_memory_manager.cc:275] "Pinned memory pool is created at '0x7f5008000000' with size 268435456"
triton-models-1  | I0618 08:11:13.059275 1 cuda_memory_manager.cc:107] "CUDA memory pool is created on device 0 with size 67108864"
triton-models-1  | I0618 08:11:13.060598 1 model_lifecycle.cc:472] "loading: preprocessing:1"
triton-models-1  | I0618 08:11:13.060621 1 model_lifecycle.cc:472] "loading: tensorrt_llm:1"
triton-models-1  | I0618 08:11:13.060629 1 model_lifecycle.cc:472] "loading: postprocessing:1"
triton-models-1  | [TensorRT-LLM][INFO] Initializing MPI with thread mode 3
triton-models-1  | [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set
triton-models-1  | [TensorRT-LLM][WARNING] max_beam_width is not specified, will use default value of 1
triton-models-1  | [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
triton-models-1  | [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false.
triton-models-1  | [TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to false
triton-models-1  | [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true
triton-models-1  | [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length)
triton-models-1  | [TensorRT-LLM][WARNING] enable_kv_cache_reuse is not specified, will be set to false
triton-models-1  | [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05
triton-models-1  | [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB
triton-models-1  | [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead
triton-models-1  | [TensorRT-LLM][INFO] Engine version 0.11.0.dev2024061100 found in the config file, assuming engine(s) built by new builder API.
triton-models-1  | E0618 08:11:13.251485 1 backend_model.cc:692] "ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
triton-models-1  | E0618 08:11:13.251518 1 model_lifecycle.cc:641] "failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found"
triton-models-1  | I0618 08:11:13.251528 1 model_lifecycle.cc:776] "failed to load 'tensorrt_llm'"
triton-models-1  | I0618 08:11:15.508688 1 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)"
triton-models-1  | I0618 08:11:15.764787 1 python_be.cc:2404] "TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)"
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0618 08:11:16.810643 1 model_lifecycle.cc:838] "successfully loaded 'preprocessing'"
triton-models-1  | Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
triton-models-1  | I0618 08:11:17.058231 1 model_lifecycle.cc:838] "successfully loaded 'postprocessing'"
triton-models-1  | E0618 08:11:17.058465 1 model_repository_manager.cc:614] "Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found;"
triton-models-1  | I0618 08:11:17.058509 1 server.cc:606] 
triton-models-1  | +------------------+------+
triton-models-1  | | Repository Agent | Path |
triton-models-1  | +------------------+------+
triton-models-1  | +------------------+------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.058557 1 server.cc:633] 
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Backend     | Path                                                            | Config                                                                                                                                                        |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | | tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"true","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
triton-models-1  | +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.058594 1 server.cc:676] 
triton-models-1  | +----------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Model          | Version | Status                                                                                                                                                      |
triton-models-1  | +----------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | postprocessing | 1       | READY                                                                                                                                                       |
triton-models-1  | | preprocessing  | 1       | READY                                                                                                                                                       |
triton-models-1  | | tensorrt_llm   | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'use_context_fmha_for_generation' not found |
triton-models-1  | +----------------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.099527 1 metrics.cc:877] "Collecting metrics for GPU 0: NVIDIA GeForce RTX 4090 Laptop GPU"
triton-models-1  | I0618 08:11:17.101077 1 metrics.cc:770] "Collecting CPU metrics"
triton-models-1  | I0618 08:11:17.101202 1 tritonserver.cc:2557] 
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | Option                           | Value                                                                                                                                                                                                           |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | | server_id                        | triton                                                                                                                                                                                                          |
triton-models-1  | | server_version                   | 2.46.0                                                                                                                                                                                                          |
triton-models-1  | | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
triton-models-1  | | model_repository_path[0]         | /models/inflight-batch-llm                                                                                                                                                                                      |
triton-models-1  | | model_control_mode               | MODE_NONE                                                                                                                                                                                                       |
triton-models-1  | | strict_model_config              | 0                                                                                                                                                                                                               |
triton-models-1  | | model_config_name                |                                                                                                                                                                                                                 |
triton-models-1  | | rate_limit                       | OFF                                                                                                                                                                                                             |
triton-models-1  | | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                                       |
triton-models-1  | | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                                        |
triton-models-1  | | min_supported_compute_capability | 6.0                                                                                                                                                                                                             |
triton-models-1  | | strict_readiness                 | 1                                                                                                                                                                                                               |
triton-models-1  | | exit_timeout                     | 30                                                                                                                                                                                                              |
triton-models-1  | | cache_enabled                    | 0                                                                                                                                                                                                               |
triton-models-1  | +----------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
triton-models-1  | 
triton-models-1  | I0618 08:11:17.101241 1 server.cc:307] "Waiting for in-flight requests to complete."
triton-models-1  | I0618 08:11:17.101247 1 server.cc:323] "Timeout 30: Found 0 model versions that have in-flight inferences"
triton-models-1  | I0618 08:11:17.101570 1 server.cc:338] "All models are stopped, unloading models"
triton-models-1  | I0618 08:11:17.101579 1 server.cc:347] "Timeout 30: Found 2 live models and 0 in-flight non-inference requests"
triton-models-1  | I0618 08:11:18.101795 1 server.cc:347] "Timeout 29: Found 2 live models and 0 in-flight non-inference requests"
triton-models-1  | W0618 08:11:18.107242 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
triton-models-1  | Cleaning up...
triton-models-1  | Cleaning up...
triton-models-1  | I0618 08:11:18.353797 1 model_lifecycle.cc:623] "successfully unloaded 'preprocessing' version 1"
triton-models-1  | I0618 08:11:18.479704 1 model_lifecycle.cc:623] "successfully unloaded 'postprocessing' version 1"
triton-models-1  | I0618 08:11:19.102114 1 server.cc:347] "Timeout 28: Found 0 live models and 0 in-flight non-inference requests"
triton-models-1  | W0618 08:11:19.119314 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
triton-models-1  | error: creating server: Internal - failed to load all models
triton-models-1  | W0618 08:11:20.120275 1 metrics.cc:631] "Unable to get power limit for GPU 0. Status:Success, value:0.000000"
jasonngap1 commented 1 month ago

An update: I used tensorrt_llm==0.10.0 to convert the checkpoint and compile the model, but I am now receiving the error "Assertion failed: Failed to deserialize cuda engine" when using the Triton server image nvcr.io/nvidia/tritonserver:24.05-trtllm-python-py3.
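
For context, the startup log above reports "Engine version 0.11.0.dev2024061100 found in the config file", so the engine and the backend in the 24.05 container came from different TensorRT-LLM releases. A quick way to compare the two sides (a minimal sketch; tensorrt_llm exposes __version__) is to run the following both in the build environment and inside the tritonserver container:

import tensorrt_llm

# The version used to build the engine must match the version the
# tensorrtllm backend in the serving container was built against.
print(tensorrt_llm.__version__)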

statiraju commented 1 week ago

The team is looking into the issue and will respond ASAP.

jasonngap1 commented 1 week ago

Hi @statiraju, sorry for not updating, but I have managed to solve the issue by aligning the TensorRT-LLM versions used both to compile the model and in the Triton server. Thanks!