triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found #337

Open shamikatamazon opened 7 months ago

shamikatamazon commented 7 months ago

Description

Trying to deploy Mistral-7B with Triton + TensorRT-LLM and running into this issue.

Triton Information

Are you using the Triton container or did you build it yourself?

To Reproduce

Steps to reproduce the behavior.

Converted raw weights and built engine using instructions from

Tested using and inference works successfully.

Updated all the config.pbtxt files based on the guide

Updated several other parameters in the config.pbtxt files which weren't addressed in the instructions.

Start up the Docker container:

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/ubuntu/tensorrtllm_backend:/tensorrtllm_backend -v /home/ubuntu/hf_mistral_weights:/hf_mistral_weights -v /home/ubuntu/tmp/mistral/7B/trt_engines/fp16/1-gpu/:/engines

Copy the inflight_batcher_llm files:

cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.

Launch the Triton server:

python3 /tensorrtllm_backend/scripts/ --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm

Fails with error

I0214 02:34:39.939687 128] CUDA memory pool is created on device 0 with size 67108864
I0214 02:34:39.939702 128] CUDA memory pool is created on device 1 with size 67108864
I0214 02:34:39.939706 128] CUDA memory pool is created on device 2 with size 67108864
I0214 02:34:39.939710 128] CUDA memory pool is created on device 3 with size 67108864
W0214 02:34:40.148747 128] failed to enable peer access for some device pairs
I0214 02:34:40.151177 128] loading: postprocessing:1
I0214 02:34:40.151282 128] loading: preprocessing:1
I0214 02:34:40.151367 128] loading: tensorrt_llm:1
I0214 02:34:40.151416 128] loading: tensorrt_llm_bls:1
I0214 02:34:40.161061 128] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0214 02:34:40.161084 128] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0214 02:34:40.211360 128] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
E0214 02:34:40.211909 128] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
E0214 02:34:40.211962 128] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
I0214 02:34:40.211974 128] failed to load 'tensorrt_llm'
I0214 02:34:40.455701 128] successfully loaded 'tensorrt_llm_bls'
I0214 02:34:40.713354 128] successfully loaded 'postprocessing'
I0214 02:34:40.716190 128] successfully loaded 'preprocessing'
E0214 02:34:40.716281 128] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found;
I0214 02:34:40.716342 128]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0214 02:34:40.716400 128]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/ | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/ | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+

I0214 02:34:40.716454 128]
+------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+

I0214 02:34:40.794660 128] Collecting metrics for GPU 0: NVIDIA A10G
I0214 02:34:40.794698 128] Collecting metrics for GPU 1: NVIDIA A10G
I0214 02:34:40.794704 128] Collecting metrics for GPU 2: NVIDIA A10G
I0214 02:34:40.794710 128] Collecting metrics for GPU 3: NVIDIA A10G
I0214 02:34:40.794943 128] Collecting CPU metrics
I0214 02:34:40.795147 128]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /opt/tritonserver/inflight_batcher_llm |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+

I0214 02:34:40.795170 128] Waiting for in-flight requests to complete.
I0214 02:34:40.795178 128] Timeout 30: Found 0 model versions that have in-flight inferences
I0214 02:34:40.795629 128] All models are stopped, unloading models
I0214 02:34:40.795646 128] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
I0214 02:34:41.795733 128] Timeout 29: Found 3 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
I0214 02:34:41.831995 128] successfully unloaded 'tensorrt_llm_bls' version 1
I0214 02:34:42.045779 128] successfully unloaded 'postprocessing' version 1
I0214 02:34:42.054175 128] successfully unloaded 'preprocessing' version 1
I0214 02:34:42.795816 128] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[16837,1],0] Exit code: 1
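
For anyone triaging a similar startup log, here is a rough sketch of pulling the failing model and its reason out of the model status table. The function name `unavailable_models` is made up for illustration, and the regular expression assumes the `| Model | Version | Status |` row layout that appears in the log above; other Triton versions may print the table differently.

```python
import re

# Matches status-table rows such as:
#   | tensorrt_llm | 1 | UNAVAILABLE: Internal: ... not found |
# Assumption: model names contain only word characters.
_UNAVAILABLE_ROW = re.compile(
    r"\|\s*(\w+)\s*\|\s*(\d+)\s*\|\s*UNAVAILABLE:\s*(.*?)\s*\|"
)


def unavailable_models(log_text):
    """Return (model, version, reason) for every UNAVAILABLE status row."""
    return _UNAVAILABLE_ROW.findall(log_text)
```

Run against this log, it would report only `tensorrt_llm`, with the `key 'builder_config' not found` message as the reason.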

Expected behavior

The model should be deployed successfully and be ready for inference.

nnshah1 commented 7 months ago

Question: Can you confirm that the path specified in your config.pbtxt for the GPT model directory contains a 'config.json' file with a builder_config section?

As an example:

parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tmp/engines/llama-2-7b"
  }
}

Contents of /tmp/engines/llama-2-7b:

total used in directory 6844580 available 824.9 GiB
drwxr-xr-x 3 root    root       4096 Feb  2 06:38 .
drwxr-xr-x 4 neelays dip        4096 Feb  2 06:34 ..
-rw-r--r-- 1 root    root       1566 Feb  2 06:38 config.json
-rw-r--r-- 1 root    root 7008297012 Feb  2 06:38 llama_float16_tp1_rank0.engine
-rw-r--r-- 1 root    root     527246 Feb  2 06:38 model.cache
drwxr-xr-x 2 root    root       4096 Feb  2 06:34 tokenizer

contents of config.json

"builder_config": {
  "gather_all_token_logits": false,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "int8": true,
  "lora_target_modules": [],
  "max_batch_size": 64,
  "max_beam_width": 1,
  "max_input_len": 2048,
  "max_num_tokens": null,
  "max_output_len": 512,
  "max_position_embeddings": 4096,
  "max_prompt_embedding_table_size": 0,
  "name": "llama",
  "num_heads": 32,
  "num_kv_heads": 32,
  "num_layers": 32,
  "parallel_build": false,
  "pipeline_parallel": 1,
  "precision": "float16",
  "quant_mode": 2,
  "tensor_parallel": 1,
  "use_refit": false,
  "vocab_size": 32000
},
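
A quick way to answer the question above without opening the file by hand is sketched below. It only checks which top-level key the engine's config.json carries; the function name `engine_config_layout` is hypothetical, and the second key it looks for (`build_config`) is the one reported later in this thread, not something taken from TensorRT-LLM documentation.

```python
import json
from pathlib import Path


def engine_config_layout(engine_dir):
    """Report which build-settings key the engine's config.json uses.

    engine_dir is assumed to be the directory that gpt_model_path points to.
    Returns "builder_config", "build_config", or None if neither is present.
    """
    config = json.loads((Path(engine_dir) / "config.json").read_text())
    if "builder_config" in config:
        return "builder_config"
    if "build_config" in config:
        return "build_config"
    return None
```

If this returns anything other than "builder_config", the backend that raised the error in this issue would not find the key it expects.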
shamikatamazon commented 7 months ago

@nnshah1 Thanks for your help. Interestingly enough, my config.json has a "build_config" section, not a "builder_config" section:

"build_config": {
        "max_input_len": 32256,
        "max_output_len": 1024,
        "max_batch_size": 1,
        "max_beam_width": 1,
        "max_num_tokens": 32256,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": null,
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 128,
            "use_paged_context_fmha": false,
            "use_context_fmha_for_generation": false
nnshah1 commented 7 months ago

@shamikatamazon - transferred to backend repo where triage may be faster - will monitor.

kkkumar2 commented 7 months ago

@nnshah1 I am facing the same issue. Were you able to fix it? I renamed build_config to builder_config and even added all the parameters that were reported as missing, but new issues keep popping up.

Could you please help me with this ?

phillip-kravtsov commented 7 months ago

I was running into this issue; I'm guessing it's due to a version mismatch between the Triton Inference Server and the tensorrtllm_backend. I found that building the Docker container from source on main from this repo (e.g. these instructions), then building the TRT image and running the server in that container, worked.

I was having issues with the official containers including another (

phillip-kravtsov commented 7 months ago

Also see

iibw commented 6 months ago

I've traced back this issue to a binary blob delivered with TensorRT-LLM located here.

Triton Inference Server calls tensorrtllm_backend, which parses tensorrt_llm/config.pbtxt, obtains the gpt_model_path, uses it to find config.json, confirms it is valid JSON, and then creates a new GptManager object with the config.pbtxt information. GptManager points to tensorrt_llm/batch_manager/GptManager.h, but when you go there, you find the binary blob I mentioned earlier. It seems the solution is going to be using an older version of TensorRT-LLM, one with an older version of that binary blob. Or maybe the binary blob includes another file like gptJsonConfig.h. I don't think it is actually gptJsonConfig.h, though, because if it were, my config.json would cause a check for the build_config key instead of the builder_config key and work as intended.
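
The first part of that lookup (config.pbtxt, then gpt_model_path, then config.json) can be approximated outside the backend. This is only a sketch: `find_engine_config` is a made-up name, and it uses a regular expression instead of a real protobuf text-format parser, so it assumes the `key` / `value` / `string_value` layout shown earlier in this thread.

```python
import re
from pathlib import Path

# Assumption: the parameter is written as
#   key: "gpt_model_path"  value: { string_value: "..." }
_GPT_MODEL_PATH = re.compile(
    r'key:\s*"gpt_model_path"\s*value:\s*\{\s*string_value:\s*"([^"]*)"'
)


def find_engine_config(pbtxt_text):
    """Return the config.json path implied by gpt_model_path, or None."""
    match = _GPT_MODEL_PATH.search(pbtxt_text)
    if match is None:
        return None
    return Path(match.group(1)) / "config.json"
```

Reading tensorrt_llm/config.pbtxt and passing its text to this function gives the file whose top-level keys are worth inspecting.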

iibw commented 6 months ago

Ok I figured out the issue. I followed the instructions in the README and here which use the 23.10 Triton NGC Docker container. That container was made 5 months ago for TensorRT-LLM v0.5.0. We're on v0.8.0 now.

Using the latest 24.02 Triton NGC Docker container (which supports v0.8.0), TensorRT-LLM v0.8.0, and tensorrtllm_backend pinned to the v0.8.0 tag fixed all my problems. The documentation should probably be updated for v0.8.0; there is little reason to follow documentation for a version released 5 months ago.
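
To make the version-matching point concrete, here is a small sketch of a consistency check. The table contains only the two pairings reported in this thread (23.10 with v0.5.0, 24.02 with v0.8.0) and is not an official compatibility matrix; `REPORTED_PAIRINGS` and `versions_consistent` are hypothetical names.

```python
# Container tag -> TensorRT-LLM version, as reported in this thread only.
REPORTED_PAIRINGS = {
    "23.10": "0.5.0",
    "24.02": "0.8.0",
}


def versions_consistent(container_tag, trtllm_version, backend_tag):
    """Check that all three components target the same TensorRT-LLM release.

    Returns None when the container tag is not in the table above.
    """
    expected = REPORTED_PAIRINGS.get(container_tag)
    if expected is None:
        return None
    # Accept both "0.8.0" and "v0.8.0" spellings.
    return (
        trtllm_version.lstrip("v") == expected
        and backend_tag.lstrip("v") == expected
    )
```

For example, `versions_consistent("23.10", "0.8.0", "v0.8.0")` is False, which corresponds to the mismatch described above: an engine built with v0.8.0 served from a container made for v0.5.0.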

geraldstanje commented 5 months ago

@iibw do you have a tutorial you used to build it with the 24.02 Triton NGC Docker container which supports v0.8.0? I also want to try the Llama 2 model.

Did you also follow this doc?

iibw commented 5 months ago

iirc I cloned the repo at the v0.8.0 tag, installed the v0.8.0 version of TensorRT-LLM via pip, built the LLM engine (all done locally, not in a container), then ran docker with the 24.02 Triton TRTLLM pre-built image, mounted the LLM engine & config files to the docker image, and ran Triton in the container pointing to the mounted LLM engine & config files.

geraldstanje commented 5 months ago

@iibw what's your host OS? Does it also work with Ubuntu 20.04 as the host OS? Did you see this tutorial:

iibw commented 5 months ago

@geraldstanje I used Ubuntu. I believe it was 20.04 as well. And no, it didn't exist when I was working on this.