triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found #337

shamikatamazon opened this issue 7 months ago

shamikatamazon commented 7 months ago

Description
Trying to deploy Mistral-7B with Triton + TensorRT-LLM and running into this issue.

Triton Information

Are you using the Triton container or did you build it yourself?
Using the official container: nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

To Reproduce
Steps to reproduce the behavior:

Converted the raw weights and built the engine using the instructions from https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#mistral-v01

Tested using run.py and inference works successfully.

Updated all the config.pbtxt files based on the guide https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md

Updated several other parameters in the config.pbtxt files which weren't addressed in the instructions.
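
The parameters in question look roughly like the block below. This is only a sketch: the parameter names are taken from the backend warnings in the log further down and from the example later in this thread, the values shown are assumptions, and /engines is simply the mount point used in the docker run command below.

# Sketch only - parameter names come from the backend warnings below; values are placeholders
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/engines"
  }
}
parameters: {
  key: "batch_scheduler_policy"
  value: {
    string_value: "max_utilization"
  }
}
parameters: {
  key: "kv_cache_free_gpu_mem_fraction"
  value: {
    string_value: "0.85"
  }
}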

Started the Docker container:

docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/ubuntu/tensorrtllm_backend:/tensorrtllm_backend -v /home/ubuntu/hf_mistral_weights:/hf_mistral_weights -v /home/ubuntu/tmp/mistral/7B/trt_engines/fp16/1-gpu/:/engines nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3

Copied the inflight_batcher_llm models into the server's model repository:

cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.

Launched the Triton server:

python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm

The server fails with the following error:

I0214 02:34:39.939687 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0214 02:34:39.939702 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0214 02:34:39.939706 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0214 02:34:39.939710 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
W0214 02:34:40.148747 128 server.cc:238] failed to enable peer access for some device pairs
I0214 02:34:40.151177 128 model_lifecycle.cc:461] loading: postprocessing:1
I0214 02:34:40.151282 128 model_lifecycle.cc:461] loading: preprocessing:1
I0214 02:34:40.151367 128 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0214 02:34:40.151416 128 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I0214 02:34:40.161061 128 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0214 02:34:40.161084 128 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0214 02:34:40.211360 128 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
E0214 02:34:40.211909 128 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
E0214 02:34:40.211962 128 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
I0214 02:34:40.211974 128 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
I0214 02:34:40.455701 128 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
I0214 02:34:40.713354 128 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0214 02:34:40.716190 128 model_lifecycle.cc:818] successfully loaded 'preprocessing'
E0214 02:34:40.716281 128 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found;
I0214 02:34:40.716342 128 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0214 02:34:40.716400 128 server.cc:619]
+-------------+-----------------------------------------------------------------+------------------------------------------------+
| Backend     | Path                                                            | Config                                         |
+-------------+-----------------------------------------------------------------+------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtritonpython.so            | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------+

I0214 02:34:40.716454 128 server.cc:662]
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------+
| Model            | Version | Status                                                                                                                                      |
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing   | 1       | READY                                                                                                                                       |
| preprocessing    | 1       | READY                                                                                                                                       |
| tensorrt_llm     | 1       | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found |
| tensorrt_llm_bls | 1       | READY                                                                                                                                       |
+------------------+---------+---------------------------------------------------------------------------------------------------------------------------------------------+

I0214 02:34:40.794660 128 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A10G
I0214 02:34:40.794698 128 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A10G
I0214 02:34:40.794704 128 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA A10G
I0214 02:34:40.794710 128 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA A10G
I0214 02:34:40.794943 128 metrics.cc:710] Collecting CPU metrics
I0214 02:34:40.795147 128 tritonserver.cc:2458]
+----------------------------------+----------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                        |
+----------------------------------+----------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                       |
| server_version                   | 2.39.0                                                                                       |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | /opt/tritonserver/inflight_batcher_llm                                                       |
| model_control_mode               | MODE_NONE                                                                                    |
| strict_model_config              | 1                                                                                            |
| rate_limit                       | OFF                                                                                          |
| pinned_memory_pool_byte_size     | 268435456                                                                                    |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{1}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{2}    | 67108864                                                                                     |
| cuda_memory_pool_byte_size{3}    | 67108864                                                                                     |
| min_supported_compute_capability | 6.0                                                                                          |
| strict_readiness                 | 1                                                                                            |
| exit_timeout                     | 30                                                                                           |
| cache_enabled                    | 0                                                                                            |
+----------------------------------+----------------------------------------------------------------------------------------------+

I0214 02:34:40.795170 128 server.cc:293] Waiting for in-flight requests to complete.
I0214 02:34:40.795178 128 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I0214 02:34:40.795629 128 server.cc:324] All models are stopped, unloading models
I0214 02:34:40.795646 128 server.cc:331] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
I0214 02:34:41.795733 128 server.cc:331] Timeout 29: Found 3 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
I0214 02:34:41.831995 128 model_lifecycle.cc:603] successfully unloaded 'tensorrt_llm_bls' version 1
I0214 02:34:42.045779 128 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I0214 02:34:42.054175 128 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I0214 02:34:42.795816 128 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[16837,1],0]
Exit code: 1

Expected behavior
Model should be deployed successfully and be ready for inference.

nnshah1 commented 7 months ago

Question: Can you confirm that the path specified in your config.pbtxt for the GPT model directory has a 'config.json' file with a builder_config?

For example:

parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tmp/engines/llama-2-7b"
  }
}

Contents of /tmp/engines/llama-2-7b:

/tmp/engines/llama-2-7b:
total used in directory 6844580 available 824.9 GiB
drwxr-xr-x 3 root    root       4096 Feb  2 06:38 .
drwxr-xr-x 4 neelays dip        4096 Feb  2 06:34 ..
-rw-r--r-- 1 root    root       1566 Feb  2 06:38 config.json
-rw-r--r-- 1 root    root 7008297012 Feb  2 06:38 llama_float16_tp1_rank0.engine
-rw-r--r-- 1 root    root     527246 Feb  2 06:38 model.cache
drwxr-xr-x 2 root    root       4096 Feb  2 06:34 tokenizer

Contents of config.json:

{
  "builder_config": {
    "gather_all_token_logits": false,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "int8": true,
    "lora_target_modules": [],
    "max_batch_size": 64,
    "max_beam_width": 1,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 4096,
    "max_prompt_embedding_table_size": 0,
    "name": "llama",
    "num_heads": 32,
    "num_kv_heads": 32,
    "num_layers": 32,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 2,
    "tensor_parallel": 1,
    "use_refit": false,
    "vocab_size": 32000
  },
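
One quick way to check which top-level keys the engine's config.json actually contains (a sketch; adjust the path to wherever the engine directory is mounted, e.g. the /engines mount from the docker run command above):

# Print the top-level keys of the engine config (uses only the Python standard library)
python3 -c "import json; print(list(json.load(open('/engines/config.json')).keys()))"

The error in the log indicates the backend is looking for a top-level 'builder_config' key.
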
shamikatamazon commented 7 months ago

> Question: Can you confirm that the path specified in your config.pbtxt for the GPT model directory has a 'config.json' file with a builder_config?

@nnshah1 Thanks for your help. Interestingly enough, my config.json has a "build_config" section, not a "builder_config" section:

"build_config": {
        "max_input_len": 32256,
        "max_output_len": 1024,
        "max_batch_size": 1,
        "max_beam_width": 1,
        "max_num_tokens": 32256,
        "max_prompt_embedding_table_size": 0,
        "gather_context_logits": false,
        "gather_generation_logits": false,
        "strongly_typed": false,
        "builder_opt": null,
        "profiling_verbosity": "layer_names_only",
        "plugin_config": {
            "bert_attention_plugin": "float16",
            "gpt_attention_plugin": "float16",
            "gemm_plugin": "float16",
            "smooth_quant_gemm_plugin": null,
            "identity_plugin": null,
            "layernorm_quantization_plugin": null,
            "rmsnorm_quantization_plugin": null,
            "nccl_plugin": null,
            "lookup_plugin": null,
            "lora_plugin": null,
            "weight_only_groupwise_quant_matmul_plugin": null,
            "weight_only_quant_matmul_plugin": null,
            "quantize_per_token_plugin": false,
            "quantize_tensor_plugin": false,
            "context_fmha": true,
            "context_fmha_fp32_acc": false,
            "paged_kv_cache": true,
            "remove_input_padding": true,
            "use_custom_all_reduce": true,
            "multi_block_mode": false,
            "enable_xqa": true,
            "attention_qk_half_accumulation": false,
            "tokens_per_block": 128,
            "use_paged_context_fmha": false,
            "use_context_fmha_for_generation": false
        }
    }
nnshah1 commented 7 months ago

@shamikatamazon - transferred to the backend repo, where triage may be faster; will monitor.

kkkumar2 commented 7 months ago

@nnshah1 I am facing the same issue. Were you able to fix it? I renamed build_config to builder_config and even added all the parameters that were reported as missing, but new issues keep popping up.

Could you please help me with this ?

phillip-kravtsov commented 7 months ago

I was running into this issue too; I'm guessing it's due to a version mismatch between the Triton Inference Server and the tensorrtllm_backend. I found that building the Docker container from source on main from this repo (e.g. following these instructions: https://github.com/kshitizgupta21/triton_trtllm_guide/blob/main/README_GPT_BYOCTritonTRTLLM.md), then building the TRT image and running the server in that container, worked.

I was having other issues with the official containers as well, including https://github.com/triton-inference-server/tensorrtllm_backend/issues/246.
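
For reference, building the backend image from source at the time looked roughly like the sketch below. The Dockerfile path and image tag are assumptions and may differ between versions of this repo, so treat it as an outline rather than exact instructions.

# Clone the backend repo with its submodules (the TensorRT-LLM submodule uses git-lfs)
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend
git lfs install
git submodule update --init --recursive

# Build the backend container from source (Dockerfile path and tag are assumptions)
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm -f dockerfile/Dockerfile.trt_llm_backend .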

phillip-kravtsov commented 7 months ago

Also see https://github.com/triton-inference-server/tensorrtllm_backend/issues/330

iibw commented 6 months ago

I've traced this issue back to a binary blob shipped with TensorRT-LLM, located here.

Triton Inference Server calls tensorrtllm_backend, which parses tensorrt_llm/config.pbtxt, obtains the gpt_model_path, uses it to find config.json and confirm it is valid JSON, and then creates a new GptManager object with the config.pbtxt information:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/49def341ca37e0db3dc8c80c99da824107a7a938/inflight_batcher_llm/src/model_instance_state.cc#L293

GptManager points to tensorrt_llm/batch_manager/GptManager.h, but when you go there, you find the binary blob I mentioned earlier:
https://github.com/triton-inference-server/tensorrtllm_backend/blob/49def341ca37e0db3dc8c80c99da824107a7a938/inflight_batcher_llm/src/model_instance_state.h#L37

It seems like the solution is going to be using an older version of TensorRT-LLM, one with an older version of that binary blob, or perhaps the binary blob includes another file like gptJsonConfig.h. I don't think it is actually gptJsonConfig.h, because if it were, my config.json would trigger a check for the build_config key instead of the builder_config key and work as intended.

iibw commented 6 months ago

OK, I figured out the issue. I followed the instructions in the README and here, which use the 23.10 Triton NGC Docker container. That container was made 5 months ago for TensorRT-LLM v0.5.0; we're on v0.8.0 now.

Using the latest 24.02 Triton NGC Docker container (which supports v0.8.0), TensorRT-LLM v0.8.0, and tensorrtllm_backend pinned to the v0.8.0 tag fixed all my problems. The documentation should probably be updated for v0.8.0; there is little reason to follow documentation for a version released 5 months ago.
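
A quick way to confirm whether the container and the engine-build environment agree on the TensorRT-LLM version (a sketch; it assumes the tensorrt_llm Python package exposes __version__):

# Version shipped inside the Triton TRT-LLM container
docker run --rm nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3 \
  python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

# Version in the environment where the engine was built
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"

If the two differ (e.g. v0.5.0 in the 23.10 container vs. a v0.8.0 local install), the engine's config.json layout and the backend's expectations can diverge, which is exactly the builder_config/build_config mismatch above.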

geraldstanje commented 5 months ago

@iibw do you have a tutorial you used to build it with the 24.02 Triton NGC Docker container that supports v0.8.0? I also want to try a Llama 2 model.

Did you also follow this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md ?

iibw commented 5 months ago

> do you have a tutorial you used to build it with the 24.02 Triton NGC Docker container that supports v0.8.0? Did you also follow this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md ?

IIRC I cloned the repo at the v0.8.0 tag, installed the v0.8.0 version of TensorRT-LLM via pip, built the LLM engine (all done locally, not in a container), then ran Docker with the 24.02 Triton TRT-LLM pre-built image, mounted the LLM engine and config files into the container, and ran Triton in the container pointing at the mounted engine and config files.
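
Roughly, that version-matched flow looks like the sketch below. It is an outline, not an exact recipe: the pip index URL and the engine-build step are assumptions that should be checked against the v0.8.0 documentation, and the host paths are placeholders.

# Pin the backend repo to the same release as the container
git clone -b v0.8.0 https://github.com/triton-inference-server/tensorrtllm_backend.git

# Install the matching TensorRT-LLM release on the host and build the engine with the
# v0.8.0 example scripts (engine-build details omitted; output assumed under ./engines)
pip3 install tensorrt_llm==0.8.0 --extra-index-url https://pypi.nvidia.com

# Run the matching 24.02 container, mounting the repo and the built engine
docker run --rm -it --net host --shm-size=2g --gpus all \
  -v $(pwd)/tensorrtllm_backend:/tensorrtllm_backend \
  -v $(pwd)/engines:/engines \
  nvcr.io/nvidia/tritonserver:24.02-trtllm-python-py3

# Inside the container: copy the model repo, point config.pbtxt at /engines, and launch
cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm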

geraldstanje commented 5 months ago

@iibw What's your host OS? Does it also work with Ubuntu 20.04 as the host OS? Did you see this tutorial: https://medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa?

iibw commented 5 months ago

@geraldstanje I used Ubuntu, and I believe it was 20.04 as well. And no, that tutorial didn't exist when I was working on this.