Open shamikatamazon opened 7 months ago
Question: Can you confirm that the GPT model directory specified in your config.pbtxt contains a 'config.json' file with a builder_config section?
As an example:
parameters: {
  key: "gpt_model_path"
  value: {
    string_value: "/tmp/engines/llama-2-7b"
  }
}
Contents of /tmp/engines/llama-2-7b:
/tmp/engines/llama-2-7b:
total used in directory 6844580 available 824.9 GiB
drwxr-xr-x 3 root root 4096 Feb 2 06:38 .
drwxr-xr-x 4 neelays dip 4096 Feb 2 06:34 ..
-rw-r--r-- 1 root root 1566 Feb 2 06:38 config.json
-rw-r--r-- 1 root root 7008297012 Feb 2 06:38 llama_float16_tp1_rank0.engine
-rw-r--r-- 1 root root 527246 Feb 2 06:38 model.cache
drwxr-xr-x 2 root root 4096 Feb 2 06:34 tokenizer
Contents of config.json:
{
  "builder_config": {
    "gather_all_token_logits": false,
    "hidden_act": "silu",
    "hidden_size": 4096,
    "int8": true,
    "lora_target_modules": [],
    "max_batch_size": 64,
    "max_beam_width": 1,
    "max_input_len": 2048,
    "max_num_tokens": null,
    "max_output_len": 512,
    "max_position_embeddings": 4096,
    "max_prompt_embedding_table_size": 0,
    "name": "llama",
    "num_heads": 32,
    "num_kv_heads": 32,
    "num_layers": 32,
    "parallel_build": false,
    "pipeline_parallel": 1,
    "precision": "float16",
    "quant_mode": 2,
    "tensor_parallel": 1,
    "use_refit": false,
    "vocab_size": 32000
  },
@nnshah1 Thanks for your help. Interestingly enough, the config.json has a "build_config" section, not a "builder_config" section:
"build_config": {
"max_input_len": 32256,
"max_output_len": 1024,
"max_batch_size": 1,
"max_beam_width": 1,
"max_num_tokens": 32256,
"max_prompt_embedding_table_size": 0,
"gather_context_logits": false,
"gather_generation_logits": false,
"strongly_typed": false,
"builder_opt": null,
"profiling_verbosity": "layer_names_only",
"plugin_config": {
"bert_attention_plugin": "float16",
"gpt_attention_plugin": "float16",
"gemm_plugin": "float16",
"smooth_quant_gemm_plugin": null,
"identity_plugin": null,
"layernorm_quantization_plugin": null,
"rmsnorm_quantization_plugin": null,
"nccl_plugin": null,
"lookup_plugin": null,
"lora_plugin": null,
"weight_only_groupwise_quant_matmul_plugin": null,
"weight_only_quant_matmul_plugin": null,
"quantize_per_token_plugin": false,
"quantize_tensor_plugin": false,
"context_fmha": true,
"context_fmha_fp32_acc": false,
"paged_kv_cache": true,
"remove_input_padding": true,
"use_custom_all_reduce": true,
"multi_block_mode": false,
"enable_xqa": true,
"attention_qk_half_accumulation": false,
"tokens_per_block": 128,
"use_paged_context_fmha": false,
"use_context_fmha_for_generation": false
}
}
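The two layouts seen so far in this thread differ in the name of the top-level build section. A minimal sketch for checking which one a given config.json uses; the function name `build_section_key` is hypothetical and only the two key names come from the configs posted above.

```python
import json
from typing import Optional


def build_section_key(config: dict) -> Optional[str]:
    """Return the name of the top-level build section, if one is present."""
    # Only these two key names appear in the configs posted in this thread.
    for key in ("builder_config", "build_config"):
        if key in config:
            return key
    return None


# Trimmed-down versions of the two config.json files shown above.
first = json.loads('{"builder_config": {"precision": "float16"}}')
second = json.loads('{"build_config": {"max_batch_size": 1}}')

print(build_section_key(first))
print(build_section_key(second))
```

Against a real engine directory, the dictionary would come from `json.load` on the config.json inside the directory that gpt_model_path points to.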
@shamikatamazon - transferred to backend repo where triage may be faster - will monitor.
@nnshah1 I am facing the same issue. Were you able to fix it? I renamed build_config to builder_config and even added all the parameters that were reported as missing, but new issues keep popping up.
Could you please help me with this?
I was running into this issue too. I'm guessing it's due to a version mismatch between the Triton Inference Server and the tensorrtllm_backend. I found that building the Docker container from source on main from this repo (e.g. following these instructions: https://github.com/kshitizgupta21/triton_trtllm_guide/blob/main/README_GPT_BYOCTritonTRTLLM.md), then building the TRT image and running the server in that container, worked.
I was having issues with the official containers, including another one (https://github.com/triton-inference-server/tensorrtllm_backend/issues/246).
I've traced this issue back to a binary blob delivered with TensorRT-LLM, located here.
Triton Inference Server calls tensorrtllm_backend, which parses tensorrt_llm/config.pbtxt, obtains the gpt_model_path, uses it to find config.json, confirms it is valid JSON, and then creates a new GptManager object with the config.pbtxt information.
https://github.com/triton-inference-server/tensorrtllm_backend/blob/49def341ca37e0db3dc8c80c99da824107a7a938/inflight_batcher_llm/src/model_instance_state.cc#L293
GptManager points to tensorrt_llm/batch_manager/GptManager.h, but when you go there, you find the binary blob I mentioned earlier.
https://github.com/triton-inference-server/tensorrtllm_backend/blob/49def341ca37e0db3dc8c80c99da824107a7a938/inflight_batcher_llm/src/model_instance_state.h#L37
It seems the solution is going to be using an older version of TensorRT-LLM, one with an older version of that binary blob. Or maybe the binary blob includes another file like gptJsonConfig.h. I don't think it is actually gptJsonConfig.h, because if it were, my config.json would cause a check for the build_config key instead of the builder_config key and work as intended.
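To make the lookup chain described above concrete, here is a rough Python sketch of it. This is not the backend's code (that is C++, and the failing check sits inside the prebuilt library); the function names, the regular expression, and the temporary directory are all assumptions made for illustration.

```python
import json
import re
import tempfile
from pathlib import Path


def find_gpt_model_path(pbtxt_text: str) -> str:
    """Pull the gpt_model_path string_value out of config.pbtxt text."""
    pattern = (
        r'key:\s*"gpt_model_path"\s*'
        r'value:\s*\{\s*string_value:\s*"([^"]*)"'
    )
    match = re.search(pattern, pbtxt_text)
    if match is None:
        raise ValueError("gpt_model_path not found")
    return match.group(1)


def read_builder_config(pbtxt_text: str) -> dict:
    """Follow gpt_model_path to config.json and read its builder_config."""
    config_path = Path(find_gpt_model_path(pbtxt_text)) / "config.json"
    config = json.loads(config_path.read_text())  # fails here if not valid JSON
    return config["builder_config"]  # KeyError if only build_config exists


def demo() -> str:
    """Reproduce the mismatch with a throwaway engine directory."""
    with tempfile.TemporaryDirectory() as engine_dir:
        config = {"build_config": {"max_batch_size": 1}}
        (Path(engine_dir) / "config.json").write_text(json.dumps(config))
        pbtxt = (
            'parameters: {\n  key: "gpt_model_path"\n'
            '  value: {\n    string_value: "%s"\n  }\n}\n' % engine_dir
        )
        try:
            read_builder_config(pbtxt)
        except KeyError as err:
            return f"key {err} not found"
    return "loaded"


print(demo())
```

The sketch fails at the same step the server log reports: the JSON parses fine, and only the lookup of the builder_config key raises.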
Ok I figured out the issue. I followed the instructions in the README and here which use the 23.10 Triton NGC Docker container. That container was made 5 months ago for TensorRT-LLM v0.5.0. We're on v0.8.0 now.
Using the latest 24.02 Triton NGC Docker container which supports v0.8.0, TensorRT-LLM v0.8.0, and tensorrtllm_backend pegged to the v0.8.0 tag fixed all my problems. The documentation should probably be updated for using v0.8.0. Little reason to use documentation for a version released 5 months ago.
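A small sanity check along these lines could catch the mismatch before launching the server. This is only a sketch: the table holds just the two pairings mentioned in this thread (23.10 with v0.5.0, 24.02 with v0.8.0), and the helper names are made up.

```python
# Pairings taken from this thread only; anything else is treated as unknown.
KNOWN_PAIRINGS = {"23.10": "0.5.0", "24.02": "0.8.0"}


def container_release(image: str) -> str:
    """Extract the release (e.g. '23.10') from an image tag like ':23.10-trtllm-python-py3'."""
    tag = image.rsplit(":", 1)[1]
    return tag.split("-", 1)[0]


def versions_match(image: str, trtllm_version: str) -> bool:
    """True only if the container release is known and pairs with the given version."""
    expected = KNOWN_PAIRINGS.get(container_release(image))
    return expected is not None and trtllm_version.lstrip("v") == expected


image = "nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3"
print(versions_match(image, "v0.8.0"))
print(versions_match(image, "v0.5.0"))
```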
@iibw Do you have a tutorial you used to build it with the 24.02 Triton NGC Docker container, which supports v0.8.0? I also want to try the Llama 2 model.
Did you also follow this doc: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md ?
iirc I cloned the repo at the v0.8.0 tag, installed the v0.8.0 version of TensorRT-LLM via pip, built the LLM engine (all done locally, not in a container), then ran docker with the 24.02 Triton TRTLLM pre-built image, mounted the LLM engine & config files to the docker image, and ran Triton in the container pointing to the mounted LLM engine & config files.
@iibw What's your host OS? Does it also work with Ubuntu 20.04 as the host OS? Did you see this tutorial: https://medium.com/trendyol-tech/deploying-a-large-language-model-llm-with-tensorrt-llm-on-triton-inference-server-a-step-by-step-d53fccc856fa ?
@geraldstanje I used Ubuntu. I believe it was 20.04 as well. And no, it didn't exist when I was working on this.
Description
Trying to deploy Mistral-7B with Triton + TensorRT-LLM and running into this issue.
Triton Information
Are you using the Triton container or did you build it yourself? nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
To Reproduce
Steps to reproduce the behavior.
Converted raw weights and built engine using instructions from https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/llama#mistral-v01
Tested using run.py and inference works successfully.
Updated all the config.pbtxt files based on the guide https://github.com/triton-inference-server/tutorials/blob/main/Popular_Models_Guide/Llama2/trtllm_guide.md
Updated several other parameters in the config.pbtxt files which weren't addressed in the instructions.
Start up the Docker container: docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /home/ubuntu/tensorrtllm_backend:/tensorrtllm_backend -v /home/ubuntu/hf_mistral_weights:/hf_mistral_weights -v /home/ubuntu/tmp/mistral/7B/trt_engines/fp16/1-gpu/:/engines nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
Copy the inflight_batcher files using the command cp -R /tensorrtllm_backend/all_models/inflight_batcher_llm /opt/tritonserver/.
Launch the Triton server using python3 /tensorrtllm_backend/scripts/launch_triton_server.py --world_size=1 --model_repo=/opt/tritonserver/inflight_batcher_llm
Fails with error:
I0214 02:34:39.939687 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0214 02:34:39.939702 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864
I0214 02:34:39.939706 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 2 with size 67108864
I0214 02:34:39.939710 128 cuda_memory_manager.cc:107] CUDA memory pool is created on device 3 with size 67108864
W0214 02:34:40.148747 128 server.cc:238] failed to enable peer access for some device pairs
I0214 02:34:40.151177 128 model_lifecycle.cc:461] loading: postprocessing:1
I0214 02:34:40.151282 128 model_lifecycle.cc:461] loading: preprocessing:1
I0214 02:34:40.151367 128 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0214 02:34:40.151416 128 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I0214 02:34:40.161061 128 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0214 02:34:40.161084 128 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0214 02:34:40.211360 128 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
E0214 02:34:40.211909 128 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
E0214 02:34:40.211962 128 model_lifecycle.cc:621] failed to load 'tensorrt_llm' version 1: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
I0214 02:34:40.211974 128 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
I0214 02:34:40.455701 128 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
I0214 02:34:40.713354 128 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0214 02:34:40.716190 128 model_lifecycle.cc:818] successfully loaded 'preprocessing'
E0214 02:34:40.716281 128 model_repository_manager.cc:563] Invalid argument: ensemble 'ensemble' depends on 'tensorrt_llm' which has no loaded version. Model 'tensorrt_llm' loading failed with error: version 1 is at UNAVAILABLE state: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found;
I0214 02:34:40.716342 128 server.cc:592]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0214 02:34:40.716400 128 server.cc:619]
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| Backend | Path | Config |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
| python | /opt/tritonserver/backends/python/libtritonpython.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","m |
| | | in-compute-capability":"6.000000","shm-region-prefix-name":"prefix0","default-max-batch-size" |
| | | :"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","m |
| | | in-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+------------------------------------------------------------------------------------------------+
I0214 02:34:40.716454 128 server.cc:662]
+------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
| Model | Version | Status |
+------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
| postprocessing | 1 | READY |
| preprocessing | 1 | READY |
| tensorrt_llm | 1 | UNAVAILABLE: Internal: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found |
| tensorrt_llm_bls | 1 | READY |
+------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------+
I0214 02:34:40.794660 128 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A10G
I0214 02:34:40.794698 128 metrics.cc:817] Collecting metrics for GPU 1: NVIDIA A10G
I0214 02:34:40.794704 128 metrics.cc:817] Collecting metrics for GPU 2: NVIDIA A10G
I0214 02:34:40.794710 128 metrics.cc:817] Collecting metrics for GPU 3: NVIDIA A10G
I0214 02:34:40.794943 128 metrics.cc:710] Collecting CPU metrics
I0214 02:34:40.795147 128 tritonserver.cc:2458]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.39.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_s |
| | hared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0] | /opt/tritonserver/inflight_batcher_llm |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| cuda_memory_pool_byte_size{1} | 67108864 |
| cuda_memory_pool_byte_size{2} | 67108864 |
| cuda_memory_pool_byte_size{3} | 67108864 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
| cache_enabled | 0 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+
I0214 02:34:40.795170 128 server.cc:293] Waiting for in-flight requests to complete.
I0214 02:34:40.795178 128 server.cc:309] Timeout 30: Found 0 model versions that have in-flight inferences
I0214 02:34:40.795629 128 server.cc:324] All models are stopped, unloading models
I0214 02:34:40.795646 128 server.cc:331] Timeout 30: Found 3 live models and 0 in-flight non-inference requests
I0214 02:34:41.795733 128 server.cc:331] Timeout 29: Found 3 live models and 0 in-flight non-inference requests
Cleaning up...
Cleaning up...
Cleaning up...
I0214 02:34:41.831995 128 model_lifecycle.cc:603] successfully unloaded 'tensorrt_llm_bls' version 1
I0214 02:34:42.045779 128 model_lifecycle.cc:603] successfully unloaded 'postprocessing' version 1
I0214 02:34:42.054175 128 model_lifecycle.cc:603] successfully unloaded 'preprocessing' version 1
I0214 02:34:42.795816 128 server.cc:331] Timeout 28: Found 0 live models and 0 in-flight non-inference requests
error: creating server: Internal - failed to load all models
Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.
mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[16837,1],0]
Exit code: 1
Expected behavior
Model should be deployed successfully and be ready for inference.
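For long startup logs like the one above, it can help to filter out just the error entries. A minimal sketch, assuming the log has one entry per line and that server lines follow the prefix pattern visible in the output above (a level letter, four date digits, a time, a process id, then file:line]); the regular expression and function name are illustrative and not part of Triton.

```python
import re

# Prefix pattern inferred from the lines in the log above, e.g.
# "E0214 02:34:40.211909 128 backend_model.cc:634] message".
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def error_messages(log_text: str) -> list:
    """Return (source file, message) pairs for every E-level entry."""
    errors = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match and match.group("level") == "E":
            errors.append((match.group("file"), match.group("msg")))
    return errors


sample = """\
I0214 02:34:40.151367 128 model_lifecycle.cc:461] loading: tensorrt_llm:1
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
E0214 02:34:40.211909 128 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [json.exception.out_of_range.403] key 'builder_config' not found
I0214 02:34:40.211974 128 model_lifecycle.cc:756] failed to load 'tensorrt_llm'
"""

for source_file, message in error_messages(sample):
    print(source_file, "->", message)
```

Lines without the prefix, such as the [TensorRT-LLM] warnings and the table rows, are simply skipped.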