triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Encountered error: [StatusCode.INVALID_ARGUMENT] [request id: <id_unknown>] inference input 'end_id' data-type is 'INT32', but model 'tensorrt_llm' expects 'UINT32' #358

Open huaiguang opened 4 months ago

huaiguang commented 4 months ago

System Info

Linux devserver-ei 5.4.0-144-generic #161-Ubuntu SMP Fri Feb 3 14:49:04 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux


nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Sep__8_19:17:24_PDT_2023
Cuda compilation tools, release 12.3, V12.3.52
Build cuda_12.3.r12.3/compiler.33281558_0

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 23.11 (build 74978875)
Triton Server Version 2.40.0

Copyright (c) 2018-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License. By pulling and using the container, you accept the terms and conditions of this license: https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

NOTE: CUDA Forward Compatibility mode ENABLED. Using CUDA 12.3 driver version 545.23.08 with kernel driver version 525.105.17. See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

tensorrt_llm: 0.6.0
tensorrtllm_backend: tensorrtllm_backend-main

Who can help?

@kaiyux @ncomly-nvidia @ju

Information

Tasks

Reproduction

Following https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama.md; apart from the paths, the instructions are the same.

root@devserver-ei:/home/ma-user/cjx/run_tensorrt/tensorrtllm_backend-main# python scripts/launch_triton_server.py \
    --http_port 8010 \
    --grpc_port 8011 \
    --metrics_port 8012 \
    --model_repo=llama_ifb/
root@devserver-ei:/home/ma-user/cjx/run_tensorrt/tensorrtllm_backend-main#
I0229 06:56:33.489770 13115 pinned_memory_manager.cc:241] Pinned memory pool is created at '0x7f144e000000' with size 268435456
I0229 06:56:33.490316 13115 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I0229 06:56:33.493135 13115 model_lifecycle.cc:461] loading: postprocessing:1
I0229 06:56:33.493166 13115 model_lifecycle.cc:461] loading: preprocessing:1
I0229 06:56:33.493193 13115 model_lifecycle.cc:461] loading: tensorrt_llm:1
I0229 06:56:33.493222 13115 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I0229 06:56:33.501259 13115 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I0229 06:56:33.501270 13115 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I0229 06:56:33.539458 13115 python_be.cc:2389] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] batch_scheduler_policy parameter was not found or is invalid (must be max_utilization or guaranteed_no_evict)
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] max_kv_cache_length is not specified, will use default value
[TensorRT-LLM][WARNING] Parameter max_draft_len cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
I0229 06:56:33.723846 13115 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I0229 06:56:34.207574 13115 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I0229 06:56:34.210109 13115 model_lifecycle.cc:818] successfully loaded 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 128
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 64
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12856 MiB
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12898, GPU 62151 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 12900, GPU 62161 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +12852, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12922, GPU 71819 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 12923, GPU 71827 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12956, GPU 71847 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 12956, GPU 71857 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 12990, GPU 71877 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 12990, GPU 71887 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 13023, GPU 71905 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] Init cuDNN: CPU +1, GPU +12, now: CPU 13024, GPU 71917 (MiB)
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 12852 (MiB)
[TensorRT-LLM][INFO] Using 2560 tokens in paged KV cache.
I0229 06:56:46.216883 13115 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm'
I0229 06:56:46.217368 13115 model_lifecycle.cc:461] loading: ensemble:1
I0229 06:56:46.217632 13115 model_lifecycle.cc:818] successfully loaded 'ensemble'
I0229 06:56:46.217694 13115 server.cc:606]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0229 06:56:46.217736 13115 server.cc:633]
+-------------+-----------------------------------------------------------------+--------------------------------------------------------------------+
| Backend     | Path                                                            | Config                                                             |
+-------------+-----------------------------------------------------------------+--------------------------------------------------------------------+
| python      | /opt/tritonserver/backends/python/libtritonpython.so            | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","shm-region-prefix-name":"prefix0","default-max-batch-size":"4"}} |
| tensorrtllm | /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so | {"cmdline":{"auto-complete-config":"false","backend-directory":"/opt/tritonserver/backends","min-compute-capability":"6.000000","default-max-batch-size":"4"}} |
+-------------+-----------------------------------------------------------------+--------------------------------------------------------------------+

I0229 06:56:46.217762 13115 server.cc:676]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| ensemble         | 1       | READY  |
| postprocessing   | 1       | READY  |
| preprocessing    | 1       | READY  |
| tensorrt_llm     | 1       | READY  |
| tensorrt_llm_bls | 1       | READY  |
+------------------+---------+--------+

I0229 06:56:46.326711 13115 metrics.cc:817] Collecting metrics for GPU 0: NVIDIA A800-SXM4-80GB
I0229 06:56:46.363179 13115 metrics.cc:710] Collecting CPU metrics
I0229 06:56:46.363335 13115 tritonserver.cc:2483]
+----------------------------------+------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                    |
+----------------------------------+------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                   |
| server_version                   | 2.40.0                                                                                   |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data parameters statistics trace logging |
| model_repository_path[0]         | llama_ifb/                                                                               |
| model_control_mode               | MODE_NONE                                                                                |
| strict_model_config              | 1                                                                                        |
| rate_limit                       | OFF                                                                                      |
| pinned_memory_pool_byte_size     | 268435456                                                                                |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                 |
| min_supported_compute_capability | 6.0                                                                                      |
| strict_readiness                 | 1                                                                                        |
| exit_timeout                     | 30                                                                                       |
| cache_enabled                    | 0                                                                                        |
+----------------------------------+------------------------------------------------------------------------------------------+

I0229 06:56:46.364285 13115 grpc_server.cc:2469] Started GRPCInferenceService at 0.0.0.0:8011
I0229 06:56:46.364468 13115 http_server.cc:4554] Started HTTPService at 0.0.0.0:8010
I0229 06:56:46.449988 13115 http_server.cc:282] Started Metrics Service at 0.0.0.0:8012

Expected behavior

curl -X POST localhost:8010/v2/models/ensemble/generate -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": "", "pad_id": 2, "end_id": 2}'

{"cum_log_probs":0.0,"model_name":"ensemble","model_version":"1","output_log_probs":[0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0],"sequence_end":false,"sequence_id":0,"sequence_start":false,"text_output":"\nMachine learning is a subset of artificial intelligence (AI) that uses algorithms to learn from data and"}

Actual behavior

{"error":"in ensemble 'ensemble', [request id: ] unexpected deadlock, at least one output is not set while no more ensemble steps can be made"}

Additional notes

[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.

There may be a setting in llama_ifb/tensorrt_llm/config.pbtxt that can change the dtype to int32; the ensemble 'deadlock' error above looks like the surface symptom, and the underlying failure is the end_id dtype mismatch shown in the next comment. A sketch of the possible edit follows.
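For illustration only, a minimal sketch of what that edit might look like, assuming the v0.6.0 inflight-batcher config template where end_id and pad_id are declared TYPE_UINT32 (whether the backend actually accepts int32 for these fields is an assumption; the surrounding input entries are elided):

  # excerpt from llama_ifb/tensorrt_llm/config.pbtxt (other input entries omitted)
  {
    name: "end_id"
    data_type: TYPE_INT32   # was TYPE_UINT32 in the v0.6.0 template (assumption)
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  },
  {
    name: "pad_id"
    data_type: TYPE_INT32   # was TYPE_UINT32 (assumption)
    dims: [ 1 ]
    reshape: { shape: [ ] }
    optional: true
  }

The alternative is to leave the config unchanged and have the client send uint32 tensors instead, as sketched after the error output below.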

huaiguang commented 4 months ago

python inflight_batcher_llm/client/inflight_batcher_llm_client.py --request-output-len 200 --tokenizer-dir /home/ma-user/model/llama-7B --tokenizer-type llama

Input sequence: [1, 19298, 297, 6641, 29899, 23027, 3444, 29892, 1105, 7598, 16370, 408, 263]
Got completed request -- here --: [StatusCode.INVALID_ARGUMENT] [request id: ] inference input 'end_id' data-type is 'INT32', but model 'tensorrt_llm' expects 'UINT32'
Received an error from server: Encountered error: [StatusCode.INVALID_ARGUMENT] [request id: ] inference input 'end_id' data-type is 'INT32', but model 'tensorrt_llm' expects 'UINT32'
Encountered error: [StatusCode.INVALID_ARGUMENT] [request id: ] inference input 'end_id' data-type is 'INT32', but model 'tensorrt_llm' expects 'UINT32'
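For cross-checking the dtype handling, a minimal gRPC client sketch that sends end_id/pad_id as uint32 to match the 'UINT32' the server reports expecting. Assumptions: the server launched above on localhost:8011, a non-decoupled tensorrt_llm model, illustrative token ids, and the helper build_input is mine; for a decoupled (streaming) model you would use the stream APIs as inflight_batcher_llm_client.py does.

import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

def build_input(name, arr):
    # Wrap a numpy array as an InferInput with the matching Triton dtype string.
    t = grpcclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    t.set_data_from_numpy(arr)
    return t

client = grpcclient.InferenceServerClient("localhost:8011")

input_ids = np.array([[1, 19298, 297, 6641]], dtype=np.int32)  # illustrative token ids
inputs = [
    build_input("input_ids", input_ids),
    build_input("input_lengths", np.array([[input_ids.shape[1]]], dtype=np.int32)),
    build_input("request_output_len", np.array([[200]], dtype=np.uint32)),
    # np.uint32 avoids the INVALID_ARGUMENT above; switch these to np.int32
    # if config.pbtxt was instead edited to TYPE_INT32 as sketched earlier.
    build_input("end_id", np.array([[2]], dtype=np.uint32)),
    build_input("pad_id", np.array([[2]], dtype=np.uint32)),
]
result = client.infer("tensorrt_llm", inputs)
print(result.as_numpy("output_ids"))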