triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

Launch Triton server error occurred #266

Open Burning-XX opened 7 months ago

Burning-XX commented 7 months ago
My GPU Config

[screenshot: GPU configuration]

TensorRT Engine Build Command

python3 build.py --model_dir /opt/llms/llama-7b --dtype float16 --remove_input_padding --load_by_shard --use_gpt_attention_plugin float16 --enable_context_fmha --use_gemm_plugin float16 --use_inflight_batching --output_dir /opt/trtModel/llama/1-gpu

I already have a built TensorRT engine (llama) and am trying to launch it with the Triton server (main branch), but the errors below occurred and I am confused.

Launch Triton Server Command

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=triton_model_repo

Error Message

I1228 08:05:24.547254 57912 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864
I1228 08:05:24.552840 57912 model_lifecycle.cc:461] loading: postprocessing:1
I1228 08:05:24.552905 57912 model_lifecycle.cc:461] loading: preprocessing:1
I1228 08:05:24.552981 57912 model_lifecycle.cc:461] loading: tensorrt_llm:1
I1228 08:05:24.553145 57912 model_lifecycle.cc:461] loading: tensorrt_llm_bls:1
I1228 08:05:24.564589 57912 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0)
I1228 08:05:24.564623 57912 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0)
I1228 08:05:24.632242 57912 python_be.cc:2199] TRITONBACKEND_ModelInstanceInitialize: tensorrt_llm_bls_0_0 (CPU device 0)
[TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value
[TensorRT-LLM][WARNING] The batch scheduler policy will be set to guaranteed_no_evict since the backend operates in decoupled mode
[TensorRT-LLM][WARNING] kv_cache_free_gpu_mem_fraction is not specified, will use default value of 0.85 or max_tokens_in_paged_kv_cache
[TensorRT-LLM][WARNING] max_num_sequences is not specified, will be set to the TRT engine max_batch_size
[TensorRT-LLM][WARNING] enable_trt_overlap is not specified, will be set to true
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be number, but is null
[TensorRT-LLM][WARNING] Optional value for parameter max_num_tokens will not be set.
[TensorRT-LLM][INFO] Initializing MPI with thread mode 1
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
I1228 08:05:24.917601 57912 model_lifecycle.cc:818] successfully loaded 'tensorrt_llm_bls'
I1228 08:05:25.190698 57912 model_lifecycle.cc:818] successfully loaded 'postprocessing'
I1228 08:05:25.207966 57912 pb_stub.cc:325] Failed to initialize Python stub: IndexError: list index out of range
At:
  /opt/tensorrt_llm/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(81): initialize
E1228 08:05:25.568128 57912 backend_model.cc:634] ERROR: Failed to create instance: IndexError: list index out of range
At:
  /opt/tensorrt_llm/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(81): initialize
E1228 08:05:25.568247 57912 model_lifecycle.cc:621] failed to load 'preprocessing' version 1: Internal: IndexError: list index out of range
At:
  /opt/tensorrt_llm/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py(81): initialize
I1228 08:05:25.568271 57912 model_lifecycle.cc:756] failed to load 'preprocessing'
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 16
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 8
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 1
[TensorRT-LLM][INFO] Loaded engine size: 12855 MiB
[TensorRT-LLM][ERROR] 1: [stdArchiveReader.cpp::stdArchiveReaderInitCommon::47] Error Code 1: Serialization (Serialization assertion stdVersionRead == kSERIALIZATION_VERSION failed. Version tag does not match.
Note: Current Version: 226, Serialized Engine Version: 228)
E1228 08:05:36.306362 57912 backend_model.cc:634] ERROR: Failed to create instance: unexpected error when creating modelInstanceState: [TensorRT-LLM][ERROR] Assertion failed: Failed to deserialize cuda engine (/app/tensorrt_llm/cpp/tensorrt_llm/runtime/tllmRuntime.cpp:74)
1       0x7f77a27ff645 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x17645) [0x7f77a27ff645]
2       0x7f77a28d572c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0xed72c) [0x7f77a28d572c]
3       0x7f77a284ef4e /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x66f4e) [0x7f77a284ef4e]
4       0x7f77a283ec0c /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x56c0c) [0x7f77a283ec0c]
5       0x7f77a28395f5 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x515f5) [0x7f77a28395f5]
6       0x7f77a28374db /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x4f4db) [0x7f77a28374db]
7       0x7f77a281b182 /opt/tritonserver/backends/tensorrtllm/libtriton_tensorrtllm.so(+0x33182) [0x7f77a281b182]
8       0x7f77a281b235 TRITONBACKEND_ModelInstanceInitialize + 101
9       0x7f780439aa86 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a4a86) [0x7f780439aa86]
10      0x7f780439bcc6 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1a5cc6) [0x7f780439bcc6]
11      0x7f780437ec15 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x188c15) [0x7f780437ec15]
12      0x7f780437f256 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x189256) [0x7f780437f256]
13      0x7f780438b27d /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19527d) [0x7f780438b27d]
14      0x7f78039f9ee8 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8) [0x7f78039f9ee8]
15      0x7f780437597b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x17f97b) [0x7f780437597b]
16      0x7f7804385695 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18f695) [0x7f7804385695]
17      0x7f780438a50b /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19450b) [0x7f780438a50b]
18      0x7f7804473610 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x27d610) [0x7f7804473610]
19      0x7f7804476d03 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x280d03) [0x7f7804476d03]
20      0x7f78045c38b2 /opt/tritonserver/bin/../lib/libtritonserver.so(+0x3cd8b2) [0x7f78045c38b2]
21      0x7f7803c64253 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253) [0x7f7803c64253]
22      0x7f78039f4ac3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7f78039f4ac3]
23      0x7f7803a86a40 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x126a40) [0x7f7803a86a40]
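
The log shows two separate failures: the preprocessing model dies with an IndexError at model.py(81), and the tensorrt_llm model fails to deserialize the engine because of a serialization version mismatch (Current Version: 226, Serialized Engine Version: 228), i.e. the engine was built with a newer TensorRT than the one in the serving container. A minimal sketch for comparing the two versions, assuming the tensorrt Python package is importable in both the build and serving environments:

import tensorrt as trt

# Run this once in the engine-build container and once in the Triton
# serving container; the reported versions must match for the engine
# to deserialize.
print(trt.__version__)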

More Details

[screenshot]

Burning-XX commented 6 months ago

Here is the triton_model_repo/preprocessing/config.pbtxt content:

[screenshot: preprocessing/config.pbtxt]

Burning-XX commented 6 months ago

Here is the triton_model_repo/tensorrt_llm/config.pbtxt content:

[screenshot: tensorrt_llm/config.pbtxt]

Burning-XX commented 6 months ago

Here is the triton_model_repo/postprocessing/config.pbtxt content:

[screenshot: postprocessing/config.pbtxt]

byshiue commented 6 months ago

You could check the code at /opt/tensorrt_llm/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py (line 81) to investigate what the error is.

Burning-XX commented 6 months ago

> You could check the code at /opt/tensorrt_llm/tensorrtllm_backend/triton_model_repo/preprocessing/1/model.py (line 81) to investigate what the error is.

It seems the pad_token variable is not set when loading the llama model.

[screenshot]

I checked the /opt/llms/llama-7b/tokenizer_config.json file. What should I do to make it work?

[screenshot: tokenizer_config.json]
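
A quick way to check this outside Triton is a small sketch that assumes the preprocessing model loads the tokenizer with Hugging Face transformers from the same /opt/llms/llama-7b directory:

from transformers import AutoTokenizer

# Load the tokenizer the preprocessing model points at and inspect its
# special tokens; the original llama tokenizer typically defines no pad token.
tokenizer = AutoTokenizer.from_pretrained("/opt/llms/llama-7b")
print("pad_token:", tokenizer.pad_token, "pad_token_id:", tokenizer.pad_token_id)
print("eos_token:", tokenizer.eos_token, "eos_token_id:", tokenizer.eos_token_id)
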
Burning-XX commented 6 months ago

@ArthurZucker

byshiue commented 6 months ago

In our checkpoint, the tokenizer contains the pad_id. If your checkpoint does not have a pad_id, you can try replacing the pad_id with the eos_id.
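
A minimal sketch of that fallback, assuming a Hugging Face tokenizer that defines an eos token (the exact line in preprocessing/1/model.py may look different):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/opt/llms/llama-7b")
if tokenizer.pad_token_id is None:
    # Assumption: reuse the eos token as the pad token, as suggested above,
    # so the preprocessing model no longer sees an undefined pad id.
    tokenizer.pad_token = tokenizer.eos_token
print("pad_id:", tokenizer.pad_token_id, "eos_id:", tokenizer.eos_token_id)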