triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Unable to initialize shared memory key 'triton_python_backend_shm_region_2' #562

Closed: zhangyu68 closed this issue 1 month ago

zhangyu68 commented 1 month ago

System Info

GPU: A100 80GB

accelerate 0.31.0 aiohttp 3.9.5 aiosignal 1.3.1 annotated-types 0.7.0 async-timeout 4.0.3 attrs 23.2.0 blinker 1.4 Brotli 1.1.0 build 1.2.1 certifi 2024.6.2 charset-normalizer 3.3.2 cloudpickle 3.0.0 colored 2.2.4 coloredlogs 15.0.1 cryptography 3.4.8 cuda-python 12.5.0 datasets 2.20.0 dbus-python 1.2.18 diffusers 0.29.0 dill 0.3.8 distro 1.7.0 einops 0.8.0 evaluate 0.4.2 filelock 3.14.0 fire 0.6.0 frozenlist 1.4.1 fsspec 2024.5.0 gevent 24.2.1 geventhttpclient 2.0.2 greenlet 3.0.3 grpcio 1.64.1 h5py 3.10.0 httplib2 0.20.2 huggingface-hub 0.23.3 humanfriendly 10.0 idna 3.7 importlib-metadata 4.6.4 janus 1.0.0 jeepney 0.7.1 Jinja2 3.1.4 keyring 23.5.0 lark 1.1.9 launchpadlib 1.10.16 lazr.restfulclient 0.14.4 lazr.uri 1.0.6 markdown-it-py 3.0.0 MarkupSafe 2.1.5 mdurl 0.1.2 more-itertools 8.10.0 mpi4py 3.1.5 mpmath 1.3.0 multidict 6.0.5 multiprocess 0.70.16 networkx 3.3 ninja 1.11.1.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-modelopt 0.11.2 nvidia-nccl-cu12 2.19.3 nvidia-nvjitlink-cu12 12.5.40 nvidia-nvtx-cu12 12.1.105 oauthlib 3.2.0 onnx 1.16.1 optimum 1.20.0 packaging 24.1 pandas 2.2.2 pillow 10.3.0 pip 24.0 polygraphy 0.49.9 protobuf 4.25.3 psutil 5.9.8 PuLP 2.8.0 pyarrow 16.1.0 pyarrow-hotfix 0.6 pydantic 2.7.4 pydantic_core 2.18.4 Pygments 2.18.0 PyGObject 3.42.1 PyJWT 2.3.0 pynvml 11.4.1 pyparsing 2.4.7 pyproject_hooks 1.1.0 python-apt 2.4.0+ubuntu3 python-dateutil 2.9.0.post0 python-rapidjson 1.17 pytz 2024.1 PyYAML 6.0.1 regex 2024.5.15 requests 2.32.3 rich 13.7.1 safetensors 0.4.3 scipy 1.13.1 SecretStorage 3.3.1 sentencepiece 0.2.0 setuptools 69.2.0 six 1.16.0 StrEnum 0.4.15 sympy 1.12.1 tabulate 0.9.0 tensorrt 10.0.1 tensorrt-llm 0.10.0 termcolor 2.4.0 tiktoken 0.7.0 tokenizers 0.19.1 tomli 2.0.1 torch 2.2.2 tqdm 4.66.4 transformers 4.40.2 transformers-stream-generator 0.0.5 triton 2.2.0 tritonclient 2.46.0 typing_extensions 4.12.2 tzdata 2024.1 urllib3 2.2.1 wadllib 1.3.6 wheel 0.43.0 xxhash 3.4.1 yarl 1.9.4 zipp 1.0.0 zope.event 5.0 zope.interface 6.4.post2

Who can help?

@Tabrizian

Information

Tasks

Reproduction

Quantization (checkpoint conversion):

```
python convert_checkpoint.py --model_dir /workspace/zhangy34\@xiaopeng.com/original_models/qwen1.5-7b/model-hf/ \
    --dtype float16 \
    --output_dir ./qwen-1.5-7b/w8a16/2-gpu/wq_a100 \
    --use_weight_only \
    --weight_only_precision int8 \
    --tp_size 2
```

Engine build:

```
trtllm-build --checkpoint_dir ./qwen-1.5-7b/w8a16/2-gpu/wq_a100 \
    --output_dir ./qwen-1.5-7b/w8a16/2-gpu/engine_a100 \
    --gemm_plugin float16 \
    --max_batch_size 8 \
    --max_input_len 4096 \
    --max_output_len 256 \
    --paged_kv_cache enable \
    --remove_input_padding enable \
    --context_fmha enable \
    --use_paged_context_fmha enable \
    --weight_only_precision int8
```

tensorrt_llm backend config (config.pbtxt):

name: "tensorrt_llm" backend: "tensorrtllm" max_batch_size: 16

model_transaction_policy { decoupled: False }

input [ { name: "input_ids" data_type: TYPE_INT32 dims: [ -1 ] allow_ragged_batch: true }, { name: "input_lengths" data_type: TYPE_INT32 dims: [ 1 ] reshape: { shape: [ ] } }, { name: "request_output_len" data_type: TYPE_UINT32 dims: [ 1 ] }, { name: "end_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "pad_id" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "stop_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true }, { name: "bad_words_list" data_type: TYPE_INT32 dims: [ 2, -1 ] optional: true allow_ragged_batch: true }, { name: "embedding_bias" data_type: TYPE_FP32 dims: [ -1 ] optional: true allow_ragged_batch: true }, { name: "beam_width" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "temperature" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "runtime_top_k" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "runtime_top_p" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "len_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "repetition_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "min_length" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "presence_penalty" data_type: TYPE_FP32 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "random_seed" data_type: TYPE_UINT64 dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "return_log_probs" data_type: TYPE_BOOL dims: [ 1 ] reshape: { shape: [ ] } optional: true }, { name: "stop" data_type: TYPE_BOOL dims: [ 1 ] optional: true }, { name: "streaming" data_type: TYPE_BOOL dims: [ 1 ] optional: true }, { name: "prompt_embedding_table" data_type: TYPE_FP16 dims: [ -1, -1 ] optional: true allow_ragged_batch: true }, { name: "prompt_vocab_size" data_type: TYPE_UINT32 dims: [ 1 ] reshape: { shape: [ ] } optional: true } ] output [ { name: "output_ids" data_type: TYPE_INT32 dims: [ -1, -1 ] }, { name: "sequence_length" data_type: TYPE_INT32 dims: [ -1 ] }, { name: "cum_log_probs" data_type: TYPE_FP32 dims: [ -1 ] }, { name: "output_log_probs" data_type: TYPE_FP32 dims: [ -1, -1 ] } ] instance_group [ { count: 1 kind : KIND_CPU } ] parameters: { key: "max_beam_width" value: { string_value: "1" } } parameters: { key: "FORCE_CPU_ONLY_INPUT_TENSORS" value: { string_value: "no" } } parameters: { key: "gpt_model_type" value: { string_value: "inflight_fused_batching" } } parameters: { key: "gpt_model_path" value: { string_value: "/workspace/zhangy34@xiaopeng.com/code/tensorrt-llm-versions/TensorRT-LLM-0.10.0/examples/qwen/qwen-1.5-7b/w8a16/2-gpu/engine_a100" } }

parameters: { key: "batch_scheduler_policy" value: { string_value: "max_utilization" } } parameters: { key: "kv_cache_free_gpu_mem_fraction" value: { string_value: "0.85" } } parameters: { key: "max_num_sequences" value: { string_value: "16" } } parameters: { key: "enable_trt_overlap" value: { string_value: "False" } } parameters: { key: "enable_kv_cache_reuse" value: { string_value: "True" } }

Then launch the server:

```
CUDA_VISIBLE_DEVICES=0,1 python3 ./scripts/launch_triton_server.py --world_size=2 \
    --model_repo=./all_models/inflight_batcher_llm \
    --backend tensorrt-llm
```
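The second failure below complains that the Python backend cannot initialize its shared-memory region, and Triton's own error message recommends the Docker `--shm-size` flag. A minimal sketch of launching the container with a larger `/dev/shm` is shown here; the image tag and mount path are placeholders, not taken from this report:

```
# Sketch only: start the Triton TRT-LLM container with an enlarged shared-memory
# segment; per the Triton error message, each Python backend model instance
# requires at least 1 MB of shared memory.
docker run --rm -it --gpus all --shm-size=2g \
    -v /path/to/tensorrtllm_backend:/workspace \
    nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3 bash

# Verify the shared-memory size inside the container:
df -h /dev/shm
```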

Expected behavior

The server starts successfully.

Actual behavior

The two error reports are inconsistent。 first error: I0809 10:17:45.529308 3219108 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7faf56000000' with size 268435456 I0809 10:17:45.529826 3219109 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f50d6000000' with size 268435456 I0809 10:17:45.552361 3219108 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0809 10:17:45.552389 3219108 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0809 10:17:45.552714 3219109 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0809 10:17:45.552742 3219109 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 W0809 10:17:46.174111 3219109 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:17:46.174168 3219109 model_lifecycle.cc:469] loading: preprocessing:1 W0809 10:17:46.175100 3219108 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:17:46.175132 3219108 model_lifecycle.cc:469] loading: preprocessing:1 W0809 10:17:46.175468 3219109 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:17:46.175489 3219109 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0809 10:17:46.176715 3219108 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:17:46.176743 3219108 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0809 10:17:46.176772 3219109 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:17:46.176791 3219109 model_lifecycle.cc:469] loading: postprocessing:1 W0809 10:17:46.178129 3219108 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:17:46.178152 3219108 model_lifecycle.cc:469] loading: postprocessing:1 I0809 10:17:46.257824 3219108 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0) I0809 10:17:46.258378 3219109 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0) I0809 10:17:46.355488 3219108 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0) I0809 10:17:46.356304 3219109 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0) [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] enable_trt_overlap is deprecated and will be ignored [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0 [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. 
max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] enable_trt_overlap is deprecated and will be ignored [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0 [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 [TensorRT-LLM][INFO] Engine version 0.10.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found [TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set. [TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. 
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Engine version 0.10.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found [TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set. [TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 terminate called after throwing an instance of 'boost::interprocess::lock_exception' what(): boost::interprocess::lock_exception [cnwla-a800-p01142:3219109] Process received signal [cnwla-a800-p01142:3219109] Signal: Aborted (6) [cnwla-a800-p01142:3219109] Signal code: (-6) [cnwla-a800-p01142:3219109] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f512b71e520] [cnwla-a800-p01142:3219109] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f512b7729fc] [cnwla-a800-p01142:3219109] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f512b71e476] [cnwla-a800-p01142:3219109] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f512b7047f3] [cnwla-a800-p01142:3219109] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f512b9a7b9e] [cnwla-a800-p01142:3219109] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f512b9b320c] [cnwla-a800-p01142:3219109] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f512b9b21e9] [cnwla-a800-p01142:3219109] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f512b9b2959] [cnwla-a800-p01142:3219109] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f512daf6884] [cnwla-a800-p01142:3219109] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f512daf6f41] [cnwla-a800-p01142:3219109] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f512b9b34cb] [cnwla-a800-p01142:3219109] [11] /opt/tritonserver/backends/python/libtriton_python.so(+0x87bfa)[0x7f511821cbfa] [cnwla-a800-p01142:3219109] [12] /opt/tritonserver/backends/python/libtriton_python.so(+0x7800c)[0x7f511820d00c] [cnwla-a800-p01142:3219109] [13] /opt/tritonserver/backends/python/libtriton_python.so(+0x7ed06)[0x7f5118213d06] [cnwla-a800-p01142:3219109] [14] /opt/tritonserver/backends/python/libtriton_python.so(+0x9930a)[0x7f511822e30a] [cnwla-a800-p01142:3219109] [15] /opt/tritonserver/backends/python/libtriton_python.so(+0x853b3)[0x7f511821a3b3] [cnwla-a800-p01142:3219109] [16] /opt/tritonserver/backends/python/libtriton_python.so(+0x3c4c4)[0x7f51181d14c4] [cnwla-a800-p01142:3219109] [17] /opt/tritonserver/backends/python/libtriton_python.so(TRITONBACKEND_ModelInstanceInitialize+0x4ec)[0x7f51181d1d0c] [cnwla-a800-p01142:3219109] [18] 
/opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af096)[0x7f512c124096] [cnwla-a800-p01142:3219109] [19] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02d6)[0x7f512c1252d6] [cnwla-a800-p01142:3219109] [20] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928e5)[0x7f512c1078e5] [cnwla-a800-p01142:3219109] [21] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f26)[0x7f512c107f26] [cnwla-a800-p01142:3219109] [22] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f81d)[0x7f512c11481d] [cnwla-a800-p01142:3219109] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7f512b775ee8] [cnwla-a800-p01142:3219109] [24] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18afee)[0x7f512c0fffee] [cnwla-a800-p01142:3219109] [25] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f512b9e1253] [cnwla-a800-p01142:3219109] [26] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f512b770ac3] [cnwla-a800-p01142:3219109] [27] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f512b801a04] [cnwla-a800-p01142:3219109] End of error message

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

I0809 10:17:47.426345 3220531 pb_stub.cc:2025] Non-graceful termination detected. terminate called after throwing an instance of 'boost::interprocess::lock_exception' what(): boost::interprocess::lock_exception

mpirun noticed that process rank 1 with PID 0 on node cnwla-a800-p01142 exited on signal 6 (Aborted).

second error: I0809 10:19:07.015127 3254750 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f0efc000000' with size 268435456 I0809 10:19:07.016531 3254749 pinned_memory_manager.cc:275] Pinned memory pool is created at '0x7f96aa000000' with size 268435456 I0809 10:19:07.022014 3254750 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0809 10:19:07.022045 3254750 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 I0809 10:19:07.022912 3254749 cuda_memory_manager.cc:107] CUDA memory pool is created on device 0 with size 67108864 I0809 10:19:07.022933 3254749 cuda_memory_manager.cc:107] CUDA memory pool is created on device 1 with size 67108864 W0809 10:19:07.415077 3254749 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:19:07.415121 3254749 model_lifecycle.cc:469] loading: preprocessing:1 W0809 10:19:07.415367 3254750 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:19:07.415410 3254750 model_lifecycle.cc:469] loading: preprocessing:1 W0809 10:19:07.416502 3254749 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:19:07.416534 3254749 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0809 10:19:07.416788 3254750 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:19:07.416812 3254750 model_lifecycle.cc:469] loading: tensorrt_llm:1 W0809 10:19:07.418058 3254749 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:19:07.418084 3254749 model_lifecycle.cc:469] loading: postprocessing:1 W0809 10:19:07.418208 3254750 model_lifecycle.cc:108] ignore version directory '.ipynb_checkpoints' which fails to convert to integral number I0809 10:19:07.418226 3254750 model_lifecycle.cc:469] loading: postprocessing:1 I0809 10:19:07.495346 3254749 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0) I0809 10:19:07.495348 3254750 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: preprocessing_0_0 (CPU device 0) E0809 10:19:07.495844 3254750 backend_model.cc:691] ERROR: Failed to create instance: Unable to initialize shared memory key 'triton_python_backend_shm_region_1' to requested size (1048576 bytes). If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Each Python backend model instance requires at least 1 MB of shared memory. Error: File exists E0809 10:19:07.495899 3254750 model_lifecycle.cc:638] failed to load 'preprocessing' version 1: Internal: Unable to initialize shared memory key 'triton_python_backend_shm_region_1' to requested size (1048576 bytes). If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Each Python backend model instance requires at least 1 MB of shared memory. 
Error: File exists I0809 10:19:07.495915 3254750 model_lifecycle.cc:773] failed to load 'preprocessing' I0809 10:19:07.556648 3254750 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0) [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] enable_trt_overlap is deprecated and will be ignored [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0 [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. [TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 I0809 10:19:07.560612 3254749 python_be.cc:2391] TRITONBACKEND_ModelInstanceInitialize: postprocessing_0_0 (CPU device 0) [TensorRT-LLM][WARNING] gpu_device_ids is not specified, will be automatically set [TensorRT-LLM][WARNING] iter_stats_max_iterations is not specified, will use default value of 1000 [TensorRT-LLM][WARNING] request_stats_max_iterations is not specified, will use default value of 0 [TensorRT-LLM][WARNING] enable_trt_overlap is deprecated and will be ignored [TensorRT-LLM][WARNING] normalize_log_probs is not specified, will be set to true [TensorRT-LLM][WARNING] max_tokens_in_paged_kv_cache is not specified, will use default value [TensorRT-LLM][WARNING] kv_cache_host_memory_bytes not set, defaulting to 0 [TensorRT-LLM][WARNING] kv_cache_onboard_blocks not set, defaulting to true [TensorRT-LLM][WARNING] max_attention_window_size is not specified, will use default value (i.e. max_sequence_length) [TensorRT-LLM][WARNING] sink_token_length is not specified, will use default value [TensorRT-LLM][WARNING] enable_chunked_context is not specified, will be set to false. 
[TensorRT-LLM][WARNING] lora_cache_max_adapter_size not set, defaulting to 64 [TensorRT-LLM][WARNING] lora_cache_optimal_adapter_size not set, defaulting to 8 [TensorRT-LLM][WARNING] lora_cache_gpu_memory_fraction not set, defaulting to 0.05 [TensorRT-LLM][WARNING] lora_cache_host_memory_bytes not set, defaulting to 1GB [TensorRT-LLM][WARNING] decoding_mode parameter is invalid or not specified(must be one of the {top_k, top_p, top_k_top_p, beam_search}).Using default: top_k_top_p if max_beam_width == 1, beam_search otherwise [TensorRT-LLM][WARNING] medusa_choices parameter is not specified. Will be using default mc_sim_7b_63 choices instead [TensorRT-LLM][WARNING] gpu_weights_percent parameter is not specified, will use default value of 1.0 [TensorRT-LLM][INFO] Engine version 0.10.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found [TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set. [TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 [TensorRT-LLM][INFO] Engine version 0.10.0 found in the config file, assuming engine(s) built by new builder API. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found [TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set. [TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json: [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found [TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null [TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found [TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set. [TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found [TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set. [TensorRT-LLM][INFO] Initializing MPI with thread mode 3 I0809 10:19:07.610370 3255234 pb_stub.cc:191] Unable to initialize shared memory key 'triton_python_backend_shm_region_1' to requested size (1048576 bytes). If you are running Triton inside docker, use '--shm-size' flag to control the shared memory region size. Each Python backend model instance requires at least 1 MB of shared memory. Error: No such file or directory

free(): double free detected in tcache 2 terminate called after throwing an instance of 'boost::interprocess::lock_exception' what(): boost::interprocess::lock_exception [cnwla-a800-p01142:3254749] Process received signal [cnwla-a800-p01142:3254749] Signal: Aborted (6) [cnwla-a800-p01142:3254749] Signal code: (-6) [cnwla-a800-p01142:3254749] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7f970071e520] [cnwla-a800-p01142:3254749] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7f97007729fc] [cnwla-a800-p01142:3254749] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7f970071e476] [cnwla-a800-p01142:3254749] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7f97007047f3] [cnwla-a800-p01142:3254749] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7f97009a7b9e] [cnwla-a800-p01142:3254749] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7f97009b320c] [cnwla-a800-p01142:3254749] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7f97009b21e9] [cnwla-a800-p01142:3254749] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7f97009b2959] [cnwla-a800-p01142:3254749] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7f9702ac8884] [cnwla-a800-p01142:3254749] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_RaiseException+0x311)[0x7f9702ac8f41] [cnwla-a800-p01142:3254749] [10] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__cxa_throw+0x3b)[0x7f97009b34cb] [cnwla-a800-p01142:3254749] [11] /opt/tritonserver/backends/python/libtriton_python.so(+0x87bfa)[0x7f96f8344bfa] [cnwla-a800-p01142:3254749] [12] /opt/tritonserver/backends/python/libtriton_python.so(+0x7800c)[0x7f96f833500c] [cnwla-a800-p01142:3254749] [13] /opt/tritonserver/backends/python/libtriton_python.so(+0x7ed06)[0x7f96f833bd06] [cnwla-a800-p01142:3254749] [14] /opt/tritonserver/backends/python/libtriton_python.so(+0x9930a)[0x7f96f835630a] [cnwla-a800-p01142:3254749] [15] /opt/tritonserver/backends/python/libtriton_python.so(+0x853b3)[0x7f96f83423b3] [cnwla-a800-p01142:3254749] [16] /opt/tritonserver/backends/python/libtriton_python.so(+0x3c4c4)[0x7f96f82f94c4] [cnwla-a800-p01142:3254749] [17] /opt/tritonserver/backends/python/libtriton_python.so(TRITONBACKEND_ModelInstanceInitialize+0x4ec)[0x7f96f82f9d0c] [cnwla-a800-p01142:3254749] [18] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1af096)[0x7f9701124096] [cnwla-a800-p01142:3254749] [19] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1b02d6)[0x7f97011252d6] [cnwla-a800-p01142:3254749] [20] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x1928e5)[0x7f97011078e5] [cnwla-a800-p01142:3254749] [21] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x192f26)[0x7f9701107f26] [cnwla-a800-p01142:3254749] [22] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x19f81d)[0x7f970111481d] [cnwla-a800-p01142:3254749] [23] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x99ee8)[0x7f9700775ee8] [cnwla-a800-p01142:3254749] [24] /opt/tritonserver/bin/../lib/libtritonserver.so(+0x18afee)[0x7f97010fffee] [cnwla-a800-p01142:3254749] [25] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xdc253)[0x7f97009e1253] [cnwla-a800-p01142:3254749] [26] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3)[0x7f9700770ac3] [cnwla-a800-p01142:3254749] [27] /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44)[0x7f9700801a04] [cnwla-a800-p01142:3254749] End of error message

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

Additional notes

If I use this example to generate the configuration file, the server starts successfully, but when I measure latency, the latency is abnormal. Incidentally, I also tried running the qwen-14b model on version 0.9.0, and it worked fine.
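For context, the backend's configs are usually generated by filling the templates under all_models/inflight_batcher_llm with the repo's tools/fill_template.py helper. A rough sketch follows, assuming that is the example being referred to; the key names and values are illustrative and may differ between releases:

```
# Sketch only: populate the tensorrt_llm config template in place (-i).
# engine_dir and the other values are placeholders, not this report's settings.
python3 tools/fill_template.py -i all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt \
    "triton_max_batch_size:16,decoupled_mode:False,max_beam_width:1,engine_dir:/path/to/engine_a100,batching_strategy:inflight_fused_batching,kv_cache_free_gpu_mem_fraction:0.85"
```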

zhangyu68 commented 1 month ago

Just remove "max_tokens_in_paged_kv_cache" from tensorrt_llm/config.pbtxt and the v0.10.0 config runs as fast as the v0.6.0 config.
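The parameter in question is a stanza of the same shape as the other parameters in the config above; the value shown below is only an illustrative placeholder, not taken from this setup. Deleting the whole block lets the backend fall back to its default KV-cache sizing, as the startup warning ("max_tokens_in_paged_kv_cache is not specified, will use default value") indicates.

```
# Illustrative placeholder: the block to delete from tensorrt_llm/config.pbtxt.
parameters: {
  key: "max_tokens_in_paged_kv_cache"
  value: { string_value: "40960" }
}
```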

https://github.com/triton-inference-server/tensorrtllm_backend/issues/453