triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend

missing nv_trt_llm_request_metrics from python backend #479

Open Hao-YunDeng opened 1 month ago

Hao-YunDeng commented 1 month ago

System Info

GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

versions:

https://github.com/NVIDIA/TensorRT-LLM.git (https://github.com/NVIDIA/TensorRT-LLM/commit/bf0a5afc92f4b2b3191e9e55073953c1f600cf2d)
https://github.com/triton-inference-server/tensorrtllm_backend.git (https://github.com/triton-inference-server/tensorrtllm_backend/commit/ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)

Who can help?

@ncomly-nvidia

Information

Tasks

Reproduction

step 1:

python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16

step 2:

trtllm-build --checkpoint_dir zephyr-7b-beta-converted --output_dir zephyr-7b-beta-trt-engine --remove_input_padding enable --context_fmha enable --gpt_attention_plugin float16 --gemm_plugin float16 --paged_kv_cache enable --max_num_tokens 65536 --max_batch_size 32 --max_input_len 16384 --strongly_typed

step 3: tensorrtllm_backend parameters and server launch:

MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i triton_model_repo/preprocessing/config.pbtxt tokenizer_dir:zephyr-7b-beta/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_backend:python,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt
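As a sanity check (not part of the original repro; a minimal sketch assuming the model name tensorrt_llm from the repo layout above and the HTTP port 8081 passed to launch_triton_server.py), the backend the model actually loaded with can be read back from Triton's model-config endpoint:

import json
import urllib.request

# Query Triton's HTTP API for the loaded tensorrt_llm model configuration
url = "http://localhost:8081/v2/models/tensorrt_llm/config"
config = json.loads(urllib.request.urlopen(url).read().decode())

# With triton_backend:python, the "backend" field should read "python"
print("backend:", config.get("backend"))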

Expected behavior

The nv_trt_llm_request_metrics metric family is expected to be reported when the tensorrt_llm model runs with the Python backend (triton_backend:python).
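A minimal way to check for the metric family (sketch only, assuming the default Triton metrics port 8002; adjust if launch_triton_server.py is given a different metrics port):

import urllib.request

# Scrape the Prometheus-format metrics text exposed by Triton
metrics = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()

# Under the expected behavior, nv_trt_llm_request_metrics lines should show up here
found = [line for line in metrics.splitlines() if line.startswith("nv_trt_llm_request_metrics")]
print("\n".join(found) if found else "nv_trt_llm_request_metrics not reported")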

actual behavior

nv_trt_llm_request_metrics is NOT reported when the tensorrt_llm model runs with the Python backend.

additional notes

None