Hao-YunDeng opened this issue 1 month ago
System Info
GPU: NVIDIA A100
Driver Version: 545.23.08
CUDA: 12.3

Versions:
TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM (commit bf0a5afc92f4b2b3191e9e55073953c1f600cf2d)
tensorrtllm_backend: https://github.com/triton-inference-server/tensorrtllm_backend (commit ae52bce3ed8ecea468a16483e0dacd3d156ae4fe)
Who can help?
@ncomly-nvidia
Information

Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Step 1: convert the checkpoint:
python3 ./tensorrtllm_backend/tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir zephyr-7b-beta --output_dir zephyr-7b-beta-converted --dtype float16
Step 2: build the TensorRT engine:
trtllm-build --checkpoint_dir zephyr-7b-beta-converted --output_dir zephyr-7b-beta-trt-engine --remove_input_padding enable --context_fmha enable --gpt_attention_plugin float16 --gemm_plugin float16 --paged_kv_cache enable --max_num_tokens 65536 --max_batch_size 32 --max_input_len 16384 --strongly_typed
Step 3: fill the tensorrtllm_backend config templates and launch the Triton server (note: the `triton_backend:python` substitution must be joined to the rest of the list with a comma, not a space):

MODEL_PATH=zephyr-7b-beta
MODEL_PIPELINE_NAME=triton_model_repo
MAX_BATCH_SIZE=32
ENGINE_PATH=zephyr-7b-beta-trt-engine
MAX_ATTENTION_WINDOW_SIZE=4096
KV_CACHE_FREE_GPU_MEM_FRACTION=0.5
batch_scheduler_policy=guaranteed_no_evict

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/preprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH}/,triton_max_batch_size:${MAX_BATCH_SIZE},preprocessing_instance_count:1

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/postprocessing/config.pbtxt tokenizer_dir:${MODEL_PATH},triton_max_batch_size:${MAX_BATCH_SIZE},postprocessing_instance_count:1

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/ensemble/config.pbtxt triton_max_batch_size:${MAX_BATCH_SIZE}

python3 tools/fill_template.py -i ${MODEL_PIPELINE_NAME}/tensorrt_llm/config.pbtxt triton_backend:python,triton_max_batch_size:${MAX_BATCH_SIZE},decoupled_mode:True,max_beam_width:1,engine_dir:${ENGINE_PATH},max_attention_window_size:${MAX_ATTENTION_WINDOW_SIZE},kv_cache_free_gpu_mem_fraction:${KV_CACHE_FREE_GPU_MEM_FRACTION},exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,batch_scheduler_policy:${batch_scheduler_policy}

python3 scripts/launch_triton_server.py --world_size=1 --model_repo=/code/tensorrtllm_backend/${MODEL_PIPELINE_NAME} --http_port=8081 --log --log-file ${MODEL_PIPELINE_NAME}_triton_log.txt
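To see which metric families the launched server actually exposes, one can scrape its Prometheus endpoint and filter for the TensorRT-LLM prefix. A minimal sketch follows; the port 8002 in `fetch_metrics` is Triton's default metrics port and an assumption here (`launch_triton_server.py` above does not override it), and both helper names are hypothetical:

```python
import urllib.request


def trt_llm_metric_families(metrics_text):
    """Return the set of metric-family names in a Prometheus text
    exposition that start with the nv_trt_llm prefix."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Skip blanks and HELP/TYPE comment lines.
            continue
        # A sample line is: name{labels} value  (labels are optional).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name.startswith("nv_trt_llm"):
            names.add(name)
    return names


def fetch_metrics(url="http://localhost:8002/metrics"):
    """Fetch the raw Prometheus text from Triton's metrics endpoint.
    Assumption: 8002 is the default metrics port."""
    return urllib.request.urlopen(url).read().decode()
```

Running `sorted(trt_llm_metric_families(fetch_metrics()))` against the server launched in step 3 should list `nv_trt_llm_request_metrics` if the family is registered.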
Expected behavior
The nv_trt_llm_request_metrics metric family is expected to be exposed on the metrics endpoint when the tensorrt_llm model runs with the Python backend (triton_backend:python).
Actual behavior
The nv_trt_llm_request_metrics family does NOT appear on the metrics endpoint when using the Python backend.
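When reproducing this, it helps to rule out a race with server startup before concluding the family is genuinely absent, since the metric may only be registered once the backend has finished loading. A hedged sketch of a polling check; the URL default (Triton's standard metrics port 8002), the timeout values, and the helper name are all assumptions:

```python
import time
import urllib.error
import urllib.request


def wait_for_metric(name, url="http://localhost:8002/metrics",
                    timeout_s=60.0, poll_s=2.0):
    """Poll the Prometheus endpoint until a metric family whose name
    starts with `name` appears, or the timeout elapses.
    Returns True if it appeared, False otherwise."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            text = urllib.request.urlopen(url, timeout=5).read().decode()
        except (urllib.error.URLError, OSError):
            text = ""  # server not reachable yet; keep polling
        if any(line.startswith(name) for line in text.splitlines()):
            return True
        time.sleep(poll_s)
    return False
```

If `wait_for_metric("nv_trt_llm_request_metrics")` still returns False well after the server reports it is ready, the family really is not being exported, which matches the behavior reported above.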
Additional notes
None