triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

When the request count is large, the Triton server has a very high TTFT. #7316

Open Godlovecui opened 4 months ago

Godlovecui commented 4 months ago

Description I ran a benchmark of Meta-Llama-3-8B-Instruct on 8x RTX 4090. With 16 requests, an input sequence length of 1024, and an output sequence length of 1024, the TTFT (time to first token) is 0.403s, which is acceptable. However, with 1024 requests, the TTFT is 379.089s. Is this normal?
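One rough way to sanity-check a number this large is a queueing back-of-the-envelope estimate. The sketch below is only an illustration under assumptions that are not confirmed by the report: it assumes requests are admitted 64 at a time (matching the --max-concurrent-requests flag in the reproduction steps below) and uses a hypothetical per-request latency. With 1024 requests and 64-way concurrency, later requests wait in the queue before their first token is produced, which inflates the measured TTFT.

```python
# Back-of-the-envelope queueing estimate (illustrative only; the per-request
# latency below is hypothetical, not a measured value from this report).
num_requests = 1024      # --num-benchmark-requests
max_concurrent = 64      # --max-concurrent-requests (assumed admission limit)
per_request_s = 25.0     # assumed end-to-end latency of a single request

waves = num_requests / max_concurrent              # requests are admitted in "waves"
worst_queue_wait_s = (waves - 1) * per_request_s   # wait before the last wave even starts
print(f"waves = {waves:.0f}, worst-case queueing before first token ≈ {worst_queue_wait_s:.0f}s")
# waves = 16, worst-case queueing before first token ≈ 375s
```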

Triton Information TensorRT-LLM: v0.9.0, tensorrtllm_backend: v0.9.0

Are you using the Triton container or did you build it yourself? Yes, the Triton container (nvcr.io/nvidia/tritonserver:24.02).

To Reproduce

  1. Run the Docker container: docker run -d --gpus all --privileged --ipc=host --net=host --ulimit stack=67108864 --ulimit memlock=-1 -e HTTPS_PROXY= -e HTTP_PROXY= -e ALL_PROXY= -e https_proxy= -e http_proxy= -e all_proxy= -e CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 --name benchmark_trtllm -v /models/Meta-Llama-3-8B-Instruct:/models/Meta-Llama-3-8B-Instruct:ro -v /path/to/workspaces/tensorrtllm_backend/engine_outputs_llama3_8B_0524:/path/to/workspaces/tensorrtllm_backend/engine_outputs_llama3_8B_0524:ro -v /path/to/workspaces/tensorrtllm_backend/triton_model_repo:/path/to/workspaces/tensorrtllm_backend/triton_model_repo -w /workspace nvcr.io/nvidia/tritonserver:24.02 /path/to/workspaces/tensorrtllm_backend/triton_model_repo/launch_triton_server.py --world_size=8 --model_repo=/path/to/workspaces/tensorrtllm_backend/triton_model_repo
  2. Run the benchmark (the benchmark_serving.py script can be found at https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py; a minimal sketch of how TTFT is measured against the streaming endpoint follows this list): python /path/to/workspaces/mlops/test/benchmark_serving.py --backend trtllm --model /models/Meta-Llama-3-8B-Instruct --tokenizer /models/Meta-Llama-3-8B-Instruct --dataset /models/ShareGPT_Vicuna_unfiltered/ShareGPT_V3_unfiltered_cleaned_split.json --port 8000 --num-warmup-requests 32 --num-benchmark-requests 1024 --max-concurrent-requests 64 --stream --pad-requests --warn-dismatch-output-len --gpus 8 --sampling-policy fixed --fixed_prompt_len 1024 --fixed_output_len 1024 --endpoint v2/models/ensemble/generate_stream
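For reference, here is a minimal sketch of how TTFT is typically measured against the generate_stream endpoint: the time from sending the request to receiving the first streamed chunk. The payload field names (text_input, max_tokens, stream) are assumptions taken from the default tensorrtllm_backend ensemble configuration and may differ in your model repository; the host and port are also assumed.

```python
# Minimal TTFT measurement sketch for the streaming endpoint (assumptions noted above).
import time

import requests

URL = "http://localhost:8000/v2/models/ensemble/generate_stream"  # assumed host/port
payload = {
    "text_input": "Hello, world",  # assumed field name from the default ensemble config
    "max_tokens": 1024,
    "stream": True,
}

start = time.perf_counter()
ttft = None
with requests.post(URL, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue  # skip SSE keep-alive blank lines
        if ttft is None:
            ttft = time.perf_counter() - start  # time to the first streamed chunk
print(f"TTFT ≈ {ttft:.3f}s")
```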

Expected behavior The TTFT should be lower.

statiraju commented 3 months ago

[6863] created to track

pansicheng commented 1 month ago

It seems that the TPOT (time per output token) is nearly 0. Does that mean the generate_stream API returns everything at once instead of streaming?
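One way to test this hypothesis is to record the arrival time of every SSE chunk and check whether tokens arrive incrementally (TPOT > 0) or in a single burst at the end (TPOT ≈ 0). The helper below is a sketch, not part of the benchmark script: it assumes `chunk_times` is a list of time.perf_counter() timestamps collected in a loop like the one sketched earlier in this thread, `start` is the timestamp at which the request was sent, and each chunk carries roughly one output token.

```python
# Summarize a streamed response from per-chunk arrival timestamps (assumptions above).
def summarize_stream(start: float, chunk_times: list[float]) -> None:
    if not chunk_times:
        print("no chunks received")
        return
    ttft = chunk_times[0] - start
    if len(chunk_times) > 1:
        # Mean inter-chunk gap approximates TPOT if each chunk holds one token.
        tpot = (chunk_times[-1] - chunk_times[0]) / (len(chunk_times) - 1)
    else:
        tpot = 0.0
    print(f"TTFT ≈ {ttft:.3f}s, TPOT ≈ {tpot * 1000:.1f}ms over {len(chunk_times)} chunks")
    if len(chunk_times) > 1 and tpot < 1e-3:
        print("chunks arrived nearly simultaneously: the response may not be streamed token by token")
```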