Open hliuca opened 4 days ago
This is definitely not expected and very unusual. Did you also see this on NV GPUs? Do you have any detailed profiling results?
Hi @merrymercy I will run on NV GPUs and update here. Thanks for looking into this.
H100 TP1,
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 18.49
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256849
Request throughput (req/s): 216.33
Input token throughput (tok/s): 13921.61
Output token throughput (tok/s): 13951.41
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 14500.46
Median E2E Latency (ms): 15651.88
---------------Time to First Token----------------
Mean TTFT (ms): 4796.60
Median TTFT (ms): 4020.55
P99 TTFT (ms): 8216.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 243.34
Median TPOT (ms): 161.56
P99 TPOT (ms): 1695.36
---------------Inter-token Latency----------------
Mean ITL (ms): 152.84
Median ITL (ms): 85.30
P99 ITL (ms): 2405.32
```
H100 TP8,

```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 17.16
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256818
Request throughput (req/s): 233.06
Input token throughput (tok/s): 14997.93
Output token throughput (tok/s): 15030.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 13159.01
Median E2E Latency (ms): 14298.66
---------------Time to First Token----------------
Mean TTFT (ms): 4301.37
Median TTFT (ms): 3996.11
P99 TTFT (ms): 5583.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 197.62
Median TPOT (ms): 152.06
P99 TPOT (ms): 941.38
---------------Inter-token Latency----------------
Mean ITL (ms): 139.51
Median ITL (ms): 89.13
P99 ITL (ms): 1671.31
```
H100 has much better perf for TP8... maybe something is wrong with ROCm...
H100, TP8: max_total_num_tokens=3745506, max_prefill_tokens=16384, max_running_requests=4097, context_len=8192
MI300X, TP8: max_total_num_tokens=9999714, max_prefill_tokens=16384, max_running_requests=4097, context_len=8192
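For context on why max_total_num_tokens differs so much: it is essentially the KV-cache token pool that fits in the HBM left after weights (roughly 80 GB per H100 vs 192 GB per MI300X). A back-of-the-envelope sketch, assuming Llama-3-8B's published config (32 layers, 8 KV heads, head dim 128) and a 2-byte KV cache; the exact figure sglang derives also depends on the reserved memory fraction:

```python
# Back-of-the-envelope KV-cache budget per token for Llama-3-8B at TP8.
# Config values from the published Llama-3-8B config; dtype assumed 2-byte (bf16/fp16).
num_layers = 32
num_kv_heads = 8       # grouped-query attention
head_dim = 128
bytes_per_elem = 2
tp = 8

# K and V for every layer; KV heads are sharded across the TP group.
kv_bytes_per_token_per_gpu = 2 * num_layers * (num_kv_heads // tp) * head_dim * bytes_per_elem
print(kv_bytes_per_token_per_gpu, "bytes/token/GPU")  # 16384 bytes = 16 KiB

for name, max_tokens in [("H100", 3_745_506), ("MI300X", 9_999_714)]:
    gib = max_tokens * kv_bytes_per_token_per_gpu / 2**30
    print(f"{name}: ~{gib:.0f} GiB of KV cache per GPU")  # ~57 GiB vs ~153 GiB
```

So the MI300X server starts with a roughly 2.7x larger token pool, which under request_rate=inf translates into a much larger running batch; that is consistent with the runs below, where capping --max-total-tokens to the H100 value improves throughput.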
MI300X: if parallelism is limited (capping --max-total-tokens, which bounds the running batch), perf is better.
python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 8 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule --max-total-tokens 3745506
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128
WARNING: It is recommended to use the `Chat` or `Instruct` model for benchmarking, because when the tokenizer counts the output tokens, gibberish output might be counted incorrectly.
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='NousResearch/Meta-Llama-3-8B', tokenizer=None, num_prompts=4000, sharegpt_output_len=None, random_input_len=128, random_output_len=128, random_range_ratio=0.0, request_rate=inf, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, extra_request_body=None)
```
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|███████████████████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:23<00:00, 171.78it/s]

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 23.29
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256791
Request throughput (req/s): 171.78
Input token throughput (tok/s): 11054.62
Output token throughput (tok/s): 11078.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 19869.75
Median E2E Latency (ms): 20904.41
---------------Time to First Token----------------
Mean TTFT (ms): 7656.01
Median TTFT (ms): 7494.83
P99 TTFT (ms): 13581.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 371.22
Median TPOT (ms): 200.66
P99 TPOT (ms): 3732.02
---------------Inter-token Latency----------------
Mean ITL (ms): 376.02
Median ITL (ms): 131.98
P99 ITL (ms): 7913.00
```
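On "Total generated tokens (retokenized)" being slightly below "Total generated tokens": that is what the Chat/Instruct warning above is about. The benchmark appears to re-tokenize the returned text to count output tokens, and decode-then-re-encode is not guaranteed to be length-preserving, especially for the gibberish a base model produces. A minimal illustration with hypothetical token ids, just to show the round trip can change the count:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3-8B")

# Hypothetical ids standing in for tokens a server streamed back.
generated_ids = [128000, 9906, 4435, 220, 220, 1917]

text = tok.decode(generated_ids, skip_special_tokens=True)
retokenized = tok.encode(text, add_special_tokens=False)

# The two counts may differ: special tokens are dropped, and adjacent pieces
# can merge into different tokens when the text is encoded again.
print(len(generated_ids), len(retokenized))
```

A drift of well under 1%, as in the tables here, is therefore expected and not the interesting part of the regression.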
python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 8 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule --max-total-tokens 5618259
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128
WARNING: It is recommended to use the `Chat` or `Instruct` model for benchmarking, because when the tokenizer counts the output tokens, gibberish output might be counted incorrectly.
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='NousResearch/Meta-Llama-3-8B', tokenizer=None, num_prompts=4000, sharegpt_output_len=None, random_input_len=128, random_output_len=128, random_range_ratio=0.0, request_rate=inf, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, extra_request_body=None)
```
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 31.11
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256832
Request throughput (req/s): 128.57
Input token throughput (tok/s): 8273.93
Output token throughput (tok/s): 8291.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 27610.12
Median E2E Latency (ms): 28874.21
---------------Time to First Token----------------
Mean TTFT (ms): 11439.83
Median TTFT (ms): 11814.12
P99 TTFT (ms): 19957.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 526.06
Median TPOT (ms): 264.76
P99 TPOT (ms): 5435.55
---------------Inter-token Latency----------------
Mean ITL (ms): 490.58
Median ITL (ms): 138.49
P99 ITL (ms): 11879.44
```
Do you have any findings or ideas to resolve the issue with the MI300? I'm happy to do a debugging session with you if you have any concrete findings or solutions.
@merrymercy @hliuca we will take a look too; we have previously observed that TP8 is comparatively slower.
Describe the bug
When I benchmark TP1, the throughput is great:

```
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 22.18
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256866
Request throughput (req/s): 180.31
Input token throughput (tok/s): 11603.04
Output token throughput (tok/s): 11627.87
```
However, if I test TP8, the performance is very poor; the scheduling dominates the running time:

```
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 100.06
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256833
Request throughput (req/s): 39.97
Input token throughput (tok/s): 2572.48
Output token throughput (tok/s): 2577.98
```
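To check from the client side whether host/scheduling overhead, rather than the GPU kernels, is what stretches the decode steps at TP8, one can time the gaps between streamed chunks of a single request while the server is under this load. A rough sketch, assuming the native /generate streaming endpoint on port 30000 (endpoint and field names taken from the sglang HTTP API; adjust if your version differs):

```python
import time
import requests

# Rough client-side inter-chunk latency probe against the sglang /generate endpoint.
url = "http://localhost:30000/generate"
payload = {
    "text": "The quick brown fox",
    "sampling_params": {"max_new_tokens": 128, "temperature": 0.0},
    "stream": True,
}

gaps, last = [], time.perf_counter()
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data:"):
            continue
        if line.strip() == b"data: [DONE]":
            break
        now = time.perf_counter()
        gaps.append(now - last)  # first entry includes TTFT
        last = now

if gaps:
    gaps.sort()
    p50 = gaps[len(gaps) // 2]
    p99 = gaps[min(int(len(gaps) * 0.99), len(gaps) - 1)]
    print(f"chunks: {len(gaps)}, median gap: {p50 * 1e3:.1f} ms, p99 gap: {p99 * 1e3:.1f} ms")
```

If the p99 gap blows up the way the P99 ITL does in the results above while GPU utilization stays low, that points at the scheduler/CPU path rather than the kernels.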
Reproduction
TP1
python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 1 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128
TP8
python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 8 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule
python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128
Environment
Latest sglang on ROCm 6.2 and MI300X.