sgl-project / sglang

SGLang is a fast serving framework for large language models and vision language models.
https://sgl-project.github.io/

TP8 scheduling overhead is very high for a small model, Llama 3 8B #1857

Open hliuca opened 4 days ago

hliuca commented 4 days ago

Describe the bug

When I benchmark TP1, the throughput is great.

Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 22.18
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256866
Request throughput (req/s): 180.31
Input token throughput (tok/s): 11603.04
Output token throughput (tok/s): 11627.87

However, when I test TP8, the performance is very poor because scheduling dominates the running time:

Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 100.06
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256833
Request throughput (req/s): 39.97
Input token throughput (tok/s): 2572.48
Output token throughput (tok/s): 2577.98

Reproduction

TP1:

python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 1 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128

TP8:

python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 8 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128

Environment

Latest sglang on ROCm 6.2 and MI300X.

merrymercy commented 4 days ago

This is definitely not expected and very unusual. Did you also see this on NV GPUs? Do you have any detailed profiling results?
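
One possible way to collect such data (a sketch only; py-spy is a third-party sampling profiler, not part of sglang, and the helper below is hypothetical) is to sample the CPU stacks of the running server processes and see how much wall time goes to Python-side scheduling versus waiting on the GPU. Note that py-spy usually needs root or SYS_PTRACE inside containers.

import subprocess

# Hypothetical helper, not an sglang utility: find every process launched by
# sglang.launch_server (scheduler + TP workers) via pgrep.
def sglang_pids():
    out = subprocess.run(["pgrep", "-f", "sglang.launch_server"],
                         capture_output=True, text=True)
    return [int(pid) for pid in out.stdout.split()]

# Record a 10-second sampling profile of each process; the resulting flame
# graphs show whether time is spent in scheduling code or in GPU waits.
for pid in sglang_pids():
    subprocess.run(["py-spy", "record", "--pid", str(pid),
                    "--duration", "10", "--output", f"sglang_prof_{pid}.svg"])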

hliuca commented 3 days ago

Hi @merrymercy I will run on NV GPUs and update here. Thanks for looking into this.

hliuca commented 3 days ago

H100, TP1:

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 18.49
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256849
Request throughput (req/s): 216.33
Input token throughput (tok/s): 13921.61
Output token throughput (tok/s): 13951.41
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 14500.46
Median E2E Latency (ms): 15651.88
---------------Time to First Token----------------
Mean TTFT (ms): 4796.60
Median TTFT (ms): 4020.55
P99 TTFT (ms): 8216.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 243.34
Median TPOT (ms): 161.56
P99 TPOT (ms): 1695.36
---------------Inter-token Latency----------------
Mean ITL (ms): 152.84
Median ITL (ms): 85.30
P99 ITL (ms): 2405.32

H100, TP8:

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 17.16
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256818
Request throughput (req/s): 233.06
Input token throughput (tok/s): 14997.93
Output token throughput (tok/s): 15030.04
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 13159.01
Median E2E Latency (ms): 14298.66
---------------Time to First Token----------------
Mean TTFT (ms): 4301.37
Median TTFT (ms): 3996.11
P99 TTFT (ms): 5583.20
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 197.62
Median TPOT (ms): 152.06
P99 TPOT (ms): 941.38
---------------Inter-token Latency----------------
Mean ITL (ms): 139.51
Median ITL (ms): 89.13
P99 ITL (ms): 1671.31

hliuca commented 3 days ago

H100 gets much better performance with TP8, so something may be wrong on the ROCm side.

hliuca commented 3 days ago

H100, TP8: max_total_num_tokens=3745506, max_prefill_tokens=16384, max_running_requests=4097, context_len=8192
MI300X, TP8: max_total_num_tokens=9999714, max_prefill_tokens=16384, max_running_requests=4097, context_len=8192
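
For context, those max_total_num_tokens values line up with a rough KV-cache capacity estimate for Llama 3 8B in fp16; the memory fractions in the sketch below are my own guesses, not numbers reported by sglang.

# Rough estimate only; the fraction of GPU memory assumed free for KV cache is a guess.
num_layers, num_kv_heads, head_dim = 32, 8, 128                       # Llama 3 8B (GQA)
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * 2     # K+V in fp16 = 128 KiB
tp = 8                                                                # each rank holds 1/8 of the KV heads

def max_total_tokens(gpu_mem_gib, kv_fraction):
    per_gpu_budget = gpu_mem_gib * 2**30 * kv_fraction
    return int(per_gpu_budget / (kv_bytes_per_token / tp))

print("H100 80 GB   :", max_total_tokens(80, 0.71))    # ~3.7M, close to 3745506
print("MI300X 192 GB:", max_total_tokens(192, 0.79))   # ~10M, close to 9999714

The MI300X token budget is roughly 2.7x larger simply because it has 192 GB of HBM versus 80 GB on H100. If the scheduler admits requests into the running batch based on the available token budget, a larger pool would allow a correspondingly larger running batch, which may explain why capping --max-total-tokens to the H100 value in the next comment changes the behavior.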

hliuca commented 3 days ago

On MI300X, performance is better when concurrency is limited by capping --max-total-tokens to the H100 value, as shown below.

python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 8 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule --max-total-tokens 3745506

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128

WARNING It is recommended to use the Chat or Instruct model for benchmarking. Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.

Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='NousResearch/Meta-Llama-3-8B', tokenizer=None, num_prompts=4000, sharegpt_output_len=None, random_input_len=128, random_output_len=128, random_range_ratio=0.0, request_rate=inf, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, extra_request_body=None)

Input tokens: 257409

Output tokens: 257960

Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|███████████████████████████████████████████████████████████████████████████████████████████| 4000/4000 [00:23<00:00, 171.78it/s]

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 23.29
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256791
Request throughput (req/s): 171.78
Input token throughput (tok/s): 11054.62
Output token throughput (tok/s): 11078.28
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 19869.75
Median E2E Latency (ms): 20904.41
---------------Time to First Token----------------
Mean TTFT (ms): 7656.01
Median TTFT (ms): 7494.83
P99 TTFT (ms): 13581.48
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 371.22
Median TPOT (ms): 200.66
P99 TPOT (ms): 3732.02
---------------Inter-token Latency----------------
Mean ITL (ms): 376.02
Median ITL (ms): 131.98
P99 ITL (ms): 7913.00

hliuca commented 3 days ago

python -m sglang.launch_server --model-path NousResearch/Meta-Llama-3-8B --tp-size 8 --disable-nan-detection --disable-disk-cache --enable-overlap-schedule --max-total-tokens 5618259

python3 -m sglang.bench_serving --backend sglang --dataset-name random --num-prompts 4000 --random-input 128 --random-output 128

WARNING It is recommended to use the Chat or Instruct model for benchmarking. Because when the tokenizer counts the output tokens, if there is gibberish, it might count incorrectly.

Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='', model='NousResearch/Meta-Llama-3-8B', tokenizer=None, num_prompts=4000, sharegpt_output_len=None, random_input_len=128, random_output_len=128, random_range_ratio=0.0, request_rate=inf, seed=1, multi=False, request_rate_range='2,34,2', output_file=None, disable_tqdm=False, disable_stream=False, disable_ignore_eos=False, extra_request_body=None)

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Successful requests: 4000
Benchmark duration (s): 31.11
Total input tokens: 257409
Total generated tokens: 257960
Total generated tokens (retokenized): 256832
Request throughput (req/s): 128.57
Input token throughput (tok/s): 8273.93
Output token throughput (tok/s): 8291.64
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 27610.12
Median E2E Latency (ms): 28874.21
---------------Time to First Token----------------
Mean TTFT (ms): 11439.83
Median TTFT (ms): 11814.12
P99 TTFT (ms): 19957.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 526.06
Median TPOT (ms): 264.76
P99 TPOT (ms): 5435.55
---------------Inter-token Latency----------------
Mean ITL (ms): 490.58
Median ITL (ms): 138.49
P99 ITL (ms): 11879.44
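
To continue this bisection without relaunching by hand, a small driver like the following could sweep several --max-total-tokens caps. This is only a sketch on top of the exact commands used in this thread; the extra 2,000,000 cap and the fixed 5-minute startup wait are assumptions.

import subprocess, time

MODEL = "NousResearch/Meta-Llama-3-8B"
BENCH = ["python3", "-m", "sglang.bench_serving", "--backend", "sglang",
         "--dataset-name", "random", "--num-prompts", "4000",
         "--random-input", "128", "--random-output", "128"]

# Caps taken from this thread, plus one lower value to try (hypothetical choice).
for cap in (2_000_000, 3_745_506, 5_618_259, 9_999_714):
    server = subprocess.Popen(
        ["python", "-m", "sglang.launch_server", "--model-path", MODEL,
         "--tp-size", "8", "--disable-nan-detection", "--disable-disk-cache",
         "--enable-overlap-schedule", "--max-total-tokens", str(cap)])
    time.sleep(300)   # crude wait for weight loading / warmup; a readiness poll would be better
    subprocess.run(BENCH + ["--output-file", f"tp8_cap_{cap}.jsonl"])
    server.terminate()
    server.wait()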

merrymercy commented 2 days ago

Do you have any findings or ideas to resolve the issue with the MI300? I'm happy to do a debugging session with you if you have any concrete findings or solutions.

HaiShaw commented 2 hours ago

@merrymercy @hliuca We will take a look too; we have previously observed that TP8 is comparatively slower.