[Performance]: Using vLLM for Llama3.1 405b fp8 on 8xH100 yields poor throughput

jorgeantonio21 commented 2 months ago

Proposal to improve performance

No response

Report of performance regression

Following the blog post announcement, I tried to replicate its numbers but got much lower throughput than reported. I used Llama 3.1 405B FP8 on an 8xH100 setup.

I experimented with different total prompt counts [32, 128, 256, 1024] and with input and output token lengths (each taken from [128, 256, 512, 1024]).
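A sweep like this could be sketched with vLLM's offline `LLM.generate` API. This is an assumption about how such numbers might be produced, not the script actually used in this report. The dummy token ID, `ignore_eos=True`, and the function names are illustrative, and the full grid below covers more combinations than appear in the log.

```python
from itertools import product

PROMPT_COUNTS = [32, 128, 256, 1024]
TOKEN_LENGTHS = [128, 256, 512, 1024]


def sweep_configs():
    """Yield every (prompt_count, input_len, output_len) combination."""
    yield from product(PROMPT_COUNTS, TOKEN_LENGTHS, TOKEN_LENGTHS)


def run_sweep(model="meta-llama/Meta-Llama-3.1-405B-Instruct-FP8", tp_size=8):
    # Imported lazily so the grid above can be inspected without a GPU.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model, tensor_parallel_size=tp_size)
    for count, input_len, output_len in sweep_configs():
        # Synthetic prompts: `input_len` copies of an arbitrary token ID (assumption).
        prompts = [{"prompt_token_ids": [100] * input_len} for _ in range(count)]
        # ignore_eos forces exactly `output_len` generated tokens per prompt.
        params = SamplingParams(max_tokens=output_len, ignore_eos=True)
        print(f"Prompt count: {count} Input Tokens: {input_len} Output Tokens: {output_len}")
        llm.generate(prompts, params)
```

Calling `run_sweep()` needs the model weights and the GPUs; `sweep_configs()` alone can be used to check which cases will run.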

The total number of generated tokens per second peaked at ~700 tokens/sec, much lower than the roughly 3100 tokens/sec reported in the blog post above.
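The per-run rates can be read off the tqdm summary lines in the log. The sketch below parses one such line and returns the input rate, the output rate, and their sum. It is not stated whether the ~700 figure counts only output tokens or input plus output, so all three values are returned; the helper name and regex are illustrative.

```python
import re

SPEED_RE = re.compile(
    r"est\. speed input: ([\d.]+) toks/s, output: ([\d.]+) toks/s"
)


def parse_speeds(line):
    """Return (input, output, total) tokens/sec from a tqdm summary line, or None."""
    match = SPEED_RE.search(line)
    if match is None:
        return None
    input_rate = float(match.group(1))
    output_rate = float(match.group(2))
    return input_rate, output_rate, input_rate + output_rate


# Sample line copied from the 256-prompt, 128-in / 512-out run in this report.
line = "256/256 [03:30<00:00,  1.22it/s, est. speed input: 52.33 toks/s, output: 623.07 toks/s]"
print(parse_speeds(line))
```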

Also, with a prompt count of 1024 the run crashes with a CUDA illegal memory access. I would not expect this, since the system should be able to manage running sequences even for large queue lengths.

Prompt count: 32 Input Tokens: 128 Output Tokens: 512
Processed prompts: 100%| 32/32 [00:48<00:00,  1.53s/it, est. speed input: 28.13 toks/s, output: 335.00 toks/s]
Prompt count: 32 Input Tokens: 256 Output Tokens: 512
Processed prompts: 100%| 32/32 [00:51<00:00,  1.60s/it, est. speed input: 53.68 toks/s, output: 319.57 toks/s]
Prompt count: 32 Input Tokens: 512 Output Tokens: 256
Processed prompts: 100%| 32/32 [00:31<00:00,  1.03it/s, est. speed input: 176.23 toks/s, output: 263.83 toks/s]
Prompt count: 32 Input Tokens: 1024 Output Tokens: 512
Processed prompts:   0%| 0/32 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Processed prompts: 100%| 32/32 [01:02<00:00,  1.96s/it, est. speed input: 174.67 toks/s, output: 261.50 toks/s]
Prompt count: 128 Input Tokens: 128 Output Tokens: 512
Processed prompts: 100%| 128/128 [01:48<00:00,  1.18it/s, est. speed input: 50.62 toks/s, output: 602.76 toks/s]
Prompt count: 128 Input Tokens: 256 Output Tokens: 512
Processed prompts: 100%| 128/128 [01:56<00:00,  1.10it/s, est. speed input: 94.47 toks/s, output: 562.41 toks/s]
Prompt count: 128 Input Tokens: 512 Output Tokens: 256
Processed prompts: 100%| 128/128 [01:21<00:00,  1.57it/s, est. speed input: 268.67 toks/s, output: 402.22 toks/s]
Prompt count: 128 Input Tokens: 1024 Output Tokens: 512
Processed prompts: 100%| 128/128 [02:44<00:00,  1.28s/it, est. speed input: 266.56 toks/s, output: 399.06 toks/s]
Prompt count: 256 Input Tokens: 128 Output Tokens: 512
Processed prompts: 100%| 256/256 [03:30<00:00,  1.22it/s, est. speed input: 52.33 toks/s, output: 623.07 toks/s]
Prompt count: 256 Input Tokens: 256 Output Tokens: 512
Processed prompts: 100%| 256/256 [03:45<00:00,  1.13it/s, est. speed input: 97.60 toks/s, output: 581.05 toks/s]
Prompt count: 256 Input Tokens: 512 Output Tokens: 256
Processed prompts: 100%| 256/256 [02:38<00:00,  1.61it/s, est. speed input: 275.78 toks/s, output: 412.86 toks/s]
Prompt count: 256 Input Tokens: 1024 Output Tokens: 512
Processed prompts: 100%| 256/256 [05:19<00:00,  1.25s/it, est. speed input: 273.63 toks/s, output: 409.64 toks/s]
Prompt count: 1024 Input Tokens: 128 Output Tokens: 512
Processed prompts:  25%| 254/1024 [03:45<01:25,  9.05it/s, est. speed input: 48.43 toks/s, output: 576.66 toks/s]
Processed prompts:  50%| 509/1024 [07:18<00:55,  9.34it/s, est. speed input: 49.92 toks/s, output: 594.37 toks/s]
[rank0]:[E902 14:02:28.575905740 ProcessGroupNCCL.cpp:1515] [PG 3 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
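The log's own hint can be followed by setting `CUDA_LAUNCH_BLOCKING=1` before CUDA is initialized, so that kernel errors are reported at the launch that caused them. A minimal sketch, assuming the workload is started from a Python script; whether the variable reaches every tensor-parallel worker depends on how the workers are launched, which is not covered here.

```python
import os

# Must be set before CUDA is initialized, i.e. before torch or vllm is imported.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Subprocesses started by this process after this point inherit the variable.
# Import vllm and run the failing workload from here (omitted in this sketch).
```

Synchronous launches slow the run down considerably, so this is meant for debugging the crash only, not for measuring throughput.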

youkaichao commented 2 months ago

Please report your environment and workload.

jorgeantonio21 commented 2 months ago

@youkaichao, I used a machine with CUDA 12.4.1 and Python 3.12 on Ubuntu 22.04.
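An environment report usually also needs the installed vLLM and PyTorch versions, which are not given above. A minimal sketch for collecting them using only the standard library; the function names are illustrative, and this is not a substitute for whatever environment script the project itself asks for.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a marker if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def environment_report():
    """Collect the basic facts a maintainer is likely to ask for."""
    return {
        "os": platform.platform(),
        "python": platform.python_version(),
        "vllm": package_version("vllm"),
        "torch": package_version("torch"),
    }


for key, value in environment_report().items():
    print(f"{key}: {value}")
```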

youkaichao commented 2 months ago

What benchmark script do you use? We use https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py

Cifko commented 1 month ago

We tried our own basic approach. But when I ran `benchmark_throughput.py --input-len 1024 --output-len 128 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --num-prompts 500 --trust-remote-code --device cuda --output-json ../1024-128` today, it reported 0 GPU blocks and failed after about 3 minutes. Any idea why? Can you recommend how to run this properly?
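One possible reading of the 0 GPU blocks message, offered as a guess rather than a diagnosis: once the weights and the peak non-weight memory are subtracted from the per-GPU budget, nothing is left for the KV cache. The back-of-envelope sketch below shows that budget. Every constant is an assumption (80 GB per H100, 0.9 memory utilization, 1 byte per FP8 parameter, a 126-layer model with 8 KV heads of dimension 128, a 16-bit KV cache, 16-token blocks), and the function is hypothetical, not part of vLLM.

```python
def kv_cache_blocks(
    num_gpus=8,
    gpu_mem_gb=80.0,        # assumed H100 80 GB
    mem_utilization=0.9,    # assumed fraction of memory the engine may use
    params_billion=405.0,
    bytes_per_param=1.0,    # FP8 weights
    activation_gb=0.0,      # assumed peak non-weight memory per GPU
    num_layers=126,         # assumed model config
    num_kv_heads=8,
    head_dim=128,
    kv_bytes=2,             # assumed 16-bit KV cache
    block_size=16,          # assumed tokens per KV block
):
    """Estimate how many KV cache blocks fit on one GPU (1 GB = 1e9 bytes)."""
    weights_gb = params_billion * bytes_per_param / num_gpus
    free_gb = gpu_mem_utilization_budget(gpu_mem_gb, mem_utilization) - weights_gb - activation_gb
    # Keys and values for the KV heads that live on one GPU under tensor parallelism.
    heads_per_gpu = max(1, num_kv_heads // num_gpus)
    bytes_per_token = 2 * num_layers * heads_per_gpu * head_dim * kv_bytes
    tokens = max(0.0, free_gb) * 1e9 / bytes_per_token
    return int(tokens // block_size)


def gpu_mem_utilization_budget(gpu_mem_gb, mem_utilization):
    """Memory per GPU that the engine is allowed to use, in GB."""
    return gpu_mem_gb * mem_utilization


print(kv_cache_blocks())                    # weights only
print(kv_cache_blocks(activation_gb=25.0))  # large peak activations leave no room
```

Under these assumptions the weights alone take about 50.6 GB of a 72 GB per-GPU budget, so a peak of more than roughly 21 GB of other memory would leave zero blocks. If that is what happened here, lowering the engine's `max_model_len` or adjusting `gpu_memory_utilization` may be worth trying.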

jorgeantonio21 commented 1 month ago

@youkaichao, following @Cifko's question above, do you have any suggestions on how to use the benchmark_throughput.py script so as to avoid these issues?

youkaichao commented 1 month ago

I think you need to open a new issue to ask about how to run the model properly, with details about your environment.

vllm-project / vllm

[Performance]: Using vLLM for Llama3.1 405b fp8 on 8xH100 yields poor throughput #8244
