jorgeantonio21 opened 2 months ago
Please report your environment and workload.
@youkaichao, I used a machine with CUDA 12.4.1 and Python 3.12 on Ubuntu 22.04.
Which benchmark script do you use? We use https://github.com/vllm-project/vllm/blob/main/benchmarks/benchmark_serving.py
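For reference, a typical serving-benchmark run looks roughly like the sketch below. The exact flags, dataset choice, and dataset path are illustrative assumptions and may differ across vLLM versions:

```bash
# Terminal 1: start an OpenAI-compatible vLLM server (flags are illustrative)
vllm serve meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8

# Terminal 2: drive the running server with the serving benchmark
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 500
```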
We tried our own basic approach. But when I tried today with `benchmark_throughput.py --input-len 1024 --output-len 128 --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tensor-parallel-size 8 --num-prompts 500 --trust-remote-code --device cuda --output-json ../1024-128`, I got 0 GPU blocks, so it fails after about 3 minutes of running. Any idea why? Can you recommend how to run this properly?
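If the 0-GPU-blocks failure means the FP8 405B weights are consuming nearly all GPU memory and leaving no room for the KV cache, one thing worth trying is giving the engine more memory headroom and a smaller context length. This is only a sketch, assuming `benchmark_throughput.py` forwards the standard vLLM engine args `--gpu-memory-utilization` and `--max-model-len`:

```bash
# Unverified sketch: raise the memory fraction available to vLLM and cap the
# context length so the KV cache fits alongside the FP8 405B weights.
python benchmarks/benchmark_throughput.py \
  --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --input-len 1024 --output-len 128 --num-prompts 500 \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --max-model-len 4096 \
  --output-json ../1024-128
```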
@youkaichao, following @Cifko's question above, do you have any suggestions on how to use the benchmark_throughput.py script so as to avoid these issues?
I think you need to open a new issue to ask about how to run the model properly, with details about your environment.
Proposal to improve performance
No response
Report of performance regression
Following the blog post announcement, I tried to replicate the reported numbers, but I got much lower throughput. I used Llama 3.1 405B FP8 on an 8xH100 setup.
I experimented with different total prompt counts [32, 128, 256, 1024] and input/output token lengths (both drawn from [128, 256, 512, 1024]); a sketch of such a sweep is given after this report.
The total generation throughput peaked at ~700 tokens/sec, which is much lower than the roughly ~3100 tokens/sec reported in the blog post above.
Also, for a prompt count of 1024 I get a crash with a CUDA illegal memory access, which I would not expect: the system should be able to manage the running sequences even for large queue lengths.
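For context, the sweep mentioned above could be scripted roughly as follows. This is a hypothetical reconstruction, reusing only the benchmark_throughput.py flags already shown earlier in this thread; the result directory name is an assumption:

```bash
# Sweep prompt counts and input/output lengths with benchmark_throughput.py,
# writing one JSON result file per configuration.
for NP in 32 128 256 1024; do
  for LEN in 128 256 512 1024; do
    python benchmarks/benchmark_throughput.py \
      --model meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 \
      --tensor-parallel-size 8 \
      --input-len "$LEN" --output-len "$LEN" \
      --num-prompts "$NP" \
      --trust-remote-code \
      --output-json "results/${NP}prompts-${LEN}in-${LEN}out.json"
  done
done
```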