alpayariyak opened 2 months ago
@alpayariyak, I experimented with vLLM config params for 405B FP8 on 8x H100 80GB SXM. Setting --tensor-parallel-size 8 leverages all eight GPUs effectively, significantly boosting model throughput through parallel computation. Does your data contain long inputs (many tokens)? If so, you can increase --max-model-len to 8192. I was able to get a 65k context length with --tensor-parallel-size 8 while keeping the other params at their default values.
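A minimal offline-inference sketch with just those two settings could look like the following (the model ID and the 8192 value here are placeholder assumptions, not something confirmed in this thread):

```python
from vllm import LLM, SamplingParams

# Minimal sketch: tensor parallelism across all eight GPUs plus an explicit
# context length. The model ID and max_model_len are illustrative assumptions.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # placeholder model ID
    tensor_parallel_size=8,   # shard the model across all eight H100s
    max_model_len=8192,       # raise this if your inputs are longer
)

sampling = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["Summarize the following document: ..."], sampling)
print(outputs[0].outputs[0].text)
```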
@dipatidar thank you, but I'm already using tp=8 and my max model len is adequate. The other parameters are what I'm asking about: the default values often try to balance per-request performance and TTFT against overall throughput, and I only care about the latter.
Your current environment
irrelevant
How would you like to use vllm
What would be the arguments that would maximize overall throughput for large-batch offline inference? More specifically, I'm looking to generate 405B FP8 outputs for millions of inputs on 8x H100 80GB SXM.
Thus far, I've been using the following arguments, but I wonder whether there are any others that would better optimize this use case, where per-request throughput and TTFT don't matter.
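For illustration only (these are not the arguments from this issue, and every value is an assumption), the vLLM engine knobs usually associated with batch throughput rather than per-request latency look roughly like this:

```python
from vllm import LLM, SamplingParams

# Hypothetical throughput-oriented configuration; every value here is an
# assumption for illustration, not a recommendation from this thread.
llm = LLM(
    model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # placeholder model ID
    tensor_parallel_size=8,
    max_model_len=8192,
    gpu_memory_utilization=0.95,    # leave more HBM for the KV cache
    enable_chunked_prefill=True,    # overlap prefill with ongoing decodes
    max_num_batched_tokens=8192,    # tokens scheduled per engine step
    max_num_seqs=512,               # concurrent sequences per batch
)

sampling = SamplingParams(temperature=0.0, max_tokens=1024)

# For millions of inputs, submit them in large chunks and let vLLM's
# continuous batching keep the GPUs saturated.
prompts = [f"Input {i}: ..." for i in range(10_000)]
outputs = llm.generate(prompts, sampling)
```

Whether tuning these particular knobs actually helps for this workload is exactly the open question in this issue.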
Before submitting a new issue...