vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Best engine arguments for large batch inference #8513

Open alpayariyak opened 2 months ago

alpayariyak commented 2 months ago

Your current environment

irrelevant

How would you like to use vllm

What arguments would maximize overall throughput for large-batch offline inference? More specifically, I'm looking to generate outputs with the 405B FP8 model for millions of inputs on 8x H100 80GB SXM GPUs.

Thus far, I've been using the following arguments, but I wonder if there are any others that would optimize this use case, where per-request throughput and TTFT don't matter?

    --max-model-len 8192 \
    --disable-log-requests \
    --gpu-memory-utilization 0.93 \
    --use-v2-block-manager \
    --block-size 32 \
    --max-num-seqs 512
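
For reference, here is roughly the same setup through vLLM's offline LLM Python API; the model id, prompts, and sampling parameters below are placeholders rather than my exact values, and the logging/block-manager flags are omitted for brevity:

    # Rough Python-API equivalent of the CLI args above, using vllm.LLM.
    # Model id, prompts, and sampling values are placeholders.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # placeholder model id
        tensor_parallel_size=8,
        max_model_len=8192,
        gpu_memory_utilization=0.93,
        block_size=32,
        max_num_seqs=512,
    )

    prompts = ["Example input 1", "Example input 2"]  # millions of these in practice
    sampling = SamplingParams(temperature=0.0, max_tokens=512)  # placeholder sampling
    outputs = llm.generate(prompts, sampling)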


dipatidar commented 2 months ago

@alpayariyak, I experimented with vLLM config params for 405B FP8 on 8x H100 80GB SXM. Setting --tensor-parallel-size 8 leverages all eight GPUs effectively, significantly boosting throughput through parallel computation. Does your data contain long inputs (many tokens)? If so, you can increase --max-model-len beyond 8192. I was able to get a 65k context length with --tensor-parallel-size 8 while keeping the other params at their default values.
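
For example, a sketch of the kind of config I mean (the model id is a placeholder and 65536 is just an illustrative "65k" value, not an exact measurement):

    # Sketch of the setup described above; not an exact reproduction.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # placeholder model id
        tensor_parallel_size=8,   # use all eight GPUs
        max_model_len=65536,      # illustrative ~65k context; other args at defaults
    )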

alpayariyak commented 2 months ago

@dipatidar thank you, but I'm already running tp=8 and my max model len is adequate. The other parameters are what I'm asking about, since default values often try to balance per-request performance and TTFT against overall throughput, and I only care about the latter.
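
Concretely, these are the kinds of throughput-oriented knobs I'm wondering about; the values below are examples I'm considering, not settings recommended in this thread:

    # Illustrative engine args aimed at raw throughput; values are examples only.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-3.1-405B-Instruct-FP8",  # placeholder model id
        tensor_parallel_size=8,
        max_model_len=8192,
        gpu_memory_utilization=0.93,
        max_num_seqs=512,               # max concurrent sequences per scheduling step
        max_num_batched_tokens=8192,    # token budget per scheduling step
        enable_chunked_prefill=True,    # lets prefills and decodes share a batch
    )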