vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: need a GB-based alternative for gpu_memory_utilization #7524

Open stas00 opened 4 weeks ago

stas00 commented 4 weeks ago

🚀 The feature, motivation and pitch

I'm struggling to figure out how to extend our test suite to include vllm tests. The problem is that by default vllm takes over the whole gpu, which prevents running multiple parallel tests with pytest-xdist - all tests except the ones on the first worker fail with OOM. The tests use tiny models, so running 4-6 tests in parallel on a 24GB gpu works just fine w/o vllm.

Using gpu_memory_utilization isn't a solution because each test may require a specific amount of memory, and a rough gpu_memory_utilization=1/num_workers would make the test suite very inefficient: cuda kernels and various other things take a fixed amount of memory, and the test suite may run on gpus of different sizes - CI uses one gpu type while developers use other gpu types when they write the tests.

The parallelized run would work well if each test could say how much memory it needs via, say, a new config setting gpu_memory_utilization_in_gbs. It probably shouldn't be very complicated to add, since gpu_memory_utilization already goes through the process of calculating the actual number of GBs it can use for its allocations.
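To illustrate the kind of conversion I mean, here is a rough workaround sketch that can be used today - a hypothetical helper (not part of vLLM) that turns a per-test GB budget into the fraction gpu_memory_utilization expects; the model name and the budget are just placeholders:

```python
# Hypothetical helper, not part of vLLM: approximate a per-test GB budget
# as the fraction that gpu_memory_utilization currently expects.
import torch
from vllm import LLM

def fraction_for_gb(gb_budget: float, device: int = 0) -> float:
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    # Clamp to 1.0 so an oversized budget doesn't trip the [0, 1] validation.
    return min(gb_budget * 1024**3 / total_bytes, 1.0)

# Each test declares its own budget and no longer cares which GPU CI uses.
llm = LLM(model="facebook/opt-125m",
          gpu_memory_utilization=fraction_for_gb(2.0))
```

The proposed setting would essentially do this conversion internally, only more accurately, since vllm knows its actual overheads.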

I think this feature would also be useful for disaggregation, where more than one vllm server may run on the same gpu: instead of trying to guess what each slice of gpu_memory_utilization should be as a percentage, one could specify the actual memory in GBs.

Thank you.

youkaichao commented 4 weeks ago

you can allow gpu_memory_utilization > 1 to be used to specify GB, with an info message telling the user what it does.

stas00 commented 4 weeks ago

oh, wow, that's great - thank you, @youkaichao!

Why is it not documented? https://docs.vllm.ai/en/latest/models/engine_args.html#engine-arguments

--gpu-memory-utilization

The fraction of GPU memory to be used for the model executor, which can range from 0 to 1. For example, a value of 0.5 would imply 50% GPU memory utilization. If unspecified, will use the default value of 0.9.

Default: 0.9

stas00 commented 4 weeks ago

It doesn't work:

```
E             File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/site-packages/vllm/config.py", line 461, in __init__
E               self._verify_args()
E             File "/home/stas/anaconda3/envs/py310-pt22/lib/python3.10/site-packages/vllm/config.py", line 476, in _verify_args
E               raise ValueError(
E           ValueError: GPU memory utilization must be less than 1.0. Got 5.
```

youkaichao commented 4 weeks ago

I mean, you can create a PR to implement it.

stas00 commented 4 weeks ago

Are you proposing to overload the existing variable, to avoid having 2 settings and needing to check that both aren't defined?

But overloading is error-prone/ambiguous - does 1 mean 100% or 1GB? I think it should be a new config option.

youkaichao commented 4 weeks ago

overloading the existing one looks good to me, as long as we keep the original semantics of [0, 1] unchanged.

stas00 commented 4 weeks ago

As I mentioned in my last comment, your proposal has a definition problem: 1 would then be ambiguous - it could mean either 100% or 1GB.

youkaichao commented 4 weeks ago

vLLM has the right to choose the interpretation. For vLLM, 1 means 100%. That's all. If users want 1GB, they can just put 1.000001.
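For concreteness, a minimal sketch of that overloaded interpretation (hypothetical code, not what vLLM currently does; whether values above 1 should mean GB or GiB is a detail to settle):

```python
def resolve_gpu_memory_budget(value: float, total_gpu_bytes: int) -> int:
    """Hypothetical sketch: turn gpu_memory_utilization into a byte budget.

    Values in (0, 1] keep their current meaning (a fraction of total GPU
    memory); anything above 1 is interpreted as GiB.
    """
    if 0 < value <= 1:
        return int(value * total_gpu_bytes)
    return int(value * 1024**3)  # e.g. 1.000001 -> ~1 GiB, 5 -> 5 GiB
```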

jinzhen-lin commented 3 weeks ago

Adding units could be a better option.

This makes it more flexible. If we want to support more special values (e.g. auto, min) in the future, it requires fewer changes.
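For example, a rough sketch of the unit-aware parsing this would imply (hypothetical, not an existing vLLM API; the accepted suffixes and special values are only illustrative):

```python
import re

def parse_gpu_memory_setting(value: str):
    """Hypothetical sketch of a unit-aware --gpu-memory-utilization parser.

    Returns ("fraction", f), ("bytes", n), or ("special", name).
    """
    value = value.strip().lower()
    if value in ("auto", "min"):  # leaves room for special values later
        return ("special", value)
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)\s*(gib|gb|mib|mb)?", value)
    if not match:
        raise ValueError(f"Unrecognized memory setting: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    if unit is None:  # a plain number keeps today's fraction semantics
        return ("fraction", number)
    scale = {"gib": 1024**3, "gb": 10**9, "mib": 1024**2, "mb": 10**6}[unit]
    return ("bytes", int(number * scale))

# parse_gpu_memory_setting("0.9")   -> ("fraction", 0.9)
# parse_gpu_memory_setting("16GiB") -> ("bytes", 17179869184)
# parse_gpu_memory_setting("auto")  -> ("special", "auto")
```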