vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: How can I deploy llama3-70b on a server with 8 3090 GPUs with LoRA and CUDA graph? #5193

Open AlphaINF opened 1 month ago

AlphaINF commented 1 month ago

Your current environment

None

How would you like to use vllm

I hope to deploy the llama3-70b model on a server with eight 3090 GPUs. When I enable the enable_lora switch, the system always exceeds the memory limit (even if the context length is reduced to 128) unless I also enable the enforce_eager flag. However, when I disable enable_lora, the model runs using about 85% of GPU memory. I would like to know how the memory consumption of CUDA graph differs when LoRA is enabled versus when it is not.

In this situation, how can I enable CUDA graph acceleration for the model without exceeding the memory limit?
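A minimal sketch of the setup described above, for reference. The model name, LoRA adapter path, and memory/context values are illustrative assumptions rather than the reporter's exact command:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # assumed HF repo id
    tensor_parallel_size=8,        # one shard per 3090
    enable_lora=True,              # triggers the extra LoRA buffer preallocation
    enforce_eager=True,            # skips CUDA graph capture; avoids the reported OOM
    max_model_len=2048,            # illustrative context length
    gpu_memory_utilization=0.90,
)

outputs = llm.generate(
    ["Hello, my name is"],
    SamplingParams(max_tokens=32),
    lora_request=LoRARequest("my-adapter", 1, "/path/to/lora"),  # hypothetical adapter
)
print(outputs[0].outputs[0].text)
```

With enforce_eager=False this is the configuration that reportedly runs out of memory once enable_lora is on.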

youkaichao commented 1 month ago

I'm not familiar with the LoRA part. Without LoRA, CUDA graph should not consume too much memory (less than 1 GB, I think). cc @Yard1 for the LoRA-related question.
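One rough way to check that overhead on a given setup is to build the engine and then read free GPU memory, once with enforce_eager=True and once with False (in separate processes), and compare. A sketch, with an assumed model name; with tensor parallelism the numbers reflect only the local device:

```python
import sys
import torch
from vllm import LLM

# Pass "eager" or "graph" on the command line to pick the mode.
enforce_eager = (sys.argv[1] == "eager") if len(sys.argv) > 1 else True

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    enforce_eager=enforce_eager,
)

free, total = torch.cuda.mem_get_info()  # bytes on the local CUDA device
print(f"enforce_eager={enforce_eager}: "
      f"{free / 2**30:.2f} GiB free of {total / 2**30:.2f} GiB")
```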

Yard1 commented 1 month ago

Increased memory usage is expected when using LoRA, as it preallocates GPU buffers to store the LoRA weights. I don't think it would change CUDA graph memory consumption.
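Not a confirmed fix, but since the preallocated LoRA buffers scale with the LoRA settings, shrinking them (along with the batch-related limits) is one way to try to leave room for CUDA graph capture. All values below are illustrative assumptions:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=8,
    enable_lora=True,
    max_loras=1,                  # adapters held in GPU buffers at once
    max_lora_rank=16,             # cap on adapter rank; larger ranks mean larger buffers
    max_num_seqs=64,              # smaller batch -> smaller captured CUDA graphs
    gpu_memory_utilization=0.85,
    # enforce_eager left at its default (False) so CUDA graphs are captured if memory allows
)
```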