Open AlphaINF opened 1 month ago
I'm not familiar with the LoRA part. Without LoRA, CUDA graphs should not consume too much memory (less than 1 GB, I think). cc @Yard1 for LoRA-related questions.
Increased memory usage is expected when using LoRA, since it preallocates GPU buffers to store the LoRA weights. I don't think it would change CUDA graph memory consumption.
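To get a feel for how large that preallocation can be, here is a back-of-envelope sketch. The formula and every number in it (layers targeted, number of projections per layer, rank, adapter count) are assumptions for illustration, not vLLM's exact accounting:

```python
# Rough estimate of preallocated LoRA buffer size: for each targeted
# module, serving keeps slots for a LoRA A matrix (hidden x rank) and
# a LoRA B matrix (rank x hidden), for up to max_loras adapters.
def lora_buffer_bytes(num_layers: int, hidden_size: int,
                      num_target_modules: int, max_lora_rank: int,
                      max_loras: int, dtype_bytes: int = 2) -> int:
    per_module = 2 * hidden_size * max_lora_rank * dtype_bytes  # A + B
    return num_layers * num_target_modules * max_loras * per_module

# Llama-3-70B-like shape: 80 layers, hidden size 8192; assume 4
# attention projections targeted, rank 16, 8 concurrent adapters, fp16.
total = lora_buffer_bytes(80, 8192, 4, 16, 8)
print(f"~{total / 2**30:.2f} GiB across the whole model")  # ~1.25 GiB
```

With tensor parallelism across 8 GPUs that total is split between the cards, so the per-GPU cost of the buffers themselves is modest; the numbers grow linearly with `max_loras` and `max_lora_rank`.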
Your current environment
None
How would you like to use vllm
I hope to deploy the Llama-3-70B model on a server with eight 3090 GPUs. When I enable the enable_lora switch, the system always exceeds the memory limit (even with the context length reduced to 128) unless I also set the enforce_eager flag. However, with enable_lora disabled, the model runs using about 85% of GPU memory. I would like to know how the memory consumption of CUDA graphs differs with LoRA enabled versus disabled.
In this situation, how can I enable CUDA graph acceleration for the model without exceeding the memory limit?
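One possible direction, sketched below as a launch command (assuming a vLLM version that supports these engine flags; the model name and all values are illustrative, not a tested configuration): keep CUDA graphs enabled, but lower `--gpu-memory-utilization` to leave headroom for graph capture, and shrink the preallocated LoRA buffers via `--max-loras` and `--max-lora-rank`.

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 8 \
  --enable-lora \
  --max-loras 1 \
  --max-lora-rank 16 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.85
```

If this still runs out of memory during CUDA graph capture, falling back to `--enforce-eager` trades the graph speedup for the memory headroom.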