ChuanhongLi opened this issue 3 months ago
If I remember correctly, GPU memory util = 0.9 means that after allocating space for the model parameters, 90% of the remaining GPU memory will be used to store the KV cache.
As line 208 (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L208) shows:
num_gpu_blocks = int(
    (total_gpu_memory * self.cache_config.gpu_memory_utilization -
     peak_memory) // cache_block_size)
That is, total_gpu_memory * self.cache_config.gpu_memory_utilization is the budget for both the model and the KV cache.
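A rough sketch of that calculation, with made-up numbers (the peak memory and cache block size below are illustrative, not measured from this setup):

```python
def estimate_num_gpu_blocks(total_gpu_memory: float,
                            gpu_memory_utilization: float,
                            peak_memory: float,
                            cache_block_size: float) -> int:
    # Same arithmetic as worker.py line 208: the utilization fraction is
    # applied to the *total* GPU memory, the peak memory measured during
    # profiling (weights + activations) is subtracted, and whatever is
    # left is divided into fixed-size KV-cache blocks.
    return int((total_gpu_memory * gpu_memory_utilization - peak_memory)
               // cache_block_size)

GiB = 1024 ** 3
# Illustrative only: a 24 GiB card, 0.9 utilization, 18 GiB peak usage
# during profiling, and a 2 MiB cache block.
print(estimate_num_gpu_blocks(24 * GiB, 0.9, 18 * GiB, 2 * 1024 ** 2))
```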
@youkaichao Hi, do you have any idea about the memory consumed by the CUDA graphs?
The memory usage shown by nvidia-smi is 23859MiB / 24564MiB on each card; when I disable CUDA graphs with --enforce-eager, the memory usage drops to 21395MiB / 24564MiB (close to 24564 * 0.9).
23859 - 21395 = 2464 MiB is consumed by the CUDA graphs. Too much! Is this normal?
if you take a look at the log, you should notice:
CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
that's the cost of running faster.
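For example, a minimal sketch of those knobs using the offline LLM API (the values are illustrative, and the arguments mirror the --gpu-memory-utilization, --enforce-eager and --max-num-seqs flags of the API server):

```python
from vllm import LLM

# Illustrative settings only; each line corresponds to one of the
# mitigations in the log message above (you would normally pick the
# one that fits, not necessarily all three at once).
llm = LLM(
    model="/workspace/sdb/models/Yi/Yi-1.5-34B-Chat/",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.85,  # lower than 0.9 to leave headroom for graphs
    enforce_eager=True,           # disable CUDA graph capture entirely
    max_num_seqs=128,             # a smaller max batch also shrinks memory usage
)
```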
I set the gpu_memory_utilization to 0.7. Initially, the memory usage was indeed at 0.7, but as time went on (after receiving many requests), I checked the memory usage again and it had increased to 0.95. There were no active requests when I checked the memory usage. When there are no requests, the memory usage should return to around 0.7, right?
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Anything you want to discuss about vllm.
I run the model on the server with 4 x NVIDIA GeForce RTX 4090 cards: CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0 --tensor-parallel-size 4
The memory usage shown by nvidia-smi is 23859MiB / 24564MiB on each card; when I disable CUDA graphs with --enforce-eager, the memory usage drops to 21395MiB / 24564MiB (close to 24564 * 0.9).
23859 - 21395 = 2464 MiB is consumed by the CUDA graphs. Too much! Is this normal?
During inference with batch size = 4 (input length = 2k tokens for each query), OOM occurs.
BTW, when we set gpu-memory-utilization = 0.9, does it mean that 24564 MiB (for the RTX 4090) * 0.9 can be used for the model and KV cache?
vLLM version: 0.5.4
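To spell out the arithmetic behind this question with the numbers reported above (the CUDA graph figure is just the observed difference, not something vLLM budgets for):

```python
total_mib = 24564             # RTX 4090 memory reported by nvidia-smi
budget_mib = total_mib * 0.9  # ~22108 MiB for weights + activations + KV cache
eager_mib = 21395             # usage with --enforce-eager (no CUDA graphs)
graph_mib = 23859             # usage with CUDA graphs enabled

print(f"0.9 budget      : {budget_mib:.0f} MiB")
print(f"CUDA graph extra: {graph_mib - eager_mib} MiB")  # 2464 MiB, inside the 1~3 GiB the log warns about
```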