vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: gpu-memory-utilization and Memory-Usage (by nvidia-smi) #7553

Open ChuanhongLi opened 3 weeks ago

ChuanhongLi commented 3 weeks ago

Anything you want to discuss about vllm.

I run the model on a server with 4 x NVIDIA GeForce RTX 4090 cards:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0 --tensor-parallel-size 4

The Memory-Usage reported by nvidia-smi is 23859MiB / 24564MiB on each card; when I disable CUDA graphs with --enforce-eager, the memory usage drops to 21395MiB / 24564MiB (close to 24564 * 0.9 ≈ 22108 MiB).

23859 - 21395 = 2464 MiB is consumed by the CUDA graphs. Too much! Is this normal?

During inference with batch size = 4 (input length = 2k tokens per query), OOM occurs.

BTW, when we set gpu-memory-utilization = 0.9, does it mean that 24564 MiB (for an RTX 4090) * 0.9 can be used for the model and KV cache?
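As a quick sanity check on these numbers (a back-of-the-envelope sketch in Python; nothing here is vLLM API, just arithmetic on the nvidia-smi readings above):

total_mib = 24564                # per-GPU capacity reported by nvidia-smi
budget_mib = total_mib * 0.9     # the --gpu-memory-utilization 0.9 budget
print(budget_mib)                # 22107.6 MiB; the eager-mode reading of 21395 MiB sits just under this

cuda_graph_mib = 23859 - 21395   # usage with CUDA graphs minus usage with --enforce-eager
print(cuda_graph_mib)            # 2464 MiB of extra memory attributable to CUDA graphs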

vLLM version: 0.5.4

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0  --tensor-parallel-size 4

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:4F:00.0 Off |                  Off |
| 31%   26C    P8              10W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:52:00.0 Off |                  Off |
| 31%   28C    P8              14W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 31%   27C    P8              13W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 31%   28C    P8              12W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off | 00000000:CE:00.0 Off |                  Off |
| 31%   30C    P8               6W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off | 00000000:D1:00.0 Off |                  Off |
| 31%   31C    P8               8W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off | 00000000:D5:00.0 Off |                  Off |
| 31%   32C    P8               9W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:D6:00.0 Off |                  Off |
| 31%   32C    P8               4W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0  --tensor-parallel-size 4 --enforce-eager

(base) [root@localhost models]# nvidia-smi
Thu Aug 15 02:55:11 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:4F:00.0 Off |                  Off |
| 31%   28C    P8              21W / 450W |    393MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:52:00.0 Off |                  Off |
| 31%   28C    P8              27W / 450W |      5MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 31%   35C    P2             110W / 450W |  17357MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 31%   28C    P8              26W / 450W |      5MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off | 00000000:CE:00.0 Off |                  Off |
| 31%   28C    P8               6W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off | 00000000:D1:00.0 Off |                  Off |
| 31%   29C    P8               8W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off | 00000000:D5:00.0 Off |                  Off |
| 31%   30C    P8               8W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:D6:00.0 Off |                  Off |
| 31%   29C    P8               4W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
KuntaiDu commented 3 weeks ago

If I remember correctly, GPU memory util = 0.9 means that after allocating spaces for model parameters, 90% of remaining GPU memory will be used to store KV caches.

ChuanhongLi commented 3 weeks ago

> If I remember correctly, GPU memory util = 0.9 means that after allocating spaces for model parameters, 90% of remaining GPU memory will be used to store KV caches.

As line 208 of worker.py (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L208) shows:

num_gpu_blocks = int(
    (total_gpu_memory * self.cache_config.gpu_memory_utilization -
     peak_memory) // cache_block_size)

So total_gpu_memory * self.cache_config.gpu_memory_utilization is the budget for the model and the KV cache together: peak_memory (which already includes the loaded model weights) is subtracted from that budget, and only the remainder becomes KV cache blocks.
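To make the formula concrete, here is a minimal sketch with illustrative numbers (peak_memory and cache_block_size below are hypothetical placeholders, not values measured for Yi-1.5-34B-Chat):

GiB = 1024 ** 3
total_gpu_memory = 24 * GiB          # roughly an RTX 4090
gpu_memory_utilization = 0.9
peak_memory = 18 * GiB               # hypothetical: weights + peak activations seen during profiling
cache_block_size = 2 * 1024 * 1024   # hypothetical: bytes needed per KV cache block

# Same formula as worker.py line 208: the 0.9 budget must cover model + KV cache.
num_gpu_blocks = int(
    (total_gpu_memory * gpu_memory_utilization - peak_memory)
    // cache_block_size)
print(num_gpu_blocks)                # 1843 blocks in this made-up example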

ChuanhongLi commented 3 weeks ago

@youkaichao Hi, do you have any idea about the memory consumed by the CUDA graphs?

The Memory-Usage reported by nvidia-smi is 23859MiB / 24564MiB on each card; when I disable CUDA graphs with --enforce-eager, the memory usage drops to 21395MiB / 24564MiB (close to 24564 * 0.9 ≈ 22108 MiB).

23859 - 21395 = 2464 MiB is consumed by the CUDA graphs. Too much! Is this normal?

youkaichao commented 3 weeks ago

If you take a look at the log, you should notice:

CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.

That's the cost of running faster.
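For example, an illustrative lower-memory variant of the command above (the flag values here are made-up starting points, not tuned recommendations) keeps CUDA graphs but lowers the memory budget and caps concurrency via --max-num-seqs:

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.85 --swap-space 0 --tensor-parallel-size 4 --max-num-seqs 64

Alternatively, add --enforce-eager to avoid the CUDA graph memory cost entirely, at the price of slower execution.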

gamersover commented 3 weeks ago

I set gpu_memory_utilization to 0.7. Initially, memory usage was indeed at 0.7, but after the server had handled many requests I checked again and it had grown to 0.95, even though no requests were active at the time. Shouldn't the memory usage return to around 0.7 when there are no requests?