vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Misc]: gpu-memory-utilization and Memory-Usage (by nvidia-smi) #7553

Open · ChuanhongLi opened this issue 3 months ago

ChuanhongLi commented 3 months ago

Anything you want to discuss about vllm.

I run the model on the server with 4 x NVIDIA GeForce RTX 4090 cards: CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0 --tensor-parallel-size 4

Memory usage shown by nvidia-smi is 23859MiB / 24564MiB on each card; when I disable CUDA graphs with --enforce-eager, the usage drops to 21395MiB / 24564MiB (close to 24564 * 0.9).

23859 - 21395 = 2464 MiB is consumed by the CUDA graphs. Too much! Is this normal?
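
(For reference, the same per-GPU numbers can be read programmatically instead of parsing nvidia-smi output; a minimal sketch, assuming the nvidia-ml-py / pynvml package is installed:)

# Minimal sketch: poll per-GPU memory usage, the same counters nvidia-smi reports.
# Assumes: pip install nvidia-ml-py
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {i}: {mem.used / 2**20:.0f}MiB / {mem.total / 2**20:.0f}MiB")
pynvml.nvmlShutdown()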

During inference with batch size = 4 (input length = 2k for each query), OOM occurs.
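
For a rough sense of scale: the KV cache for such a batch is modest, so the headroom here is likely eaten by the CUDA graph pools and activation peaks on top of the 0.9 budget rather than the KV cache itself. A back-of-the-envelope sketch (the layer/head numbers below are assumptions for a Yi-34B-class model; read the real values from the model's config.json):

# Back-of-the-envelope KV cache sizing in fp16.
# NOTE: num_layers / num_kv_heads / head_dim are assumed values for
# illustration; check the model's config.json for the real ones.
num_layers = 60
num_kv_heads = 8        # GQA
head_dim = 128
bytes_per_elem = 2      # fp16
tp_size = 4

# K and V, per token, across all layers
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
tokens = 4 * 2048       # batch size 4, ~2k tokens per query
total_gib = kv_bytes_per_token * tokens / 2**30
print(f"~{total_gib:.2f} GiB KV cache total, ~{total_gib / tp_size:.2f} GiB per GPU")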

BTW, when we set gpu-memory-utilization = 0.9, does it mean that 24564 MiB (for the RTX 4090) * 0.9 can be used for the model and the KV cache?

vLLM version: 0.5.4.

CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0  --tensor-parallel-size 4

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:4F:00.0 Off |                  Off |
| 31%   26C    P8              10W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:52:00.0 Off |                  Off |
| 31%   28C    P8              14W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 31%   27C    P8              13W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 31%   28C    P8              12W / 450W |      2MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off | 00000000:CE:00.0 Off |                  Off |
| 31%   30C    P8               6W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off | 00000000:D1:00.0 Off |                  Off |
| 31%   31C    P8               8W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off | 00000000:D5:00.0 Off |                  Off |
| 31%   32C    P8               9W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:D6:00.0 Off |                  Off |
| 31%   32C    P8               4W / 450W |  23859MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m vllm.entrypoints.openai.api_server --served-model-name Yi-1.5-34B-Chat --model /workspace/sdb/models/Yi/Yi-1.5-34B-Chat/ --port 8001 --gpu-memory-utilization 0.9 --swap-space 0  --tensor-parallel-size 4 --enforce-eager

(base) [root@localhost models]# nvidia-smi
Thu Aug 15 02:55:11 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:4F:00.0 Off |                  Off |
| 31%   28C    P8              21W / 450W |    393MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:52:00.0 Off |                  Off |
| 31%   28C    P8              27W / 450W |      5MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 4090        Off | 00000000:56:00.0 Off |                  Off |
| 31%   35C    P2             110W / 450W |  17357MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 4090        Off | 00000000:57:00.0 Off |                  Off |
| 31%   28C    P8              26W / 450W |      5MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  NVIDIA GeForce RTX 4090        Off | 00000000:CE:00.0 Off |                  Off |
| 31%   28C    P8               6W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  NVIDIA GeForce RTX 4090        Off | 00000000:D1:00.0 Off |                  Off |
| 31%   29C    P8               8W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  NVIDIA GeForce RTX 4090        Off | 00000000:D5:00.0 Off |                  Off |
| 31%   30C    P8               8W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  NVIDIA GeForce RTX 4090        Off | 00000000:D6:00.0 Off |                  Off |
| 31%   29C    P8               4W / 450W |  21395MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
KuntaiDu commented 3 months ago

If I remember correctly, GPU memory util = 0.9 means that after allocating space for the model parameters, 90% of the remaining GPU memory will be used to store KV caches.

ChuanhongLi commented 3 months ago

If I remember correctly, GPU memory util = 0.9 means that after allocating space for the model parameters, 90% of the remaining GPU memory will be used to store KV caches.

As line 208 (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L208) shows:

num_gpu_blocks = int(
    (total_gpu_memory * self.cache_config.gpu_memory_utilization -
     peak_memory) // cache_block_size)

So total_gpu_memory * self.cache_config.gpu_memory_utilization is used for both the model weights and the KV cache, not just the KV cache.
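
A simplified paraphrase of that budget, with illustrative numbers (peak_memory and cache_block_size below are made-up placeholders, not values measured on this setup):

# Simplified paraphrase of the worker.py sizing above (illustrative numbers only).
total_gpu_memory = 24564 * 2**20       # bytes on an RTX 4090
gpu_memory_utilization = 0.9
peak_memory = 18 * 2**30               # assumed: weights + activation peak seen during profiling
cache_block_size = 2 * 2**20           # assumed: bytes of KV cache per block on this rank

num_gpu_blocks = int(
    (total_gpu_memory * gpu_memory_utilization - peak_memory) // cache_block_size)
print(num_gpu_blocks)  # blocks left for the KV cache after weights + activations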

ChuanhongLi commented 3 months ago

@youkaichao Hi, do you have any idea about the memory consumed by the CUDA graphs?

Memory usage shown by nvidia-smi is 23859MiB / 24564MiB on each card; when I disable CUDA graphs with --enforce-eager, the usage drops to 21395MiB / 24564MiB (close to 24564 * 0.9).

23859 - 21395 = 2464 MiB is consumed by the CUDA graphs. Too much! Is this normal?

youkaichao commented 3 months ago

If you take a look at the log, you should notice:

CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing gpu_memory_utilization or enforcing eager mode. You can also reduce the max_num_seqs as needed to decrease memory usage.

That's the cost of running faster.
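
For anyone landing here: the same knobs exist on the offline LLM API, e.g. a minimal sketch mirroring the command above:

# Trade throughput for memory: enforce_eager=True skips CUDA graph
# capture, reclaiming the extra 1~3 GiB per GPU at some latency cost.
from vllm import LLM

llm = LLM(
    model="/workspace/sdb/models/Yi/Yi-1.5-34B-Chat/",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.9,
    enforce_eager=True,   # same effect as --enforce-eager on the server
)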

gamersover commented 3 months ago

I set gpu_memory_utilization to 0.7. Initially, the memory usage was indeed at 0.7, but over time (after receiving many requests) it increased to 0.95. There were no active requests when I checked. When there are no requests, shouldn't the memory usage return to around 0.7?
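
One possible explanation (an assumption, not confirmed for this report): nvidia-smi counts memory reserved by the process, and PyTorch's caching allocator keeps freed blocks cached instead of returning them to the driver, so the reported number tends to only grow. From inside the process the two views can be told apart:

# Sketch: distinguish live tensor memory from allocator-cached memory.
# nvidia-smi roughly tracks the reserved number plus CUDA context overhead.
import torch

print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.0f} MiB")  # live tensors
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.0f} MiB")   # live + cached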

github-actions[bot] commented 4 days ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!