vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

CUDA out of memory error despite having enough memory #2554

Open varonroy opened 10 months ago

varonroy commented 10 months ago

I am trying to run mistralai/Mixtral-8x7B-Instruct-v0.1 on a server with two A100s (80GB of GPU RAM in total).

$ python -m vllm.entrypoints.openai.api_server \
    --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
    --tensor-parallel-size 2

vLLM appears to fully utilize the memory of both GPUs, and a CUDA out of memory error is thrown.
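To rule out other processes holding GPU memory before launch, a quick check along these lines (just a sketch, assuming PyTorch is installed in the same environment) reports how much each GPU actually has free:

import torch

# Print free vs. total memory for every visible GPU (mem_get_info returns bytes).
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")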

Here is the output of nvidia-smi.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:65:00.0 Off |                    0 |
| N/A   25C    P0              54W / 250W |  40327MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   25C    P0              50W / 250W |  40327MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    209623      C   ray::RayWorkerVllm                        40314MiB |
|    1   N/A  N/A    209624      C   ray::RayWorkerVllm                        40314MiB |
+---------------------------------------------------------------------------------------+

And here is the error.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 112.00 MiB. GPU 0 has a total capacty of 39.39 GiB of which 10.81 MiB is free. Including non-PyTorch memory, this process has 39.37 GiB memory in use. Of the allocated memory 38.78 GiB is allocated by PyTorch, and 17.81 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

When running the same model with Ollama, the model utilizes only 26GB of GPU RAM. Here is the output of nvidia-smi.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-PCIE-40GB          On  | 00000000:65:00.0 Off |                    0 |
| N/A   24C    P0              35W / 250W |  26591MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-PCIE-40GB          On  | 00000000:CA:00.0 Off |                    0 |
| N/A   23C    P0              30W / 250W |      4MiB / 40960MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    210236      C   /bin/ollama                               26578MiB |
+---------------------------------------------------------------------------------------+

Additionally, according to their model page, this model requires 48GB of RAM.

Now, this might be an apples-to-oranges comparison, but should mistralai/Mixtral-8x7B-Instruct-v0.1, a 46.7B-parameter model, take more than 80GB of GPU RAM, or is this a bug or a misconfiguration?
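For reference, a rough back-of-envelope estimate (assuming the weights are loaded in 16-bit precision, which I believe is the default here) puts the weights alone close to the combined capacity of the two cards:

# Rough weight-memory estimate for Mixtral-8x7B in 16-bit precision.
params = 46.7e9           # parameter count from the model card
bytes_per_param = 2       # bf16 / fp16
print(f"~{params * bytes_per_param / 2**30:.0f} GiB for weights alone")  # ~87 GiB

So roughly 87 GiB of weights would not fit in 2x40GB even before the KV cache is allocated, whereas Ollama presumably fits in ~26GB because it serves a quantized (roughly 4-bit) build of the model.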

I have repeated this experiment with both vLLM versions 0.2.6 and 0.2.7 and with various optional parameters such as --enforce-eager. The results haven't changed.

8bitaby commented 9 months ago

I have the same issue on an A30 GPU.

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacty of 23.50 GiB of which 42.06 MiB is free. Including non-PyTorch memory, this process has 23.40 GiB memory in use. Of the allocated memory 22.99 GiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.

flaviussn commented 9 months ago

I have the same problem using 2xA100 (40GB)

HasnainKhanNiazi commented 8 months ago

I am having the same problem as well. I have 2x A100 (80GB) and I am not able to load Mixtral-8x7B-Instruct-v0.1.

claudiocassimiro commented 7 months ago

Same problem here with an RTX 4090.

sintatsu commented 6 months ago

I have the same problem using 2x NVIDIA L4 (48GB)

wangii commented 5 months ago

Same problem here with 4xA100

Daya-Jin commented 4 months ago

Same problem. Is there any solution yet?

Daya-Jin commented 4 months ago

Lowering --gpu-memory-utilization works for me (8x A800 80GB):

python -m vllm.entrypoints.openai.api_server --model /Qwen-7B-Chat --dtype bfloat16 --api-key token-abc123 --trust-remote-code --gpu-memory-utilization 0.3 --max-model-len 4096
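For anyone hitting this through the offline API instead of the OpenAI-compatible server, the same knobs are available on the LLM constructor. A minimal sketch (the model path and the 0.3 / 4096 values are just the ones from the command above; adjust for your setup):

from vllm import LLM, SamplingParams

# Cap the fraction of GPU memory vLLM preallocates and the maximum sequence
# length so the weights plus KV cache fit on the card(s).
llm = LLM(
    model="/Qwen-7B-Chat",
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    max_model_len=4096,
)

outputs = llm.generate(["Hello!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)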

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!