varonroy opened this issue 10 months ago
I have the same issue on an A30 GPU.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacty of 23.50 GiB of which 42.06 MiB is free. Including non-PyTorch memory, this process has 23.40 GiB memory in use. Of the allocated memory 22.99 GiB is allocated by PyTorch, and 1.76 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.
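For what it's worth, the allocator knob that error message points at (max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF) has to be in place before CUDA is initialized. A minimal sketch; the 128 MiB value is purely illustrative, not a recommendation:

```python
import os

# Must be set before the CUDA allocator is initialized, i.e. before the first
# torch.cuda call (setting it before importing torch is the safe ordering).
# 128 MiB is an illustrative value for max_split_size_mb, not a tuned one.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402

print(torch.cuda.is_available())
```

The same setting can of course be exported as an environment variable in the shell that launches vLLM instead.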
I have the same problem using 2xA100 (40GB)
I am having the same problem as well. I have 2x A100 80GB and I am not able to load Mixtral-8x7B-Instruct-v0.1.
Same problem here with an RTX 4090.
I have the same problem using 2x NVIDIA L4 (48GB)
Same problem here with 4xA100
Same problem. Is there any solution yet?
Lowering --gpu-memory-utilization works for me (8x A800 80GB).
python -m vllm.entrypoints.openai.api_server --model /Qwen-7B-Chat --dtype bfloat16 --api-key token-abc123 --trust-remote-code --gpu-memory-utilization 0.3 --max-model-len 4096
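If you are using the Python API rather than the OpenAI-compatible server, the same knob is the gpu_memory_utilization argument of vllm.LLM. A rough sketch mirroring the command above (the prompt is a placeholder, and the model path is the same local path used in that command):

```python
from vllm import LLM, SamplingParams

# gpu_memory_utilization caps the fraction of each GPU's VRAM that vLLM
# pre-allocates for weights plus KV cache (default is 0.9); lowering it
# leaves headroom for other processes sharing the GPU.
llm = LLM(
    model="/Qwen-7B-Chat",          # local path, as in the command above
    dtype="bfloat16",
    trust_remote_code=True,
    gpu_memory_utilization=0.3,
    max_model_len=4096,
)

outputs = llm.generate(["Hello, how are you?"], SamplingParams(max_tokens=64))
print(outputs[0].outputs[0].text)
```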
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
I am trying to run mistralai/Mixtral-8x7B-Instruct-v0.1 on a server with two A100s (a total of 80GB of GPU RAM). vLLM seems to fully utilize the memory of the GPUs, and thus a CUDA out of memory error is thrown. Here is the output of nvidia-smi. And here is the error.

When running the same model with Ollama, the model utilizes only 26GB of GPU RAM. Here is the output of nvidia-smi. Additionally, according to their model page, this model requires 48GB of RAM.

Now, this might be an apples-to-oranges comparison, but should mistralai/Mixtral-8x7B-Instruct-v0.1, a 46.7B model, take more than 80GB of RAM, or is this a bug / some misconfiguration?

I have repeated this experiment with both vLLM versions 0.2.6 and 0.2.7, and with various optional parameters such as --enforce-eager. The results haven't changed.
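For completeness, here is a sketch of the equivalent Python API call with the knobs discussed in this thread (tensor parallelism across the two GPUs, memory utilization, context length, eager mode). The values are illustrative only; this is not a claim that the bf16 weights will fit in 80GB of VRAM:

```python
from vllm import LLM

# Sketch only; all values are illustrative, not a verified working config.
llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,        # shard the weights across both A100s
    dtype="bfloat16",
    gpu_memory_utilization=0.85,   # fraction of VRAM vLLM pre-allocates (default 0.9)
    max_model_len=8192,            # shorter context -> smaller KV-cache reservation
    enforce_eager=True,            # skip CUDA graph capture to save some memory
)
```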