vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: When I use the `python -m vllm.entrypoints.openai.api_server` command, I cannot use multiple GPUs #7538

Open YinSonglin1997 opened 4 weeks ago

YinSonglin1997 commented 4 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35
Python version: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-186-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.91
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
GPU 3: NVIDIA A100-PCIE-40GB
Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
```

🐛 Describe the bug

I have 4 A100s on my machine and I want to deploy Llama-3.1-70B on them. I used the following command:

`CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --model /ldata/llms/Meta-Llama-3.1-70B-Instruct --device auto --dtype auto --api-key CPMAPI`

But it does not run across multiple cards, and it reports an error: `torch.OutOfMemoryError: CUDA out of memory`. What should I do?
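For context, a rough estimate assuming bf16/fp16 weights (2 bytes per parameter): the 70B model's weights alone occupy about

$$70 \times 10^{9}\ \text{params} \times 2\ \text{bytes/param} \approx 140\ \text{GB},$$

which far exceeds a single 40 GB A100, so the model has to be sharded across the four GPUs rather than loaded onto one.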

DarkLight1337 commented 4 weeks ago

Try setting the `--tensor-parallel-size` argument to the desired number of GPUs to use.
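For the setup in this report, that would look something like the sketch below; it simply adds `--tensor-parallel-size 4` to the original command and leaves the other flags unchanged:

```bash
# Shard the 70B model across all four visible A100s with tensor parallelism
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server \
    --model /ldata/llms/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --device auto \
    --dtype auto \
    --api-key CPMAPI
```

Each GPU then holds roughly a quarter of the weights instead of the whole model, which is what avoids the single-GPU out-of-memory error.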