vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: Is it possible to pin `LLM` to a specific CUDA device? #3750

Open mgerstgrasser opened 5 months ago

mgerstgrasser commented 5 months ago

Your current environment

-

How would you like to use vllm

I'd like to use multiple vLLM instances in the same Python script, each on a different CUDA device. Is it possible to pin an LLM object to a specific device? I don't see an option for this in either LLM or EngineArgs.

I understand this would be somewhat tricky with tensor parallelism over Ray, but if each LLM uses only a single GPU, it should be relatively easy to pass in a CUDA device for the model to use, I think?

Curious if this is already possible somehow, or if the vllm team would be open to a PR on this!

mgerstgrasser commented 5 months ago

To add: simply passing device="cuda:3" to LLM() does not work. The model ends up on the correct device, but some input tensors (even during initialization) are still placed on cuda:0, so forward passes fail.
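
Roughly, the attempt looks like the following sketch; the model name is just an example.

from vllm import LLM

llm = LLM(model="facebook/opt-125m", device="cuda:3")  # weights end up on cuda:3 ...
outputs = llm.generate(["Hello"])                      # ... but some tensors are still created on cuda:0, so this fails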

thefirebanks commented 5 months ago

Would restricting the visible GPUs (os.environ["CUDA_VISIBLE_DEVICES"] = "3") be a good workaround, or do you mean something different?

mgerstgrasser commented 5 months ago

Would restricting the visible GPUs (os.environ["CUDA_VISIBLE_DEVICES"] = "3") be a good workaround, or do you mean something different?

No. To be clear, I want multiple LLM instances in the same Python process. I don't think there's any way to modify the visible devices like that within a single process, is there?

What I want to do is something like

llm1 = LLM(..., device="cuda:0")
llm2 = LLM(..., device="cuda:1")
some_string = llm1.generate(...)[0].outputs[0].text + llm2.generate(...)[0].outputs[0].text

youkaichao commented 5 months ago

You can do something more transparent:

export CUDA_VISIBLE_DEVICES=6
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
export CUDA_VISIBLE_DEVICES=3
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8080
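
(Each pair of commands is presumably run in its own shell; from a single shell, the same idea would look like this sketch, with each server backgrounded:)

CUDA_VISIBLE_DEVICES=6 python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8080 &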

And this is my nvidia-smi:

Sat Mar 30 14:28:36 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla V100-SXM2-32GB-LS        On  | 00000000:06:00.0 Off |                    0 |
| N/A   34C    P0              42W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2-32GB-LS        On  | 00000000:07:00.0 Off |                    0 |
| N/A   37C    P0              42W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2-32GB-LS        On  | 00000000:0A:00.0 Off |                    0 |
| N/A   36C    P0              41W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2-32GB-LS        On  | 00000000:0B:00.0 Off |                    0 |
| N/A   35C    P0              53W / 250W |  29337MiB / 32768MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2-32GB-LS        On  | 00000000:85:00.0 Off |                    0 |
| N/A   35C    P0              41W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2-32GB-LS        On  | 00000000:86:00.0 Off |                    0 |
| N/A   37C    P0              42W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2-32GB-LS        On  | 00000000:89:00.0 Off |                    0 |
| N/A   38C    P0              46W / 250W |  29603MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2-32GB-LS        On  | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0              43W / 250W |      0MiB / 32768MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    3   N/A  N/A   3470867      C   python                                    29322MiB |
|    6   N/A  N/A   3470235      C   python                                    29588MiB |

mgerstgrasser commented 5 months ago

You can do something more transparent:

export CUDA_VISIBLE_DEVICES=6
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
export CUDA_VISIBLE_DEVICES=3
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --port 8080

Haha, yes, that's actually exactly what I've been doing as a workaround, but there are disadvantages to that approach, so I'm wondering if there's a way to avoid going through the API server.
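
For context, a rough sketch of the client side of that workaround, assuming the openai Python package (>= 1.0) and the two servers above listening on the default port 8000 and on 8080:

from openai import OpenAI

client_a = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # server pinned to GPU 6 (default port)
client_b = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")  # server pinned to GPU 3

resp_a = client_a.completions.create(model="facebook/opt-125m", prompt="Hello", max_tokens=16)
resp_b = client_b.completions.create(model="facebook/opt-125m", prompt="World", max_tokens=16)
some_string = resp_a.choices[0].text + resp_b.choices[0].text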

youkaichao commented 5 months ago

I don't see a better method; in fact, I think this is the best one. It isn't limited to the API server and should work however you use vLLM: just execute your command twice with different CUDA_VISIBLE_DEVICES.
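
For example (a sketch, where offline_infer.py stands in for any script that builds a single-GPU LLM and calls generate):

CUDA_VISIBLE_DEVICES=0 python offline_infer.py &  # hypothetical offline-inference script, pinned to GPU 0
CUDA_VISIBLE_DEVICES=1 python offline_infer.py &  # second copy, pinned to GPU 1
wait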

RuixiangZhao commented 2 months ago

Would restricting the visible GPUs (os.environ["CUDA_VISIBLE_DEVICES"] = "3") be a good workaround, or do you mean something different?

No. To be clear, I want multiple LLM instances in the same Python process. I don't think there's any way to modify the visible devices like that within a single process, is there?

What I want to do is something like

llm1 = LLM(..., device="cuda:0")
llm2 = LLM(..., device="cuda:1")
some_string = llm1.generate(...)[0].outputs[0].text + llm2.generate(...)[0].outputs[0].text

@mgerstgrasser I have the same need; have you found a solution?

Before using vLLM, I could specify the GPU like this:

model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, local_files_only=True).to('cuda:3')

mgerstgrasser commented 2 months ago

@mgerstgrasser I have the same need; have you found a solution?

I haven't found a way to pin an LLM object to a device. But as per the messages above, you can start a separate API server per GPU instead; that seems to be the best (and only) way to do this.
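
For anyone landing here: below is a rough, untested sketch of applying the per-process CUDA_VISIBLE_DEVICES trick from a single Python script by spawning one subprocess per GPU. The model name is just a placeholder.

import multiprocessing as mp
import os

def worker(gpu_id, prompt, queue):
    # Restrict this process to one GPU *before* vLLM / CUDA initialize.
    os.environ["CUDA_VISIBLE_DEVICES"] = gpu_id
    from vllm import LLM, SamplingParams  # import after setting the env var
    llm = LLM(model="facebook/opt-125m")  # placeholder model
    out = llm.generate([prompt], SamplingParams(max_tokens=32))
    queue.put((gpu_id, out[0].outputs[0].text))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # avoid forking an existing CUDA context
    queue = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(gpu, "Hello, my name is", queue)) for gpu in ("0", "1")]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    while not queue.empty():
        print(queue.get())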