vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Usage]: I have two GPUs, how do I make my model run on 2 GPUs? #3908

Closed: hxujal closed this issue 5 months ago

hxujal commented 5 months ago

Your current environment

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen1.5-0.5B-Chat --model /home/project/models/qwen-0.5b

How would you like to use vllm

I want to run inference of a [specific model](put link here). I don't know how to integrate it with vllm.

hxujal commented 5 months ago

(screenshot attached) I couldn't load the model using 1 GPU.

jeejeelee commented 5 months ago

set --tensor-parallel-size 2
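
For example, the launch command from the original report would become (a sketch reusing the model path and served name given above):

python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen1.5-0.5B-Chat \
    --model /home/project/models/qwen-0.5b \
    --tensor-parallel-size 2

With this flag, vLLM shards the model across the two visible GPUs instead of trying to fit it on one.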

peacefulluo commented 5 months ago

How do I specify which GPU to run on? For example, so that only cuda:0 or cuda:1 is used.

jeejeelee commented 5 months ago

@peacefulluo FYI https://github.com/vllm-project/vllm/issues/2387
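
A common way to pin the server to a single GPU is to restrict the visible devices with the standard CUDA environment variable before launching; a sketch, reusing the command from the original report:

CUDA_VISIBLE_DEVICES=1 python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen1.5-0.5B-Chat \
    --model /home/project/models/qwen-0.5b

Here only the device that CUDA reports as index 1 is exposed to the process, so vLLM sees it as cuda:0.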

peacefulluo commented 5 months ago

Thank you.

ANYMS-A commented 4 months ago

set --tensor-parallel-size 2

Hi, I'd like to know what advantage running my LLM on 2 GPUs brings. Does it run faster, or does it just split the model into two parts across the devices?