
[Usage]: How to specify certain GPUs for Tensor Parallelism and Pipeline Parallelism #7958

Open henry-y opened 1 month ago

henry-y commented 1 month ago

Your current environment

I have a server where NVLink connects only certain GPU pairs, so I need to combine tensor parallelism and pipeline parallelism within a single node to get good performance. I would like to know how to assign specific GPUs to this setup: tensor parallelism should run within the NVLink pairs [GPU0, GPU2] and [GPU1, GPU3], and pipeline parallelism should run between those two pairs. How should I specify the api_server parameters to achieve this?


youkaichao commented 1 month ago

`-tp 2 -pp 2` should be enough.
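
A minimal sketch of what the invocation could look like, assuming vLLM places tensor-parallel ranks on consecutive logical device indices within each pipeline stage (which matches the default rank layout): reordering the physical GPUs via `CUDA_VISIBLE_DEVICES` then maps each tensor-parallel pair onto an NVLink-connected pair. `<model>` is a placeholder for your model name or path.

```bash
# Assumption: with -tp 2 -pp 2, vLLM groups consecutive logical device
# indices into one tensor-parallel group per pipeline stage. Reordering
# the physical GPUs then puts each TP pair on an NVLink pair:
#   logical 0,1 -> physical GPU0, GPU2  (pipeline stage 0, TP over NVLink)
#   logical 2,3 -> physical GPU1, GPU3  (pipeline stage 1, TP over NVLink)
CUDA_VISIBLE_DEVICES=0,2,1,3 \
python -m vllm.entrypoints.openai.api_server \
    --model <model> \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2
```

With this layout the bandwidth-hungry tensor-parallel all-reduces stay on NVLink, while only the smaller inter-stage pipeline traffic crosses PCIe.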