How would you like to use vllm

My server's GPUs are connected by NVLink only within the pairs [GPU0, GPU2] and [GPU1, GPU3], so I want to combine tensor parallelism and pipeline parallelism within a single node to improve performance: tensor parallelism should run inside each NVLinked pair, and pipeline parallelism should run between the two pairs. How should I specify the api_server parameters so that the parallel groups map onto these specific GPUs?
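For concreteness, this is the kind of launch I have in mind (a sketch only, not a confirmed answer; it assumes vLLM assigns tensor-parallel ranks to consecutive visible devices, so reordering `CUDA_VISIBLE_DEVICES` would place each NVLinked pair in the same tensor-parallel group; `<model-name>` is a placeholder):

```shell
# Reorder the visible devices so physical GPU0/GPU2 and GPU1/GPU3 become
# adjacent. If tensor parallelism is the inner dimension of the rank layout,
# each pair then forms one TP group, and the two pipeline stages span the pairs.
CUDA_VISIBLE_DEVICES=0,2,1,3 python -m vllm.entrypoints.openai.api_server \
    --model <model-name> \
    --tensor-parallel-size 2 \
    --pipeline-parallel-size 2
```

Is this the right way to control the GPU-to-group mapping, or is there a dedicated parameter for it?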
Before submitting a new issue...
[X] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.