triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Triton x vLLM backend GPU selection issue #7786

Open Tedyang2003 opened 3 days ago

Tedyang2003 commented 3 days ago

Description

I am currently using the Triton vLLM backend in my Kubernetes cluster. Triton is able to see both of my GPUs, however it seems to only use GPU 0 to load the model weights.

I have set my config.pbtxt instance groups to be:

Model A

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Model B

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]

The expectation was for Model A to be loaded onto the GPU with index 0 and Model B onto the GPU with index 1. With Triton's verbose logging turned on, I could see that Triton detected both GPUs, identified them, and reported loading onto them; however, nvidia-smi showed that my models were only being loaded onto GPU 0.
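As a quick sanity check (my own suggestion, not something that was in the original report), you can confirm from a Python shell inside the Triton container that both GPUs are visible to the process before any per-model masking happens:

```python
import torch

# Both physical GPUs should be visible to the process at this point.
print(torch.cuda.device_count())               # expected: 2
for idx in range(torch.cuda.device_count()):
    print(idx, torch.cuda.get_device_name(idx))
```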

My hypothesis for why it isn't working is that the bridge between Triton's GPU assignment and vLLM's GPU selection has a bug in its implementation.

So I took a look at the Triton vLLM backend's model.py file (particularly the validate_device_config method).

[Screenshot: the validate_device_config method in model.py]

The method identifies the GPU to be used, but the only line that actually sets the GPU is a call to torch.cuda.set_device().

However, from my research on selecting specific GPUs for standalone vLLM, setting CUDA_VISIBLE_DEVICES is described as the only way to control GPU selection, and I do not see that implemented here.
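To illustrate the distinction (this is a minimal standalone-vLLM sketch of my own, not the backend's code, and the model name is just a placeholder): masking with CUDA_VISIBLE_DEVICES hides the other GPU from the whole process, whereas torch.cuda.set_device() only changes the default device.

```python
import os

# Hide every GPU except physical GPU 1 from this process. This must happen
# before CUDA is initialized (i.e., before torch/vLLM touch the GPU).
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
from vllm import LLM

# Within this process, physical GPU 1 now appears as cuda:0.
llm = LLM(model="facebook/opt-125m")  # placeholder model
print(torch.cuda.device_count())      # expected: 1
```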

References:

I have done single-GPU serving before, where I did not select a specific GPU, and that setup works fine, with my models functioning well.

**I am well aware that I can control GPU use with my own Kubernetes resource assignments (nvidia.com/gpu), as well as use tensor_parallel_size to split each model across GPUs.

My current goal is to get at least some confirmation as to why Triton's supposed GPU selection in config.pbtxt is not working for vLLM.**

Triton Information

What version of Triton are you using? The version I'm using is "nvcr.io/nvidia/tritonserver:24.10-vllm-python-py3".

Are you using the Triton container or did you build it yourself? I am using a pre-built container from the official NGC registry.

To Reproduce

My setup uses OpenShift on top of Kubernetes, so it may be challenging to recreate exactly. However, I am just loading two models, each assigned to a separate GPU, on the same Triton vLLM server.

Models:

The model configs are simple, bare-minimum, non-ensemble configurations, containing only the instance_group settings shown above.

Expected behavior

I expect the Triton vLLM backend to load my models individually onto different GPUs.

rmccorm4 commented 2 days ago

Hi @Tedyang2003, thanks for raising this issue!

Do you mind trying to replace that line you've identified

torch.cuda.set_device(triton_device_id)

with something like this:

os.environ["CUDA_VISIBLE_DEVICES"] = triton_device_id

and report back whether it behaves as you'd expect or not?
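For reference, here is a hedged sketch of the idea as a standalone helper (the helper name is mine, not from the backend, and note that environment variable values must be strings):

```python
import os

def pin_vllm_instance_to_gpu(triton_device_id: int) -> None:
    """Hypothetical helper capturing the suggestion above: instead of only
    switching the default CUDA device, hide every GPU except the one Triton
    assigned to this model instance. It has to run before the vLLM engine
    initializes CUDA in the process."""
    # Environment variable values must be strings, so convert the device id.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(triton_device_id)

# e.g. the instance configured with gpus: [1] in config.pbtxt
pin_vllm_instance_to_gpu(1)
```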

Tedyang2003 commented 2 days ago

Hi @rmccorm4, thanks for the prompt reply. I am currently unable to try this due to my tight schedule. However, I came across an earlier post that you replied to back in February regarding a similar issue.

https://github.com/triton-inference-server/server/issues/6855

The original poster stated in a reply to you that "Thank for your reply. Using KIND_GPU and set CUDA_VISIBLE_DEVICES before initializing vllm engine make it works as expected. I will try starting 4 instances with KIND_MODEL and parsing the model_instance_name."

Based on the confirmation from that older poster and the documentation on selecting GPUs for vLLM using CUDA_VISIBLE_DEVICES, I agree that it will likely work.
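For what it's worth, my rough understanding of that KIND_MODEL workaround looks something like the sketch below; the instance-name format and the helper are assumptions on my part, not code from the backend or from that issue:

```python
import os

def device_from_instance_name(instance_name: str, gpu_count: int) -> int:
    """Hypothetical helper: derive a GPU index from a Triton model instance
    name such as "vllm_model_0_2" by taking the trailing integer modulo the
    number of GPUs. The exact instance-name format is an assumption here."""
    trailing = instance_name.rsplit("_", 1)[-1]
    return int(trailing) % gpu_count

# Pin this instance's process to its derived GPU before the vLLM engine
# initializes CUDA.
gpu_id = device_from_instance_name("vllm_model_0_2", gpu_count=2)
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
```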

I do not need an immediate fix for this, as I can work with the methods currently available to me to manage my GPUs; I am just curious about future updates for this bug. It seems that despite that post being from quite a few months ago, there is still no official change in the NGC image releases.

I just hope you can give me some information on when this will get an official bug fix so I can inform my immediate superiors. Thanks!