triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Multiple configuration files for the same model #4351

Open issamemari opened 2 years ago

issamemari commented 2 years ago

Is your feature request related to a problem? Please describe.

I would like to deploy the same model with different configurations. In my production environment I have different kinds of machines with different numbers of GPU devices, so I would like to be able to configure instance_group differently for the same model depending on the machine type.

For example:
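Something like the following, where the file names and values are only hypothetical illustrations: one config file per machine type, identical except for instance_group.

# config2gpus.pbtxt, for machines with 2 GPUs
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

# config4gpus.pbtxt, for machines with 4 GPUs
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]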

In my current setup, there is a shared volume accessible from all my machines. This is where the model repository lies. I have the model repository in the shared volume to avoid copying the model files onto every machine for every deployment of Triton.

Describe the solution you'd like

I would like the name/path of the config.pbtxt file to be configurable when launching Triton. Something like this:

docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repo:/models nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository=/models --model-control-mode=explicit --load-model my_model --default-config-filename config4gpus.pbtxt

Describe alternatives you've considered

I've considered duplicating the model directory once for each configuration file, but that requires more manual work and results in unnecessary duplication of model files.

rmccorm4 commented 2 years ago

Hi @issamemari ,

I understand the ask here is more generic, but in the meantime, if your need is specific to GPU device management, you could use either the --gpus flag to docker run or the CUDA_VISIBLE_DEVICES environment variable when starting the container on each machine to isolate the GPUs as described in your example, and keep the config file simple so that it uses all "available" GPUs. Hope this helps.
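
For instance (a rough sketch based on the command above; the device IDs are just placeholders), either of these would expose only GPUs 0 and 1 to everything running in the container:

docker run --rm --shm-size=1g --gpus '"device=0,1"' -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repo:/models nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository=/models

# or keep all GPUs attached to the container but hide some from CUDA
docker run --rm --shm-size=1g --gpus=all -e CUDA_VISIBLE_DEVICES=0,1 -p8000:8000 -p8001:8001 -p8002:8002 -v/path/to/model/repo:/models nvcr.io/nvidia/tritonserver:22.04-py3 tritonserver --model-repository=/models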

issamemari commented 2 years ago

Hi @rmccorm4! Thank you for your response.

In my case I have many models to load in Triton and I would like to configure them to run on different GPUs.

For example:
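Something like this (model names and placements are hypothetical): on a 4-GPU machine I want model_a to run on GPUs 0 and 1 and model_b on GPUs 2 and 3, while on a 2-GPU machine each model should get a single GPU. The 4-GPU configs would look roughly like:

# model_a config for the 4-GPU machines
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]

# model_b config for the 4-GPU machines
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 2, 3 ]
  }
]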

My understanding is that --gpus and CUDA_VISIBLE_DEVICES limit the visible GPUs for all models at once, so this doesn't work for my use case.

rmccorm4 commented 2 years ago

Ah, I see: limiting GPUs per model for multiple models in the same tritonserver process wouldn't work well with this approach. Managing multiple tritonserver containers per machine with that GPU isolation (and port bindings) would likely require more configuration and coordination on your end than simply duplicating model repositories with slightly different configs, which is what you suggested and would like to avoid. Thanks for sharing more details on the use case.

We filed a ticket [DLIS-4723] on this and will look into it once it is prioritized.

kadmor commented 8 months ago

Hello. Has there been any progress on this task? I have a similar requirement: I want tritonserver to be deployed on different machines with different max_batch_size values. I have a powerful machine for the master branch and a weaker machine for the development branch.

dyastremsky commented 7 months ago


While we work on this enhancement, you can use the model load API and provide the configuration as part of the load request, so you can pass a different configuration on each machine.
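
For example, with the HTTP API (a rough sketch: my_model and the values are placeholders, the server must be running with --model-control-mode=explicit, and a configuration passed this way is used instead of the config.pbtxt in the model directory, so in practice it should contain the complete configuration you need):

curl -X POST localhost:8000/v2/repository/models/my_model/load \
  -d '{"parameters": {"config": "{\"max_batch_size\": 4, \"instance_group\": [{\"count\": 1, \"kind\": \"KIND_GPU\"}]}"}}'

Each machine (or branch) can then send its own config string at load time instead of relying on a single config.pbtxt in the shared repository.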