triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Running multi-GPU and replicating models #7737

Open JoJoLev opened 4 weeks ago

JoJoLev commented 4 weeks ago

I currently have an LLM engine built with TensorRT-LLM and am evaluating different deployment setups and the gains from each. I would like to deploy the Llama model across 4 GPUs, with a copy of the model running on each GPU. Is this possible with the NVIDIA Triton Inference Server container?

rmccorm4 commented 3 weeks ago

Hi @JoJoLev, there is a guide for this exact use case here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md. Please let us know if this helps.
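For general context, Triton can replicate a model across GPUs through the `instance_group` setting in the model configuration. Below is a minimal `config.pbtxt` sketch assuming a single-GPU (TP=1) engine and four visible GPUs; note that the TensorRT-LLM backend has its own leader/orchestrator modes for multi-instance serving, which are covered in the linked guide, so treat this only as an illustration of the general mechanism.

```
# config.pbtxt (sketch) -- place one model instance on each of GPUs 0-3.
# When "gpus" is listed, "count" is the number of instances per listed GPU,
# so this yields four instances total, one per GPU.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```

Requests sent to the model name are then distributed by Triton's scheduler across the four instances.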