triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

How to Replicate and Serve Multiple Instances of a Model? #495

Closed · KimMinSang96 closed this issue 3 months ago

KimMinSang96 commented 3 months ago

I have 8 GPUs with 24 GB of memory each, and I want to serve the Llama7B model. My goal is to load 3 replicas of the model per GPU using INT8, so that I can serve a total of 24 model instances across the 8 GPUs. However, I am unsure of the best way to achieve this. Can you help? My TensorRT-LLM branch is v0.7.0 :-)

KimMinSang96 commented 3 months ago

Problem solved. I followed this guide: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#orchestrator-mode
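For reference, here is a rough sketch of what the orchestrator-mode setup described in that doc could look like for this case (3 instances per GPU across 8 GPUs, i.e. 24 instances of a single-GPU engine). The `instance_group` block is standard Triton model configuration; the `gpu_device_ids` value shown here (semicolon-separated, one entry per instance) is an assumption based on my reading of the linked doc and may differ between releases, so verify it against the doc for your version.

```
# config.pbtxt for the tensorrt_llm model (illustrative sketch, not verbatim from the docs)

# The TensorRT-LLM backend manages GPU placement itself, so the instance
# group is declared as KIND_CPU. 24 instances total: 3 per GPU on 8 GPUs.
instance_group [
  {
    count: 24
    kind: KIND_CPU
  }
]

# Assumed format: one device list per instance, separated by semicolons.
# Each single-GPU engine instance is pinned to one of GPUs 0-7, with every
# GPU id repeated three times. Check the linked doc for the exact syntax
# supported by your release.
parameters: {
  key: "gpu_device_ids"
  value: {
    string_value: "0;1;2;3;4;5;6;7;0;1;2;3;4;5;6;7;0;1;2;3;4;5;6;7"
  }
}
```

The server is then started in orchestrator mode (the doc describes launching via `scripts/launch_triton_server.py` with its multi-model/orchestrator option), so a single model directory can serve instances on every GPU rather than needing one copy of the model repository per GPU.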

LanceB57 commented 2 months ago

Hi there, could you clarify what you did? Did you make a separate model for each GPU (so 8 models total)? That's what I have right now and it works, but having to create a new model for every GPU seems unnecessary.