Closed KimMinSang96 closed 3 months ago
Problem solved. I followed this guide: "https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/docs/llama_multi_instance.md#orchestrator-mode"
Hi there, could you clarify a bit about what you did? Did you make a separate model repository entry for each GPU (so 8 models total)? That's what I have right now and it works, but creating a new model for each GPU seems unnecessary.
I have 8 GPUs with 24GB of memory each, and I want to serve the Llama 7B model. My goal is to load 3 instances of the model per GPU with INT8 quantization, so that I can serve a total of 24 model instances across the 8 GPUs. However, I am unsure of the best way to achieve this. Can you help? My TensorRT-LLM branch is v0.7.0 :-)
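For reference, the orchestrator-mode approach linked above boils down to launching one Triton server that spawns a worker process per model copy, and pinning each copy to a GPU in its model config. A minimal sketch of what one model's `config.pbtxt` might look like, based on the tensorrtllm_backend docs (the `count` of 3 matches the 3-instances-per-GPU goal here; the `gpu_device_ids` value and the exact parameter spelling should be checked against your backend version, since v0.7.0 may differ):

```
backend: "tensorrtllm"

instance_group [
  {
    # illustrative: 3 instances of this model on one GPU
    count: 3
    # in orchestrator mode the backend manages GPU placement itself,
    # so instances are declared as KIND_CPU
    kind: KIND_CPU
  }
]

parameters: {
  key: "gpu_device_ids"
  # assumption: pin this model copy to GPU 0; other copies would use 1..7
  value: { string_value: "0" }
}
```

The linked doc's orchestrator mode is what avoids manually running 8 separate servers: a single launch (via the repo's launch script with its multi-model option) coordinates the per-GPU workers. Whether 3 INT8 Llama 7B instances actually fit in 24GB alongside KV cache depends on max batch size and sequence length, so that count is a goal to validate, not a guarantee.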