I currently have an LLM engine built with TensorRT-LLM and am trying to evaluate different deployment setups and the gains each one offers.
I was trying to deploy the Llama model on a multi-GPU node so that each of the 4 GPUs runs its own copy of the model (i.e. data parallelism, rather than splitting one model across the GPUs).
Is this possible with the NVIDIA Triton Inference Server container?
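For context, the kind of setup I had in mind is a plain Triton instance_group that places one model instance on each GPU. The sketch below is just how I imagine it might look (model name, batch size, and the assumption that the engine was built for a single GPU with TP=1 are all placeholders on my part); whether the tensorrt_llm backend actually honors instance_group this way is essentially what I'm asking.

```
# config.pbtxt sketch -- names and values are placeholders
name: "llama_trt"
backend: "tensorrt_llm"
max_batch_size: 8

# One instance on each listed GPU -> 4 independent copies of the model,
# assuming the engine itself was built for a single GPU (TP=1, PP=1).
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0, 1, 2, 3 ]
  }
]
```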