ricard-borras-veriff opened 1 year ago
Are the pods deployed on the same GPU on restart? From the screenshot you attached, it seems like there are two GPUs in your k8s cluster. The models can have a different GPU memory footprint depending on the GPU they are deployed on.
Do you mean the same GPU unit or the same GPU model? In our case, pods have access to a shared pool of GPUs, all of the same type, and some pods may share a GPU with others (depending on the available resources). Does this metric reflect total GPU usage? That is, if the model takes 2 GB and is booted up in 2 pods sharing the same GPU, will the metric report 4 GB?
Thanks!
I meant the same GPU type. By any chance, is it possible that multiple GPUs are exposed to the pod in one case and a single GPU in the other? What is the model configuration of the Python and ONNX models? If you are using KIND_GPU, Triton will create one model instance for each visible GPU, so if the pod is scheduled on a multi-GPU system the memory usage could multiply by the number of GPUs.
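To illustrate, here is a minimal `config.pbtxt` sketch (the GPU index is an assumption, adjust for your deployment) that pins a model to a single device, so the instance count, and therefore the memory footprint, does not scale with however many GPUs the pod happens to see:

```
instance_group [
  {
    # One instance, pinned to device 0 only, so the footprint stays
    # constant even when the pod is scheduled on a multi-GPU node.
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
```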
Hi,
All the GPUs are the same type in all pods. Regarding the model configuration files, I only specify KIND_CPU for the Python model; the ONNX model is left without any KIND flag.
Thanks!
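For reference, a minimal sketch of what the two configurations described above might look like (model names and backend strings are placeholders, not taken from this issue):

```
# python_model/config.pbtxt (name hypothetical)
name: "python_model"
backend: "python"
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

# onnx_model/config.pbtxt (name hypothetical)
name: "onnx_model"
platform: "onnxruntime_onnx"
# No instance_group given: Triton defaults to KIND_GPU with one
# instance per visible GPU, which can multiply the memory footprint.
```

With no instance_group in the ONNX config, the default behavior Iman describes applies, which is one place the variable footprint could come from.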
Description
When deploying a Triton server to Kubernetes with several replicas, different pods allocate different amounts of GPU memory. All pods point to the same model repository, which consists of a Python model and an ONNX model.
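As a sketch, the repository would follow Triton's standard layout along these lines (directory and file names are placeholders):

```
model_repository/
├── python_model/
│   ├── config.pbtxt        # KIND_CPU instance_group
│   └── 1/
│       └── model.py
└── onnx_model/
    ├── config.pbtxt        # no KIND flag specified
    └── 1/
        └── model.onnx
```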
After deploying, the GPU memory allocated by each pod can differ, and it changes again after the pods restart. In the attached screenshot, the nv_gpu_memory_used_bytes Prometheus metric is plotted for each pod after 3 restarts (each series is a different pod id); memory varies from under 3 GB (the theoretical memory consumption of the whole model repository) to almost 8 GB.
I have verified that the reported metrics are correct by running nvidia-smi on some random pods.
This is the boot log of a given pod:
Triton Information
I am using the Triton 22.12 Docker image.
Are you using the Triton container or did you build it yourself?
Triton container
To Reproduce
Restart the pods and observe that the GPU memory allocated by each pod differs across restarts.
Expected behavior
All pods should allocate the same amount of GPU memory, and it should remain constant after restarting them.
This looks like a bug in Triton server; could someone take a look?
Thanks!