triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

High GPU consumption when deploying into Kubernetes #5513

Open ricard-borras-veriff opened 1 year ago

ricard-borras-veriff commented 1 year ago

Description

When deploying a Triton server to Kubernetes with several replicas, different pods allocate different amounts of GPU memory. All pods point to the same model repository (the models it contains are listed in the boot log below).

After deploying, the GPU memory allocated by each pod can differ, and it changes again after the pods are restarted. In the attached screenshot, the nv_gpu_memory_used_bytes Prometheus metric is plotted for each pod across 3 restarts (each series is a different pod id); the memory varies from under 3 GB (the theoretical memory consumption of the whole model repository) to almost 8 GB.

[Screenshot 2023-03-16 at 13 21 49: nv_gpu_memory_used_bytes plotted per pod]

I have verified that the reported metrics are correct by running nvidia-smi on a few random pods.

This is the boot log of a given pod:

[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:50.798757701Z I0316 14:16:50.798667 93 python_be.cc:1856] TRITONBACKEND_ModelInstanceInitialize: face_detection_preprocess_0_3 (CPU device 0)
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.098999922Z I0316 14:16:51.098899 93 model_lifecycle.cc:694] successfully loaded 'face_detection_preprocess' version 1
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161639603Z I0316 14:16:51.161526 93 model_lifecycle.cc:459] loading: face_detection_pipeline:1
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161879777Z I0316 14:16:51.161801 93 model_lifecycle.cc:694] successfully loaded 'face_detection_pipeline' version 1
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161961315Z I0316 14:16:51.161892 93 server.cc:563]
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161970300Z +------------------+------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161981874Z | Repository Agent | Path |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161984129Z +------------------+------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161986205Z +------------------+------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.161988228Z
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162026765Z I0316 14:16:51.161960 93 server.cc:590]
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162037806Z +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162042155Z | Backend     | Path                                                            | Config                                                                                                                                                        |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162046410Z +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162049149Z | python      | /opt/tritonserver/backends/python/libtriton_python.so           | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162052380Z | onnxruntime | /opt/tritonserver/backends/onnxruntime/libtriton_onnxruntime.so | {"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}} |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162055253Z +-------------+-----------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162057740Z
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162072073Z I0316 14:16:51.162015 93 server.cc:633]
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162081202Z +---------------------------+---------+--------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162083651Z | Model                     | Version | Status |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162085896Z +---------------------------+---------+--------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162088012Z | face_detection_inferencer | 2       | READY  |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162090021Z | face_detection_pipeline   | 1       | READY  |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162092104Z | face_detection_preprocess | 1       | READY  |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162094152Z +---------------------------+---------+--------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.162096919Z
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.173611557Z I0316 14:16:51.173524 93 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.173874701Z I0316 14:16:51.173820 93 metrics.cc:757] Collecting CPU metrics
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174064230Z I0316 14:16:51.173996 93 tritonserver.cc:2264]
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174073755Z +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174084505Z | Option                           | Value                                                                                                                                                                                                |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174087524Z +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174089871Z | server_id                        | triton                                                                                                                                                                                               |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174092373Z | server_version                   | 2.29.0                                                                                                                                                                                               |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174094803Z | server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174097214Z | model_repository_path[0]         | s3://.........                                                                                                                                                          |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174099600Z | model_control_mode               | MODE_NONE                                                                                                                                                                                            |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174102485Z | strict_model_config              | 0                                                                                                                                                                                                    |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174104876Z | rate_limit                       | OFF                                                                                                                                                                                                  |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174107311Z | pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                            |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174109637Z | cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                             |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174111983Z | response_cache_byte_size         | 0                                                                                                                                                                                                    |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174114312Z | min_supported_compute_capability | 6.0                                                                                                                                                                                                  |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174117320Z | strict_readiness                 | 1                                                                                                                                                                                                    |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174123985Z | exit_timeout                     | 30                                                                                                                                                                                                   |
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174127876Z +----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174131496Z
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.174291199Z I0316 14:16:51.174243 93 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
[pod/face-detection-api-f9d775b87-n8z9x/face-detection-api] 2023-03-16T14:16:51.215467797Z I0316 14:16:51.215353 93 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002

Triton Information

I am using the Triton 22.12 Docker image.

Are you using the Triton container or did you build it yourself?

Triton container

To Reproduce

Restart the pods and observe that the allocated GPU memory differs between restarts.

Expected behavior

All pods should consume the same amount of GPU memory, and it should stay constant after restarting them.

This looks like a bug in Triton server; could someone take a look?

Thanks!

Tabrizian commented 1 year ago

Are the pods deployed on the same GPU on restart? From the screenshot you attached, it seems like there are two GPUs in your k8s cluster. The models can have a different GPU memory footprint depending on the GPU they are deployed on.

ricard-borras-veriff commented 1 year ago

Do you mean the same GPU unit or the same GPU model? In our case, pods have access to a shared pool of GPUs, all of the same type, and some pods may or may not share a GPU (depending on the available resources). Does this metric reflect total GPU usage? That is, if the model takes 2 GB and is started in 2 pods sharing the same GPU, will the metric report 4 GB?

thanks


Tabrizian commented 1 year ago

I meant the same GPU type. By any chance, is it possible that multiple GPUs are exposed to the pod in one case and a single GPU in the other? What is the model configuration of the Python and ONNX models? If you are using KIND_GPU, Triton will create one model instance for each GPU, so if the pod is scheduled on a multi-GPU node the memory usage can multiply by the number of GPUs exposed.
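Note that a similar default applies when no instance_group is specified at all: with auto-complete-config enabled, Triton creates one KIND_GPU instance per GPU visible to the pod. If that turns out to be the cause, pinning the instance to a single GPU should keep the footprint constant. A minimal config.pbtxt sketch (the model name is taken from the boot log above; this is illustrative, not the actual file):

name: "face_detection_inferencer"
backend: "onnxruntime"
instance_group [
  {
    # one instance, bound to a single GPU, so the footprint does not scale
    # with the number of GPUs exposed to the pod
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]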

ricard-borras-veriff commented 1 year ago

Hi

All the GPUs are the same type in all pods. Regarding the model configuration files, I only specify KIND_CPU for the Python model; the ONNX model is left without any kind setting.
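For reference, a simplified sketch of the two configs as described (illustrative only; everything other than the kind setting is an assumption, not the actual files):

# face_detection_preprocess/config.pbtxt (Python backend, pinned to CPU)
name: "face_detection_preprocess"
backend: "python"
instance_group [
  {
    count: 1
    kind: KIND_CPU
  }
]

# face_detection_inferencer/config.pbtxt contains no instance_group section,
# so instance placement is left to Triton's auto-complete defaults.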

thanks
