Triton uses the library version of DCGM which is not allowed to co-exist with the container version of DCGM
The release notes for 24.08 list a known issue about this, and I also found https://github.com/triton-inference-server/server/issues/3897#issuecomment-1035414009, which said the same: Triton uses the library version of DCGM, which is not allowed to co-exist with the container version of DCGM.

I'm using Google Kubernetes Engine and followed this page to try running nvidia-dcgm and nvidia-dcgm-exporter pods (DaemonSet). It turned out that Triton metrics worked even with the separate DCGM running in another pod.
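(For reference, this is roughly how the exporter DaemonSet can be deployed; a sketch of one common route, not necessarily the exact manifests the GKE page uses. The repo URL and chart name are from the dcgm-exporter README; the label selector is an assumption based on the chart's defaults.)

```sh
# Add NVIDIA's dcgm-exporter Helm repo and install the chart,
# which creates a DaemonSet running one exporter pod per GPU node.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm repo update
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

# Check that the exporter pods are up (label assumes the chart's defaults).
kubectl get pods -l app.kubernetes.io/name=dcgm-exporter
```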
Triton metrics:
nv_gpu_utilization{gpu_uuid="GPU-c1eb1e78-d69a-c334-9ce5-2cec7c8399a1"} 0
DCGM exporter metrics:
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-c1eb1e78-d69a-c334-9ce5-2cec7c8399a1",device="nvidia0",modelName="Tesla T4",Hostname="gke-test-dcgm-default-pool-270c4b72-ct4k",container="triton-server",namespace="default",pod="triton-86c854b54b-84l8q"} 0
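(These lines are Prometheus exposition output; they can be checked directly against the two pods' metrics endpoints. A sketch, assuming the default ports, 8002 for Triton's metrics endpoint and 9400 for dcgm-exporter, and a port-forward or exec into the pods:)

```sh
# Triton's metrics endpoint (default port 8002).
curl -s localhost:8002/metrics | grep nv_gpu_utilization

# dcgm-exporter's metrics endpoint (default port 9400).
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```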
Based on this observation, I have several questions:

1. Does the issue where the two DCGM versions can't coexist still exist?
2. If it still exists, is there a way to make sure Triton metrics stay available alongside a separate DCGM, without rebuilding Triton to disable DCGM? Some of our GPU workloads don't use Triton, so we need the separate DCGM for them.
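(To clarify what I mean by "disable DCGM" in question 2: I know GPU metric collection can be switched off at startup without a rebuild via tritonserver's documented flags; what I'm after is keeping it on. The flag names below are from the server CLI; /models is a placeholder model-repository path.)

```sh
# Keeps the /metrics endpoint but skips DCGM-backed GPU metrics
# (nv_gpu_* series disappear; request/queue metrics remain).
tritonserver --model-repository=/models --allow-gpu-metrics=false

# Disables the metrics endpoint entirely.
tritonserver --model-repository=/models --allow-metrics=false
```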
Tested versions