I've deployed Kepler on Kubernetes to monitor a cluster that includes a GPU node with an NVIDIA H100 PCIe.
In the Kepler logs from this node, I see the error below. In parallel, I'm monitoring this GPU with a dcgm-exporter instance, which collects GPU energy consumption metrics correctly.
I0125 07:04:48.972351 1 power.go:86] Failed to collect GPU metrics, trying to initizalize again: failed to get processes' utilization on device {0x7f639b40bdf8}: Not Supported
I0125 07:04:48.972407 1 gpu_nvml.go:62] found 1 gpu devices
I0125 07:04:48.972416 1 gpu_nvml.go:73] GPU 0 NVIDIA H100 PCIe
Do you have any idea what could be causing this?