sustainable-computing-io / kepler-doc

Kepler uses eBPF to probe energy-related system stats and exports them as Prometheus metrics
https://sustainable-computing.io/
Apache License 2.0

GPU Nvidia H100 PCIe Not Supported #135

Closed sbiaudet closed 6 months ago

sbiaudet commented 6 months ago

I've deployed Kepler on a Kubernetes cluster to monitor a GPU node with an NVIDIA H100 PCIe.

In the Kepler logs from this node, I see the error below. In parallel, I'm monitoring this GPU with a dcgm-exporter instance, which collects GPU energy consumption metrics correctly.

I0125 07:04:48.972351 1 power.go:86] Failed to collect GPU metrics, trying to initizalize again: failed to get processes' utilization on device {0x7f639b40bdf8}: Not Supported
I0125 07:04:48.972407 1 gpu_nvml.go:62] found 1 gpu devices
I0125 07:04:48.972416 1 gpu_nvml.go:73] GPU 0 NVIDIA H100 PCIe

Do you have any idea what could cause this?
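
One possible reading of the log above is that the per-process utilization query is not supported on this device, while device-level readings are still available (the dcgm-exporter result points that way). The Go sketch below illustrates how a collector could degrade to device-level power in that case instead of failing outright. It is only an illustration under stated assumptions: `gpuDevice`, `collect`, `fakeDevice`, and `ErrNotSupported` are hypothetical names, not Kepler's or NVML's actual API, and the 65000 mW value is a made-up placeholder, not a measurement.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotSupported is a hypothetical stand-in for the driver's
// "Not Supported" return code seen in the log.
var ErrNotSupported = errors.New("not supported")

// gpuDevice is a hypothetical abstraction over a GPU handle.
// It is NOT the real Kepler or NVML interface.
type gpuDevice interface {
	// ProcessUtilization returns PID -> utilization (%) per process.
	ProcessUtilization() (map[uint32]uint32, error)
	// PowerUsageMilliwatts returns the device-level power draw.
	PowerUsageMilliwatts() (uint32, error)
}

// sample holds one collection result. PerProcess is nil when the
// device cannot report per-process utilization.
type sample struct {
	PowerMilliwatts uint32
	PerProcess      map[uint32]uint32
}

// collect reads device-level power first, then tries per-process
// utilization. A "not supported" error is tolerated; any other
// error is returned to the caller.
func collect(dev gpuDevice) (sample, error) {
	power, err := dev.PowerUsageMilliwatts()
	if err != nil {
		return sample{}, fmt.Errorf("failed to read device power: %w", err)
	}
	s := sample{PowerMilliwatts: power}

	procs, err := dev.ProcessUtilization()
	switch {
	case err == nil:
		s.PerProcess = procs
	case errors.Is(err, ErrNotSupported):
		// Degrade gracefully: keep device-level power only.
	default:
		return sample{}, fmt.Errorf("failed to get processes' utilization: %w", err)
	}
	return s, nil
}

// fakeDevice simulates a GPU that rejects the per-process query.
// The power value is an arbitrary placeholder.
type fakeDevice struct{}

func (fakeDevice) ProcessUtilization() (map[uint32]uint32, error) {
	return nil, ErrNotSupported
}

func (fakeDevice) PowerUsageMilliwatts() (uint32, error) {
	return 65000, nil
}

func main() {
	s, err := collect(fakeDevice{})
	fmt.Println(s.PowerMilliwatts, s.PerProcess == nil, err)
}
```

With this shape, a device that rejects the per-process query still yields a device-level sample, and only unexpected errors reach the caller.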

sbiaudet commented 6 months ago

Wrong repository, sorry. I'll move this to the Kepler repository.