Open cmluciano opened 6 years ago
I added power usage in the first iteration of my initial PR, but then we decided to ship only the bare minimum metrics in the first version and wait for user feedback. NVML exposes a lot of metrics (power usage, temperature, fan speed, etc.), but it's not clear how helpful these are to users running GPU workloads.
While testing, I also noticed that the power usage graph was exactly the same as the duty_cycle graph.
Any update on this topic?
@donghwicha Which additional metrics are you interested in?
I already implemented it myself. Will send a PR soon.
@donghwicha Do you have an update or a PR?
@pineking I'm sorry, but I'm too busy with my project. Based on my experience, it shouldn't be hard to implement: following the pattern of the metrics that are already implemented was enough.
I would like metrics for NVIDIA GPUs used by processes that are not in a Docker container. Right now we have GPU services running in the host Ubuntu OS. I really just want GPU load, for hardware forecasting and for viewing system usage. Maybe this comment belongs in a new issue.
@dfredell If you only want machine-level GPU metrics, the "correct" way is to write a GPU Prometheus exporter. It should be pretty easy to build using https://github.com/mindprince/gonvml but I haven't gotten around to doing it yet.
@mindprince Thanks. After more research I learned that cAdvisor is designed more for monitoring Docker containers. I found https://github.com/tankbusta/nvidia_exporter and https://github.com/prometheus/node_exporter, which better fit my use case and needs.
The underlying library provides PowerUsage metrics. I'd like to collect these and expose them in a similar fashion to the container metrics for GPUs. Would a PR be accepted to add metrics for PowerUsage? GPU temperature is also related, but I'd have to add that to the gonvml library first.
cc @mindprince @dashpole