Add machine level metrics for NVIDIA GPUs

google / cadvisor

Analyzes resource usage and performance characteristics of running containers.

Add machine level metrics for NVIDIA GPUs #1842

Open cmluciano opened 6 years ago

cmluciano commented 6 years ago

The underlying library provides PowerUsage metrics. I'd like to collect these and expose them in a similar fashion as the container metrics for GPUs. Would a PR be accepted to add metrics for PowerUsage? Also related would be GPU temperature, and I'd have to add that first to the gonvml library.

cc @mindprince @dashpole

rohitagarwal003 commented 6 years ago

I added power usage in the first iteration of my initial PR, but we then decided to ship only the bare minimum metrics in the first version and wait for user feedback. NVML exposes a lot of metrics (power usage, temperature, fan speed, etc.), but it's not clear how helpful these are to users running GPU workloads.

While testing, I also noticed that the power usage graph was exactly the same as the duty_cycle graph.

TuranTimur commented 6 years ago

Any update on this topic?

cmluciano commented 6 years ago

@donghwicha Which additional metrics are you interested in?

TuranTimur commented 6 years ago

I've already implemented it myself. Will send a PR soon.

pineking commented 6 years ago

@donghwicha Do you have an update or a PR?

TuranTimur commented 6 years ago

@pineking I'm sorry, but I'm too busy with my project. Based on my experience, it shouldn't be hard to implement: following the pattern of the metrics that are already implemented was enough.

dfredell commented 6 years ago

I would like metrics for NVIDIA GPUs for processes that are not in a Docker container. Right now we have GPU services running in the host Ubuntu OS. I really just want GPU load, for hardware forecasting and for viewing system usage. Maybe this comment belongs in a new issue.

rohitagarwal003 commented 6 years ago

@dfredell If you only want machine-level GPU metrics, the "correct" way is to write a GPU Prometheus exporter. It should be pretty easy to build using https://github.com/mindprince/gonvml but I haven't gotten around to doing it yet.

dfredell commented 6 years ago

@mindprince Thanks. After more research I learned that cAdvisor is designed more for monitoring Docker containers. I found https://github.com/tankbusta/nvidia_exporter and https://github.com/prometheus/node_exporter, which better fit my use case and needs.