plambe / zabbix-nvidia-smi-multi-gpu

A zabbix template using nvidia-smi. Works with multiple GPUs on Windows and Linux.

Issue with a 16 GPUs server. #9

Closed by metabsd 5 years ago

metabsd commented 6 years ago

[screenshot]

metabsd commented 6 years ago

Can you help me fix this, please? :)

plambe commented 6 years ago

I was on vacation :) which is why I only just saw this issue.

Anyway, I can't tell what's wrong just from looking at your screenshot.

Have you already fixed it? If not, can you send me the local logs from the zabbix-agent? You can get them by editing the agent's conf file to raise the log verbosity and set the log file path.

Do you only have the issue with the fan speeds?
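
For reference, the verbosity and log path mentioned above are controlled by the `DebugLevel` and `LogFile` parameters of the agent config (`DebugLevel=4` enables debug output; the agent has to be restarted to pick up a change). Below is a minimal sketch for checking what is currently set. The function name `show_agent_log_settings` is hypothetical, and the default path is an assumption that may differ on your system.

```shell
#!/bin/sh
# Hypothetical helper: print the active DebugLevel and LogFile settings
# from a Zabbix agent config file. Commented-out lines are ignored.
show_agent_log_settings() {
    # Assumption: a common Linux package default; pass another path if needed.
    conf="${1:-/etc/zabbix/zabbix_agentd.conf}"
    grep -E '^[[:space:]]*(DebugLevel|LogFile)[[:space:]]*=' "$conf"
}

# Usage:
# show_agent_log_settings
# show_agent_log_settings /path/to/zabbix_agentd.conf
```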

RichardKav commented 6 years ago

If the issue still exists, it might be worth looking at the raw output of nvidia-smi, e.g.:

nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits -i 0

I suspect the "[Not Supported]" is the output from nvidia-smi and it's causing a parse error.
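
If that turns out to be the case, one possible workaround is to filter the value before Zabbix sees it. This is only a sketch: `normalize_gpu_value` is a hypothetical helper, not part of the template, and using -1 to mean "no reading" is an assumption you may want to change.

```shell
#!/bin/sh
# Hypothetical helper (not part of the template): pass numeric nvidia-smi
# values through unchanged and replace anything else, such as
# "[Not Supported]", with a placeholder that a numeric item can parse.
normalize_gpu_value() {
    placeholder="${1:--1}"   # assumption: -1 stands for "no reading"
    while IFS= read -r line; do
        case "$line" in
            # Empty line, or any character that is not a digit or a dot.
            ''|*[!0-9.]*) printf '%s\n' "$placeholder" ;;
            *)            printf '%s\n' "$line" ;;
        esac
    done
}

# Usage on a host with nvidia-smi installed:
# nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits -i 0 | normalize_gpu_value
```

Whether a placeholder or an unsupported item is the better outcome depends on how the triggers in your setup are written.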

metabsd commented 6 years ago

Hello, welcome back from vacation!

This is not a real problem but rather a misunderstanding on my part.

There is no fan on this type of GPU :)

root@hostname:~# nvidia-smi --query-gpu=fan.speed --format=csv,noheader,nounits -i 0
[Not Supported]

Case closed.

On another subject:

We added the following line to userparameter_nvidia-smi.conf to get a metric with the average utilization of all GPUs on a server:

UserParameter=gpu.avg,nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | /opt/local/bin/jq -s add/length | tr -d "\n"
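
For hosts where jq is not installed, roughly the same average could be computed with awk instead. This is a sketch that swaps jq for awk, not the line we actually run. `gpu_avg` is a hypothetical name, and skipping non-numeric lines such as "[Not Supported]" and printing nothing on empty input are assumptions about the desired behaviour.

```shell
#!/bin/sh
# Hypothetical helper: average the numeric values on stdin (one per line),
# skipping non-numeric lines such as "[Not Supported]".
# Prints the result without a trailing newline, or nothing if no
# numeric value was seen.
gpu_avg() {
    awk '
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ { sum += $1; n++ }
        END { if (n > 0) printf "%g", sum / n }
    '
}

# Usage on a host with nvidia-smi installed:
# nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits | gpu_avg
```

If this is wired into a UserParameter, saving it as a small script and calling that script is probably simpler than inlining the awk program.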

Have a nice day!

RichardKav commented 6 years ago

One quick note on averaging utilisation across all GPUs in a server: given how the underlying metric is defined, the average might not be very useful, depending on your purpose. At the very least, it might not always react as expected. nvidia-smi's definition of utilisation is:

unsigned int gpu - Percent of time over the past second during which one or more kernels was executing on the GPU.

In practice, this means the reading for a GPU tends to be either 0% or 100% (with occasional values in between during transitions), and that a workload using one core of a GPU reports the same utilisation as one using all of them.

You can find this definition in the manual.

metabsd commented 5 years ago

Thx!