plambe / zabbix-nvidia-smi-multi-gpu

A zabbix template using nvidia-smi. Works with multiple GPUs on Windows and Linux.

Unable to discover when swapped to Agent Active #17

Closed: bkgrant closed this issue 3 years ago

bkgrant commented 3 years ago

I added the template and the required scripts/config. I changed the type of all item prototypes to Zabbix agent (active) and am now getting this error in the Zabbix UI.

Invalid discovery rule value: cannot parse as a valid JSON object: invalid object format, expected opening character '{' or '[' at: 'The syntax of the command is incorrect. C:\windows\system32><!DOCTYPE html>'

Could be me just being silly though, who knows.

plambe commented 3 years ago

Hi, I see "DOCTYPE html" in your output, which makes me think you may have downloaded the file as a rendered web page instead of the raw code. That is, downloading this: https://github.com/plambe/zabbix-nvidia-smi-multi-gpu/blob/master/get_gpus_info.bat instead of this: https://raw.githubusercontent.com/plambe/zabbix-nvidia-smi-multi-gpu/master/get_gpus_info.bat
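
For example, something along these lines should grab the raw file directly (just a sketch, assuming curl is available; adjust the output path to wherever your scripts live):

curl -o get_gpus_info.bat https://raw.githubusercontent.com/plambe/zabbix-nvidia-smi-multi-gpu/master/get_gpus_info.bat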

bkgrant commented 3 years ago

You, sir @plambe, are one smart cookie. I am using Saltstack to deploy and didn't grab the raw content from your repo.

Thanks and sorry for the silly question.

Any thoughts on where to start debugging no data coming through for the Zabbix agent (active) items?

plambe commented 3 years ago

Lol, glad to hear that!

On the Zabbix agent thing - I'd start by enabling debug logging for the agent on the machine it's running on. Take a look at, for example: https://www.zabbix.com/documentation/current/manual/concepts/agent#agent_on_windows_systems
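
For reference, a minimal sketch of the relevant zabbix_agentd.conf lines (the log path is just an example, adjust it to your install, and restart the agent service afterwards):

# enable verbose agent logging while troubleshooting
LogFile=C:\Program Files\Zabbix Agent\zabbix_agentd.log
# 4 = debugging, 5 = extended debugging
DebugLevel=4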

plambe commented 3 years ago

Also maybe: https://www.zabbix.com/documentation/current/manual/appendix/config/zabbix_agentd_win

bkgrant commented 3 years ago

Cool, I've been trawling through the logs for a while with full debugging enabled. Nothing is logged for anything other than gpu.discovery or gpu.number. Example:

9924:20210827:164339.798 EXECUTE_STR() command:'"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -L | find /c /v ""' len:1 cmd_result:'1'
  9924:20210827:164339.798 for key [gpu.number] received value [1]
  9924:20210827:164339.798 In process_value() key:'{ourdnsformachine}:gpu.number' lastlogsize:null value:'1'
  9924:20210827:164339.799 buffer: new element 0
  9924:20210827:164339.799 End of process_value():SUCCEED
  9924:20210827:164339.799 In need_meta_update() key:gpu.number
  9924:20210827:164339.800 End of need_meta_update():FAIL

Not really sure if the active checks are even being sent to the agent for some reason.
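
Next I'll probably run the keys through the agent's test mode to at least confirm the UserParameters respond locally, something along these lines (paths guessed from a default install, adjust as needed):

"C:\Program Files\Zabbix Agent\zabbix_agentd.exe" -c "C:\Program Files\Zabbix Agent\zabbix_agentd.conf" -t gpu.discovery
"C:\Program Files\Zabbix Agent\zabbix_agentd.exe" -c "C:\Program Files\Zabbix Agent\zabbix_agentd.conf" -t "gpu.utilization[0]"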

plambe commented 3 years ago

"No logs on anything other than gpu.discovery or gpu.number."

This makes me think you're correct that the checks are not even sent to the agent. I'm out of ideas at the moment and it's Friday evening around here, so I'm going out. I'll think about it. Keep me updated.

@RichardKav has had solutions up their sleeve before.

One last thing - the template XML was updated very recently via a pull request from someone I don't know, and I'm currently unable to test it. Maybe you can try the previous version and report back: https://github.com/plambe/zabbix-nvidia-smi-multi-gpu/blob/e67e323b24376e1639d26406fe187b5be38163f9/zbx_nvidia-smi-multi-gpu.xml

RichardKav commented 3 years ago

Hi @plambe, @bkgrant I'll see what springs to mind.

From looking around, "End of need_meta_update():FAIL" doesn't seem important; it only means the agent has nothing new to send to the server.

I'd probably look further at the logs on both the server and the agent side, and hopefully get a little more info. Some errors occur when the agent is in a VM etc. and the agent and server can't resolve each other's IPs correctly.
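
On the server side, even a quick grep of the server log for the host name can show whether the server is attempting the checks at all (assuming a Linux server and the default log location, which may differ on your setup):

grep -i "<your-host-name>" /var/log/zabbix/zabbix_server.log | tail -n 50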

I'd also probably check to see what metrics are available for the agents in general. Maybe use a script to pull all the data out of the zabbix database for the agent. https://github.com/RichardKav/zabbix-data-collector.

It might also be worth going back to basics and checking that nvidia-smi is working as expected, e.g.:

nvidia-smi --query-gpu=timestamp,index,name,uuid,serial,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw,power.limit,fan.speed,temperature.gpu,compute_mode,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,gpu_operation_mode.current,pstate,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.sync_boost,clocks_throttle_reasons.unknown --format=csv,nounits -l 1

bkgrant commented 3 years ago

Seems to be an issue that fixed itself after 3 days. I sat down to start extracting logs and noticed that the agent was suddenly collecting data. I will see if anything was changed and update this, in an attempt to assist future strugglers. Strangely, though, it seems to have just taken a weird amount of time...

RichardKav commented 3 years ago

@bkgrant It's most likely that once the agent had failed due to the previous issue, it went into a waiting period before it tried again properly. Glad you managed to get the issue resolved.

plambe commented 3 years ago

@bkgrant, thanks for the update. @RichardKav, thanks for weighing in on yet another issue here. It's appreciated!

bkgrant commented 3 years ago

While I have you awesome people here: any thoughts on why gpu.utilization[0] and gpu.power[0] might be returning N/A? Nothing is jumping out at me in the logs.

plambe commented 3 years ago

Does nvidia-smi return the proper values if you invoke it manually? Copying from @RichardKav's comment:

nvidia-smi --query-gpu=timestamp,index,name,uuid,serial,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used,power.draw,power.limit,fan.speed,temperature.gpu,compute_mode,clocks.current.graphics,clocks.current.sm,clocks.current.memory,clocks.current.video,gpu_operation_mode.current,pstate,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.sync_boost,clocks_throttle_reasons.unknown --format=csv,nounits -l 1

bkgrant commented 3 years ago

It starts off by saying that the field "clocks_throttle_reasons.unknown" is not a valid field to query.

After removing that field, it returns the same N/A values that the Zabbix agent shows:

2021/08/31 11:37:40.983, 0, NVIDIA GeForce RTX 3080, GPU-8fe3a97d-51f6-a669-0d80-b4530b81b816, [N/A], [N/A], [N/A], 10240, 9029, 1211, [N/A], [N/A], 0, 54, Default, [N/A], [N/A], [N/A], [N/A], [N/A], P3, [N/A], [N/A], [N/A], [N/A], [N/A], [N/A]

RichardKav commented 3 years ago

Detailed power management isn't supported by all graphics cards.

The manual for Nvidia SMI states:

https://developer.download.nvidia.com/compute/DCGM/docs/nvidia-smi-367.38.pdf

"Power Management A flag that indicates whether power management is enabled. Either "Supported" or "N/A". Requires Inforom PWR object version 3.0 or higher or Kepler device."

See also: https://forums.developer.nvidia.com/t/nvidia-smi-390-48-power-management-object-update-or-enable/66498/4

To test run the command:

nvidia-smi --query-gpu=power.management --format=csv

bkgrant commented 3 years ago

Fair enough, thoughts on the utilization?

RichardKav commented 3 years ago

I suspect it's related to this:

https://forums.developer.nvidia.com/t/gpu-memory-usage-shows-n-a/169140/4
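
It might also be worth checking which driver version you're running, e.g.:

nvidia-smi --query-gpu=driver_version,name --format=csv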

bkgrant commented 3 years ago

Updated the NVIDIA drivers to the latest version and the issue is fixed.