This Zabbix template monitors multiple NVIDIA GPUs.
It uses a single user parameter, retrieves all values in one request, and requires no additional scripts.
The template was set up and tested on a server with nine NVIDIA graphics cards. Comments, suggestions, and help improving it are welcome.
Vladimir Eliseev
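
The single-request approach described above can be illustrated with a short sketch. It assumes the agent returns one CSV line per GPU, similar to what `nvidia-smi --query-gpu=index,power.draw,memory.total,memory.used,memory.free,utilization.gpu,temperature.gpu,fan.speed --format=csv,noheader,nounits` prints. The field order, the `FIELDS` names, the `parse_gpu_csv` helper, and the sample values are hypothetical and are not taken from the template itself.

```python
import csv
import io
import json

# Illustrative field names; they mirror the per-GPU item keys of this template.
FIELDS = ["id", "power", "mtotal", "mused", "mfree",
          "utilization", "temperature", "fan"]


def to_number(value):
    """Return a float, or None when the value is not numeric (e.g. "[N/A]")."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_gpu_csv(text):
    """Turn CSV lines (one per GPU) into a list of dicts keyed by FIELDS."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        record = {"id": row[0].strip()}
        for name, value in zip(FIELDS[1:], row[1:]):
            record[name] = to_number(value.strip())
        gpus.append(record)
    return gpus


# Made-up sample: two GPUs, the second one reports no fan reading.
SAMPLE = (
    "0, 120.5, 24576, 1024, 23552, 35, 62, 40\n"
    "1, 80.0, 24576, 512, 24064, 10, 48, [N/A]\n"
)
gpus = parse_gpu_csv(SAMPLE)
print(json.dumps(gpus, indent=2))
```

Collecting everything in one call means the GPUs are queried once per update interval, and every other item can be derived from that single result.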
There are no macro links in this template.
There are no template links in this template.
Name | Description | Type | Key and additional info |
---|---|---|---|
GPU Data | Data collection by GPUs | SNMP agent | gpu.data (Update: 1m) |

Common Items

Name | Description | Type | Key and additional info |
---|---|---|---|
GPU Count | Number of GPUs detected | Dependent items | gpu.count |
GPU Driver Version | GPU driver version | Dependent items | gpu.driver_version |
GPU Power Total | Power consumption of all GPUs | Dependent items | gpu.power_total |
GPUs Maximum Temperature | Temperature of the hottest GPU | Dependent items | gpu.temp_max |
GPU Utilization Total | Total GPU utilisation | Dependent items | gpu.utilization_total |
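
The arithmetic behind the common items can be sketched as follows. This is only an illustration in Python, assuming per-GPU readings are already available as dictionaries; the `summarize` helper and the sample values are hypothetical, and the template itself may derive these values differently (for example through item preprocessing). Total utilisation is left out because the source does not state how it is combined.

```python
def summarize(gpus):
    """Compute GPU count, summed power and hottest temperature."""
    powers = [g["power"] for g in gpus if g.get("power") is not None]
    temps = [g["temperature"] for g in gpus if g.get("temperature") is not None]
    return {
        "count": len(gpus),                          # cf. gpu.count
        "power_total": round(sum(powers), 2),        # cf. gpu.power_total
        "temp_max": max(temps) if temps else None,   # cf. gpu.temp_max
    }


# Made-up readings for two GPUs.
sample = [
    {"power": 120.5, "temperature": 62},
    {"power": 80.0, "temperature": 48},
]
summary = summarize(sample)
print(summary)
```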

Items for each GPU found

Name | Description | Type | Key and additional info |
---|---|---|---|
GPU Power | Power consumption of the GPU | Dependent items | gpu.power |
GPU Total Memory | GPU memory capacity | Dependent items | gpu.mtotal |
GPU Used Memory | The amount of GPU memory used | Dependent items | gpu.mused |
GPU Free Memory | Amount of free GPU memory | Dependent items | gpu.mfree |
GPU Utilisation | GPU utilisation | Dependent items | gpu.utilization |
GPU Temperature | GPU Temperature | Dependent items | gpu.temperature |
GPU Fan Speed | GPU Fan Speed | Dependent items | gpu.fan |
Name | Description | Expression | Priority |
---|---|---|---|
Driver version changed | The driver version has changed | change(/Nvidia Multi-GPU/gpu.driver_version)<>0 | Information |
GPU {#ID} Temperature is extremely high | The temperature of the GPU is very high. Possibility of failure | last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>=80 | Disaster |
GPU {#ID} Temperature is high | Temperature of the graphics processor is high | last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>=65 (Dependencies: GPU {#ID} Temperature is extremely high) | Average |
Problem with the fan | Fan does not spin when GPU is hot | last(/Nvidia Multi-GPU/gpu.fan.[{#ID}])=0 and last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>60 | Disaster |
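
To make the per-GPU thresholds easier to follow, the sketch below restates the three discovered triggers as plain Python. The `evaluate_triggers` function is hypothetical and only mirrors the expressions in the table; it is not how Zabbix evaluates triggers internally. The dependency of the "high" trigger on the "extremely high" trigger is modelled here with an `elif`.

```python
def evaluate_triggers(temperature, fan):
    """Return (name, priority) pairs for the triggers that would fire for one GPU."""
    fired = []
    if temperature >= 80:
        fired.append(("Temperature is extremely high", "Disaster"))
    elif temperature >= 65:
        # Suppressed while the "extremely high" trigger is active (dependency).
        fired.append(("Temperature is high", "Average"))
    if fan == 0 and temperature > 60:
        fired.append(("Problem with the fan", "Disaster"))
    return fired


# Made-up reading: a warm GPU whose fan is not spinning.
print(evaluate_triggers(70, 0))
```

With these thresholds, a GPU at 60 degrees with a stopped fan raises nothing, because the fan trigger requires the temperature to be strictly above 60.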