giDai7ja / zabbix_nvidia_multi_gpu

Zabbix template for monitoring multiple Nvidia GPUs
GNU General Public License v3.0
2 stars 1 forks source link

zbx_NVidia_GPUs.yaml for the zabbix 6.0 version #1

Open chenjiuhai opened 1 year ago

chenjiuhai commented 1 year ago

Updated:

  1. add the tags: Application and value Nvidia.
  2. update the type 'item' to 'ITEM' for zabbix server 6.0 version.
zabbix_export:
  version: '6.0'
  date: '2023-07-04T19:56:50Z'
  groups:
    - uuid: e960332b3f6c46a1956486d4f3f99fce
      name: 'Templates/Server hardware'
  templates:
    - uuid: 0d0a24bfd8c046038be50e02b0203765
      template: 'Nvidia Multi-GPU'
      name: 'Nvidia Multi-GPU'
      groups:
        - name: 'Templates/Server hardware'
      items:
        - uuid: a6099bd5ed6a49b8a0b8d76064931161
          name: 'GPU Count'
          type: DEPENDENT
          key: gpu.count
          delay: '0'
          history: 30d
          trends: 90d
          description: 'Number of GPUs detected'
          preprocessing:
            - type: JSONPATH
              parameters:
                - '$[*].1.length()'
          master_item:
            key: gpu.data
          tags:
            -
              tag: Application
              value: Nvidia
        - uuid: 650b6faad048440da328b100a72f496c
          name: 'GPU Data'
          key: gpu.data
          history: 1h
          trends: '0'
          value_type: TEXT
          description: 'Data collection by GPUs'
          preprocessing:
            - type: CSV_TO_JSON
              parameters:
                - ','
                - '"'
                - '0'
          tags:
            -
              tag: Application
              value: Nvidia          
        - uuid: 8d3fe49bb34048afa8ed79502db34cf1
          name: 'GPU Driver Version'
          type: DEPENDENT
          key: gpu.driver_version
          delay: '0'
          history: 30d
          trends: '0'
          value_type: TEXT
          description: 'GPU driver version'
          inventory_link: SOFTWARE
          preprocessing:
            - type: JSONPATH
              parameters:
                - '$[0].10'
          master_item:
            key: gpu.data
          triggers:
            - uuid: 226db5338016428ba86d2a8fa5f31da2
              expression: 'change(/Nvidia Multi-GPU/gpu.driver_version)<>0'
              name: 'Driver version changed'
              priority: INFO
              description: 'The driver version has changed'
          tags:
            -
              tag: Application
              value: Nvidia
        - uuid: 2342ff31e369484c90d63210c58eb77e
          name: 'GPU Power Total'
          type: DEPENDENT
          key: gpu.power_total
          delay: '0'
          history: 30d
          trends: 90d
          value_type: FLOAT
          units: W
          description: 'Power consumption of all GPUs'
          preprocessing:
            - type: STR_REPLACE
              parameters:
                - ' '
                - ''
            - type: JSONPATH
              parameters:
                - '$[*].8.sum()'
          master_item:
            key: gpu.data
          tags:
            -
              tag: Application
              value: Nvidia
        - uuid: 8401ed7ac3a249ea92515b41fd7cc68a
          name: 'GPUs Maximum Temperature'
          type: DEPENDENT
          key: gpu.temp_max
          delay: '0'
          history: 30d
          trends: 90d
          units: °C
          description: 'Temperature of the hottest GPU'
          preprocessing:
            - type: STR_REPLACE
              parameters:
                - ' '
                - ''
            - type: JSONPATH
              parameters:
                - '$[*].2.max()'
          master_item:
            key: gpu.data
          tags:
            -
              tag: Application
              value: Nvidia
        - uuid: 1d16ecd4f52b464a812dfc500a2a9f34
          name: 'GPU Utilization Total'
          type: DEPENDENT
          key: gpu.utilization_total
          delay: '0'
          history: 30d
          trends: 90d
          value_type: FLOAT
          units: '%'
          description: 'Total GPU utilisation'
          preprocessing:
            - type: STR_REPLACE
              parameters:
                - ' '
                - ''
            - type: JSONPATH
              parameters:
                - '$[*].7.avg()'
          master_item:
            key: gpu.data
          tags:
            -
              tag: Application
              value: Nvidia
      discovery_rules:
        - uuid: f9f116239737447b8ddfa2c558d8e583
          name: 'GPU Card'
          type: DEPENDENT
          key: gpu.id
          delay: '0'
          lifetime: 1d
          item_prototypes:
            - uuid: bcb70b06d9db454bae4fc0aa0d01f210
              name: 'GPU {#ID} Fan Speed -{#NAME}'
              type: DEPENDENT
              key: 'gpu.fan.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              value_type: FLOAT
              units: '%'
              description: 'GPU Fan Speed'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].6'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
            - uuid: a0601fa235114c8da7854df368dcbb91
              name: 'GPU {#ID} Memory Free -{#NAME}'
              type: DEPENDENT
              key: 'gpu.mfree.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              units: B
              description: 'Amount of free GPU memory'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].5'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
                - type: MULTIPLIER
                  parameters:
                    - '1048576'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
            - uuid: da856164b416405591e4eefacb8d081a
              name: 'GPU {#ID} Memory Total -{#NAME}'
              type: DEPENDENT
              key: 'gpu.mtotal.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              units: B
              description: 'GPU memory capacity'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].3'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
                - type: MULTIPLIER
                  parameters:
                    - '1048576'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
            - uuid: 7d2c35e7c6b64ffc8acd4e49bb4fff4c
              name: 'GPU {#ID} Memory Used -{#NAME}'
              type: DEPENDENT
              key: 'gpu.mused.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              units: B
              description: 'The amount of GPU memory used'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].4'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
                - type: MULTIPLIER
                  parameters:
                    - '1048576'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
            - uuid: 3667831f6c834e3f8b43f5ea86d2bb65
              name: 'GPU {#ID} Power -{#NAME}'
              type: DEPENDENT
              key: 'gpu.power.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              value_type: FLOAT
              units: W
              description: 'Power consumption of the GPU'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].8'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
            - uuid: 242afcbf5fb34b05bd0043c2cb703f2e
              name: 'GPU {#ID} Temperature -{#NAME}'
              type: DEPENDENT
              key: 'gpu.temperature.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              value_type: FLOAT
              units: °C
              description: 'GPU Temperature'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].2'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
              trigger_prototypes:
                - uuid: 5929b2f1ceba49b6ba6bfdd623850c7c
                  expression: 'last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>=80'
                  name: 'GPU {#ID} Temperature is extremely high'
                  priority: DISASTER
                  description: 'The temperature of the GPU is very high. Possibility of failure'
                - uuid: 9beb7dd56a644d63b48fd7604e57a233
                  expression: 'last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>=65'
                  name: 'GPU {#ID} Temperature is high'
                  priority: AVERAGE
                  description: 'Temperature of the graphics processor is high'
                  dependencies:
                    - name: 'GPU {#ID} Temperature is extremely high'
                      expression: 'last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>=80'
            - uuid: a234ea02aba3420a84e55ba62185572d
              name: 'GPU {#ID} Utilization -{#NAME}'
              type: DEPENDENT
              key: 'gpu.utilization.[{#ID}]'
              delay: '0'
              history: 30d
              trends: 90d
              units: '%'
              description: 'GPU utilisation'
              preprocessing:
                - type: JSONPATH
                  parameters:
                    - '$[?(@.1 == ''{#ID}'')].7'
                - type: JSONPATH
                  parameters:
                    - '$[0]'
              master_item:
                key: gpu.data
              tags:
                -
                  tag: Application
                  value: Nvidia
          trigger_prototypes:
            - uuid: 3bc18bab7d744b9d83de72130289562a
              expression: 'last(/Nvidia Multi-GPU/gpu.fan.[{#ID}])=0 and last(/Nvidia Multi-GPU/gpu.temperature.[{#ID}])>60'
              name: 'Problem with the fan'
              priority: DISASTER
              description: 'Fan does not spin when GPU is hot'
          graph_prototypes:
            - uuid: 0e301af6d1ad469a83d8ff8d215e68bc
              name: 'GPU {#ID} Memory -{#NAME}'
              show_work_period: 'NO'
              show_triggers: 'NO'
              type: STACKED
              ymin_type_1: FIXED
              graph_items:
                - color: F63100
                  calc_fnc: MAX
                  item:
                    host: 'Nvidia Multi-GPU'
                    key: 'gpu.mused.[{#ID}]'
                - sortorder: '1'
                  color: 199C0D
                  calc_fnc: MIN
                  item:
                    host: 'Nvidia Multi-GPU'
                    key: 'gpu.mfree.[{#ID}]'
          master_item:
            key: gpu.data
          lld_macro_paths:
            - lld_macro: '{#ID}'
              path: $.1
            - lld_macro: '{#NAME}'
              path: $.9
      tags:
        - tag: Application
          value: Nvidia
      dashboards:
        - uuid: 04cb54bfd57042b98f3524f8d421749f
          name: 'GPU Panel'
          pages:
            - name: GPU
              widgets:
                - type: ITEM
                  x: '6'
                  width: '4'
                  hide_header: 'YES'
                  fields:
                    - type: INTEGER
                      name: show
                      value: '1'
                    - type: INTEGER
                      name: show
                      value: '2'
                    - type: INTEGER
                      name: show
                      value: '4'
                    - type: INTEGER
                      name: adv_conf
                      value: '1'
                    - type: INTEGER
                      name: decimal_places
                      value: '0'
                    - type: INTEGER
                      name: value_size
                      value: '60'
                    - type: INTEGER
                      name: units_size
                      value: '60'
                    - type: STRING
                      name: thresholds.color.0
                      value: FFF59D
                    - type: STRING
                      name: thresholds.threshold.0
                      value: '50'
                    - type: STRING
                      name: thresholds.color.1
                      value: FFA726
                    - type: STRING
                      name: thresholds.threshold.1
                      value: '75'
                    - type: STRING
                      name: thresholds.color.2
                      value: D32F2F
                    - type: STRING
                      name: thresholds.threshold.2
                      value: '90'
                    - type: ITEM
                      name: itemid
                      value:
                        host: 'Nvidia Multi-GPU'
                        key: gpu.utilization_total
                - type: ITEM
                  x: '10'
                  width: '5'
                  hide_header: 'YES'
                  fields:
                    - type: INTEGER
                      name: show
                      value: '1'
                    - type: INTEGER
                      name: show
                      value: '2'
                    - type: INTEGER
                      name: show
                      value: '4'
                    - type: INTEGER
                      name: adv_conf
                      value: '1'
                    - type: INTEGER
                      name: decimal_places
                      value: '0'
                    - type: INTEGER
                      name: value_size
                      value: '60'
                    - type: INTEGER
                      name: units_size
                      value: '60'
                    - type: STRING
                      name: up_color
                      value: FF0000
                    - type: STRING
                      name: down_color
                      value: 00FF00
                    - type: STRING
                      name: updown_color
                      value: FFFF00
                    - type: STRING
                      name: description
                      value: 'The hottest GPU'
                    - type: STRING
                      name: thresholds.color.0
                      value: FFF176
                    - type: STRING
                      name: thresholds.threshold.0
                      value: '60'
                    - type: STRING
                      name: thresholds.color.1
                      value: FFD54F
                    - type: STRING
                      name: thresholds.threshold.1
                      value: '65'
                    - type: STRING
                      name: thresholds.color.2
                      value: FFA726
                    - type: STRING
                      name: thresholds.threshold.2
                      value: '70'
                    - type: STRING
                      name: thresholds.color.3
                      value: FF7043
                    - type: STRING
                      name: thresholds.threshold.3
                      value: '75'
                    - type: STRING
                      name: thresholds.color.4
                      value: F44336
                    - type: STRING
                      name: thresholds.threshold.4
                      value: '80'
                    - type: STRING
                      name: thresholds.color.5
                      value: FF0000
                    - type: STRING
                      name: thresholds.threshold.5
                      value: '85'
                    - type: ITEM
                      name: itemid
                      value:
                        host: 'Nvidia Multi-GPU'
                        key: gpu.temp_max
                - type: ITEM
                  name: 'Driver Version'
                  x: '18'
                  width: '6'
                  hide_header: 'YES'
                  fields:
                    - type: INTEGER
                      name: show
                      value: '1'
                    - type: INTEGER
                      name: show
                      value: '2'
                    - type: INTEGER
                      name: adv_conf
                      value: '1'
                    - type: ITEM
                      name: itemid
                      value:
                        host: 'Nvidia Multi-GPU'
                        key: gpu.driver_version
                - type: ITEM
                  name: 'GPU Count'
                  x: '15'
                  width: '3'
                  hide_header: 'YES'
                  fields:
                    - type: INTEGER
                      name: decimal_places
                      value: '0'
                    - type: INTEGER
                      name: units_size
                      value: '60'
                    - type: INTEGER
                      name: show
                      value: '1'
                    - type: INTEGER
                      name: show
                      value: '2'
                    - type: INTEGER
                      name: adv_conf
                      value: '1'
                    - type: INTEGER
                      name: value_size
                      value: '60'
                    - type: ITEM
                      name: itemid
                      value:
                        host: 'Nvidia Multi-GPU'
                        key: gpu.count
                - type: GRAPH_PROTOTYPE
                  name: 'GPU MEMORY'
                  x: '12'
                  'y': '2'
                  width: '12'
                  height: '27'
                  fields:
                    - type: INTEGER
                      name: rows
                      value: '9'
                    - type: INTEGER
                      name: show_legend
                      value: '0'
                    - type: INTEGER
                      name: columns
                      value: '1'
                    - type: GRAPH_PROTOTYPE
                      name: graphid
                      value:
                        host: 'Nvidia Multi-GPU'
                        name: 'GPU {#ID} Memory -{#NAME}'
                - type: GRAPH_PROTOTYPE
                  name: 'GPU Utilization'
                  'y': '2'
                  width: '12'
                  height: '27'
                  fields:
                    - type: INTEGER
                      name: source_type
                      value: '3'
                    - type: INTEGER
                      name: rows
                      value: '9'
                    - type: INTEGER
                      name: show_legend
                      value: '0'
                    - type: INTEGER
                      name: columns
                      value: '1'
                    - type: ITEM_PROTOTYPE
                      name: itemid
                      value:
                        host: 'Nvidia Multi-GPU'
                        key: 'gpu.utilization.[{#ID}]'
                - type: ITEM
                  name: 'Power Total'
                  width: '6'
                  hide_header: 'YES'
                  fields:
                    - type: INTEGER
                      name: show
                      value: '1'
                    - type: INTEGER
                      name: show
                      value: '2'
                    - type: INTEGER
                      name: adv_conf
                      value: '1'
                    - type: INTEGER
                      name: decimal_places
                      value: '0'
                    - type: INTEGER
                      name: show
                      value: '4'
                    - type: INTEGER
                      name: value_size
                      value: '60'
                    - type: INTEGER
                      name: units_size
                      value: '60'
                    - type: STRING
                      name: up_color
                      value: FF0000
                    - type: STRING
                      name: down_color
                      value: 00FF00
                    - type: STRING
                      name: updown_color
                      value: FFFF00
                    - type: ITEM
                      name: itemid
                      value:
                        host: 'Nvidia Multi-GPU'
                        key: gpu.power_total
giDai7ja commented 1 year ago

Thank you for your interest!

  1. The tag you suggest is assigned to the whole template, the elements inherit it
  2. The template is presented as it was exported. The word case has not been changed manually. I am using Zabbix version 6.4.4. ps I suggested this template to put in zabbix/community-templates and I plan to make further changes and fixes there. I plan to close this repository