openrca / orca

Root Cause Analysis for Kubernetes
https://openrca.io
Apache License 2.0

Add alert probe for Zabbix #9

Open bzurkowski opened 4 years ago

bzurkowski commented 4 years ago

Zabbix is an open-source monitoring tool for diverse IT components, including networks, servers, virtual machines, and cloud services. It could be evaluated as a potential source of relevant telemetry data, namely triggers (alerts).

Prior to attempting implementation, we should collect some initial facts about Zabbix operation:

  1. Install Zabbix in a testbed.
  2. Analyze what infrastructure elements are monitored.
  3. Analyze what items (metrics) are collected.
  4. Configure sample triggers.
  5. Simulate scenarios in which triggers fire (e.g. disk overloaded, CPU overcommitted).
  6. Determine how to obtain trigger information (via API or webhook?).
  7. Collect sample trigger payloads and store them in the integrations directory.
  8. Analyze how (and if) we could map triggers to elements present in the entity graph based on the labeling provided in the trigger payload.

After collecting the above information, we will decide whether Zabbix can be integrated into OpenRCA and what elements must be taken into account in the development effort.

aleksandra-galara commented 4 years ago

Hi, so far I've installed Zabbix 4.4 in Kubernetes and collected sample trigger payloads (using the PyZabbix library to communicate with the API).
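For reference, the collection step could look roughly like the sketch below. It is only an illustration: the `fetch_triggers` and `connect` helpers, the testbed URL, and the credentials are hypothetical, and the only Zabbix-specific parts are the `trigger.get` method with the `output`, `selectHosts`, and `filter` parameters. As far as I know, PyZabbix returns just the `result` list rather than the full JSON-RPC envelope.

```python
def fetch_triggers(zapi, only_problems=False):
    """Return triggers with full trigger and host details.

    `zapi` is any object exposing `trigger.get(**params)`, for example a
    logged-in `pyzabbix.ZabbixAPI` instance.
    """
    params = {"output": "extend", "selectHosts": "extend"}
    if only_problems:
        # Keep only triggers whose current value is 1 (PROBLEM state).
        params["filter"] = {"value": 1}
    return zapi.trigger.get(**params)


def connect(url, user, password):
    """Log in to the Zabbix API (requires `pip install pyzabbix`)."""
    # Imported lazily so fetch_triggers() has no hard dependency on PyZabbix.
    from pyzabbix import ZabbixAPI

    zapi = ZabbixAPI(url)
    zapi.login(user, password)
    return zapi


# Example usage against a testbed (hypothetical URL, default credentials):
#   zapi = connect("http://zabbix-web.example.local", "Admin", "zabbix")
#   triggers = fetch_triggers(zapi, only_problems=True)
```

Passing the client in as an argument keeps the query logic easy to exercise with a stub, without a running Zabbix server.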

Zabbix 4.4 ships out-of-the-box templates for operating systems (Linux, Windows), databases (MySQL, PostgreSQL, Redis), network devices (Cisco, Dell, HP, Juniper), and servers and apps (Apache, HAProxy, Nginx, RabbitMQ), with 432 collected items and 219 triggers. There is also a template for integrating Zabbix with Prometheus and collecting data from Node Exporter instead of Zabbix agents.

I think that, to begin with, we could focus on mapping the triggers from the Linux OS template:

Template Module Linux CPU by Zabbix agent:
- [ ] Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)
- [ ] High CPU utilization (over {$CPU.UTIL.CRIT}% for 5m)

Template Module Linux filesystems by Zabbix agent:
- [ ] {#FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}%)
- [ ] {#FSNAME}: Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"}%)
- [ ] {#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.CRIT:"{#FSNAME}"}%)
- [ ] {#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.WARN:"{#FSNAME}"}%)

Template Module Linux memory by Zabbix agent:
- [ ] High memory utilization ( >{$MEMORY.UTIL.MAX}% for 5m)
- [ ] Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})
- [ ] High swap space usage ( less than {$SWAP.PFREE.MIN.WARN}% free)   

Template Module Linux block devices by Zabbix agent:
- [ ] {#DEVNAME}: Disk read/write request responses are too high (read > {$VFS.DEV.READ.AWAIT.WARN:"{#DEVNAME}"} ms for 15m or write > {$VFS.DEV.WRITE.AWAIT.WARN:"{#DEVNAME}"} ms for 15m)

Template Module Linux network interfaces by Zabbix agent:
- [ ] Interface {#IFNAME}: High error rate ( > {$IF.ERRORS.WARN:"{#IFNAME}"} for 5m)
- [ ] Interface {#IFNAME}: Link down
- [ ] Interface {#IFNAME}: Ethernet has changed to lower speed than it was before

Template Module Linux generic by Zabbix agent:
- [ ] System time is out of sync (diff with Zabbix server > {$SYSTEM.FUZZYTIME.MAX}s)
- [ ] System name has changed (new name: {ITEM.VALUE})  
- [ ] Configured max number of open filedescriptors is too low (< {$KERNEL.MAXFILES.MIN})   
- [ ] Configured max number of processes is too low (< {$KERNEL.MAXPROC.MIN})   
- [ ] Getting closer to process limit (over 80% used)   
- [ ] Operating system description has changed  
- [ ] /etc/passwd has been changed  
- [ ] {HOST.NAME} has been restarted (uptime < 10m)

And maybe the MySQL or Docker template (which came in release 5.0) could be useful for us?
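The trigger names listed above mix three kinds of placeholders: user macros such as `{$CPU.UTIL.CRIT}` (optionally with a quoted context), low-level discovery macros such as `{#FSNAME}`, and built-in macros such as `{HOST.NAME}`. A rough way to tell them apart is sketched below. The regular expressions and the `find_macros` helper are my own assumptions and only cover the forms that appear in this list, not the full Zabbix macro syntax.

```python
import re

# User macro, optionally with a quoted context: {$NAME} or {$NAME:"context"}
USER_MACRO = re.compile(r'\{\$([A-Z0-9_.]+)(?::"([^"]*)")?\}')
# Low-level discovery (LLD) macro: {#NAME}
LLD_MACRO = re.compile(r'\{#([A-Z0-9_.]+)\}')
# Built-in macro: {HOST.NAME}, {ITEM.VALUE2}, ...
BUILTIN_MACRO = re.compile(r'\{([A-Z][A-Z0-9_.]*)\}')


def find_macros(trigger_name):
    """Return the distinct macro names found in a trigger name, by kind."""
    return {
        "user": sorted({m.group(1) for m in USER_MACRO.finditer(trigger_name)}),
        "lld": sorted({m.group(1) for m in LLD_MACRO.finditer(trigger_name)}),
        "builtin": sorted({m.group(1) for m in BUILTIN_MACRO.finditer(trigger_name)}),
    }


names = [
    '{#FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}%)',
    'Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})',
    '{HOST.NAME} has been restarted (uptime < 10m)',
]
for name in names:
    print(find_macros(name))
```

A trigger with an LLD macro is instantiated once per discovered filesystem, device, or interface, so those triggers would probably need a finer-grained mapping than the host-level ones.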

aleksandra-galara commented 4 years ago

Here are sample trigger payloads obtained with the params "output" and "selectHosts" set to "extend":

{
    "id": 1,
    "jsonrpc": "2.0",
    "result": [
        {
            "comments": "For passive only agents, host availability is used with {$AGENT.TIMEOUT} as time threshold.",
            "correlation_mode": "0",
            "correlation_tag": "",
            "description": "Zabbix agent is not available (for 3m)",
            "error": "",
            "expression": "{19665}=0",
            "flags": "0",
            "hosts": [
                {
                    "auto_compress": "1",
                    "available": "2",
                    "description": "",
                    "disable_until": "1589282335",
                    "error": "Get value from agent failed: cannot connect to [[10.233.92.15]:10050]: [4] Interrupted system call",
                    "errors_from": "1589189044",
                    "flags": "0",
                    "host": "zabbix-agent-cw9bn",
                    "hostid": "10318",
                    "inventory_mode": "-1",
                    "ipmi_authtype": "-1",
                    "ipmi_available": "0",
                    "ipmi_disable_until": "0",
                    "ipmi_error": "",
                    "ipmi_errors_from": "0",
                    "ipmi_password": "",
                    "ipmi_privilege": "2",
                    "ipmi_username": "",
                    "jmx_available": "0",
                    "jmx_disable_until": "0",
                    "jmx_error": "",
                    "jmx_errors_from": "0",
                    "lastaccess": "0",
                    "maintenance_from": "0",
                    "maintenance_status": "0",
                    "maintenance_type": "0",
                    "maintenanceid": "0",
                    "name": "zabbix-agent-cw9bn",
                    "proxy_address": "",
                    "proxy_hostid": "0",
                    "snmp_available": "0",
                    "snmp_disable_until": "0",
                    "snmp_error": "",
                    "snmp_errors_from": "0",
                    "status": "0",
                    "templateid": "0",
                    "tls_accept": "1",
                    "tls_connect": "1",
                    "tls_issuer": "",
                    "tls_psk": "",
                    "tls_psk_identity": "",
                    "tls_subject": ""
                }
            ],
            "lastchange": "1589189295",
            "manual_close": "1",
            "opdata": "",
            "priority": "3",
            "recovery_expression": "",
            "recovery_mode": "0",
            "state": "0",
            "status": "0",
            "templateid": "16198",
            "triggerid": "16817",
            "type": "0",
            "url": "",
            "value": "1"
        },
        {
            "comments": "Per CPU load average is too high. Your system may be slow to respond.",
            "correlation_mode": "0",
            "correlation_tag": "",
            "description": "Load average is too high (per CPU load over 1.5 for 5m)",
            "error": "",
            "expression": "{19277}/{19278}>{$LOAD_AVG_PER_CPU.MAX.WARN}\r\nand {19279}>0\r\nand {19280}>0",
            "flags": "0",
            "hosts": [
                {
                    "auto_compress": "1",
                    "available": "1",
                    "description": "",
                    "disable_until": "0",
                    "error": "",
                    "errors_from": "0",
                    "flags": "0",
                    "host": "zabbix-agent-96x6t",
                    "hostid": "10084",
                    "inventory_mode": "-1",
                    "ipmi_authtype": "-1",
                    "ipmi_available": "0",
                    "ipmi_disable_until": "0",
                    "ipmi_error": "",
                    "ipmi_errors_from": "0",
                    "ipmi_password": "",
                    "ipmi_privilege": "2",
                    "ipmi_username": "",
                    "jmx_available": "0",
                    "jmx_disable_until": "0",
                    "jmx_error": "",
                    "jmx_errors_from": "0",
                    "lastaccess": "0",
                    "maintenance_from": "0",
                    "maintenance_status": "0",
                    "maintenance_type": "0",
                    "maintenanceid": "0",
                    "name": "zabbix-agent-96x6t",
                    "proxy_address": "",
                    "proxy_hostid": "0",
                    "snmp_available": "0",
                    "snmp_disable_until": "0",
                    "snmp_error": "",
                    "snmp_errors_from": "0",
                    "status": "0",
                    "templateid": "0",
                    "tls_accept": "1",
                    "tls_connect": "1",
                    "tls_issuer": "",
                    "tls_psk": "",
                    "tls_psk_identity": "",
                    "tls_subject": ""
                }
            ],
            "lastchange": "1589282230",
            "manual_close": "0",
            "opdata": "Load averages(1m 5m 15m): ({ITEM.LASTVALUE1} {ITEM.LASTVALUE3} {ITEM.LASTVALUE4}), # of CPUs: {ITEM.LASTVALUE2}",
            "priority": "3",
            "recovery_expression": "",
            "recovery_mode": "0",
            "state": "0",
            "status": "0",
            "templateid": "16564",
            "triggerid": "16565",
            "type": "0",
            "url": "",
            "value": "1"
        }
    ]
}
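As a first step toward mapping these payloads onto the entity graph, the response could be flattened into one record per trigger and host. The sketch below is only an assumption about what such a record might contain: the `to_alerts` helper and the output field names are hypothetical, while the severity names follow the Zabbix `priority` values (0 to 5) and `value` "1" means the trigger is in the PROBLEM state.

```python
from datetime import datetime, timezone

# Severity names for the trigger `priority` property.
SEVERITY = {
    "0": "not classified",
    "1": "information",
    "2": "warning",
    "3": "average",
    "4": "high",
    "5": "disaster",
}


def to_alerts(response):
    """Flatten a trigger.get response into one record per (trigger, host)."""
    alerts = []
    for trigger in response.get("result", []):
        # `lastchange` is a Unix timestamp serialized as a string.
        since = datetime.fromtimestamp(int(trigger["lastchange"]), tz=timezone.utc)
        for host in trigger.get("hosts", []):
            alerts.append({
                "trigger_id": trigger["triggerid"],
                "name": trigger["description"],
                "severity": SEVERITY.get(trigger["priority"], "unknown"),
                "firing": trigger["value"] == "1",
                "host": host["host"],
                "since": since.isoformat(),
            })
    return alerts


# Trimmed copy of the first trigger above (only the fields used here).
sample = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": [
        {
            "triggerid": "16817",
            "description": "Zabbix agent is not available (for 3m)",
            "priority": "3",
            "value": "1",
            "lastchange": "1589189295",
            "hosts": [{"hostid": "10318", "host": "zabbix-agent-cw9bn"}],
        }
    ],
}

for alert in to_alerts(sample):
    print(alert)
```

The `host` field is the only label in the payload that looks like something present in the entity graph (here it resembles a pod name), so it is the most likely join key, though that still needs to be confirmed.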