Open bzurkowski opened 4 years ago
Hi, so far I've installed Zabbix 4.4 in Kubernetes and collected sample trigger payloads (using the PyZabbix library to communicate with the API).
Zabbix 4.4 provides out-of-the-box templates for operating systems (Linux, Windows), databases (MySQL, PostgreSQL, Redis), network devices (Cisco, Dell, HP, Juniper), and servers and apps (Apache, HAProxy, Nginx, RabbitMQ), with 432 collected items and 219 triggers. There is also a template for integrating Zabbix with Prometheus and collecting data from Node Exporter instead of Zabbix agents.
I think that, to begin with, we could focus on mapping the triggers from the Linux OS template:
Template Module Linux CPU by Zabbix agent:
- [ ] Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)
- [ ] High CPU utilization (over {$CPU.UTIL.CRIT}% for 5m)
Template Module Linux filesystems by Zabbix agent:
- [ ] {#FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}%)
- [ ] {#FSNAME}: Disk space is low (used > {$VFS.FS.PUSED.MAX.WARN:"{#FSNAME}"}%)
- [ ] {#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.CRIT:"{#FSNAME}"}%)
- [ ] {#FSNAME}: Running out of free inodes (free < {$VFS.FS.INODE.PFREE.MIN.WARN:"{#FSNAME}"}%)
Template Module Linux memory by Zabbix agent:
- [ ] High memory utilization ( >{$MEMORY.UTIL.MAX}% for 5m)
- [ ] Lack of available memory ( < {$MEMORY.AVAILABLE.MIN} of {ITEM.VALUE2})
- [ ] High swap space usage ( less than {$SWAP.PFREE.MIN.WARN}% free)
Template Module Linux block devices by Zabbix agent:
- [ ] {#DEVNAME}: Disk read/write request responses are too high (read > {$VFS.DEV.READ.AWAIT.WARN:"{#DEVNAME}"} ms for 15m or write > {$VFS.DEV.WRITE.AWAIT.WARN:"{#DEVNAME}"} ms for 15m)
Template Module Linux network interfaces by Zabbix agent:
- [ ] Interface {#IFNAME}: High error rate ( > {$IF.ERRORS.WARN:"{#IFNAME}"} for 5m)
- [ ] Interface {#IFNAME}: Link down
- [ ] Interface {#IFNAME}: Ethernet has changed to lower speed than it was before
Template Module Linux generic by Zabbix agent:
- [ ] System time is out of sync (diff with Zabbix server > {$SYSTEM.FUZZYTIME.MAX}s)
- [ ] System name has changed (new name: {ITEM.VALUE})
- [ ] Configured max number of open filedescriptors is too low (< {$KERNEL.MAXFILES.MIN})
- [ ] Configured max number of processes is too low (< {$KERNEL.MAXPROC.MIN})
- [ ] Getting closer to process limit (over 80% used)
- [ ] Operating system description has changed
- [ ] /etc/passwd has been changed
- [ ] {HOST.NAME} has been restarted (uptime < 10m)
Maybe the MySQL or Docker template (which came in release 5.0) could also be useful for us?
Here are sample trigger payloads obtained with the params "output" and "selectHosts" set to "extend":
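The trigger names listed above mix three macro styles: user macros (`{$...}`, optionally with a quoted context), low-level discovery macros (`{#...}`), and built-in macros such as `{HOST.NAME}`. One possible way to map an expanded trigger description back to its template name is to turn each template name into a regular expression. The following is only a rough sketch: `MACRO` and `name_to_pattern` are hypothetical names, the macro grammar is simplified to what appears in the list above, and the `/var` and `90` values are made up for illustration.

```python
import re

# Simplified macro grammar covering only the forms seen in the trigger list.
MACRO = re.compile(
    r'\{\$[A-Z0-9_.]+(?::"[^"]*")?\}'  # user macro, optional quoted context
    r'|\{#[A-Z0-9_.]+\}'               # low-level discovery macro
    r'|\{[A-Z]+(?:\.[A-Z0-9]+)*\}'     # built-in macro such as {HOST.NAME}
)


def name_to_pattern(template_name):
    """Compile a regex that matches expanded forms of a template trigger name.

    Literal text is escaped; every macro becomes a non-greedy capture group.
    """
    parts = []
    pos = 0
    for match in MACRO.finditer(template_name):
        parts.append(re.escape(template_name[pos:match.start()]))
        parts.append("(.+?)")
        pos = match.end()
    parts.append(re.escape(template_name[pos:]))
    return re.compile("^" + "".join(parts) + "$")


load_pattern = name_to_pattern(
    "Load average is too high (per CPU load over {$LOAD_AVG_PER_CPU.MAX.WARN} for 5m)"
)
disk_pattern = name_to_pattern(
    '{#FSNAME}: Disk space is critically low (used > {$VFS.FS.PUSED.MAX.CRIT:"{#FSNAME}"}%)'
)

# Expanded description taken from a collected trigger payload.
load_match = load_pattern.match("Load average is too high (per CPU load over 1.5 for 5m)")
# Hypothetical expanded description; the filesystem and threshold are invented.
disk_match = disk_pattern.match("/var: Disk space is critically low (used > 90%)")
```

The capture groups then carry the expanded macro values (threshold, filesystem name), which could be kept as alert attributes.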
{
"id": 1,
"jsonrpc": "2.0",
"result": [
{
"comments": "For passive only agents, host availability is used with {$AGENT.TIMEOUT} as time threshold.",
"correlation_mode": "0",
"correlation_tag": "",
"description": "Zabbix agent is not available (for 3m)",
"error": "",
"expression": "{19665}=0",
"flags": "0",
"hosts": [
{
"auto_compress": "1",
"available": "2",
"description": "",
"disable_until": "1589282335",
"error": "Get value from agent failed: cannot connect to [[10.233.92.15]:10050]: [4] Interrupted system call",
"errors_from": "1589189044",
"flags": "0",
"host": "zabbix-agent-cw9bn",
"hostid": "10318",
"inventory_mode": "-1",
"ipmi_authtype": "-1",
"ipmi_available": "0",
"ipmi_disable_until": "0",
"ipmi_error": "",
"ipmi_errors_from": "0",
"ipmi_password": "",
"ipmi_privilege": "2",
"ipmi_username": "",
"jmx_available": "0",
"jmx_disable_until": "0",
"jmx_error": "",
"jmx_errors_from": "0",
"lastaccess": "0",
"maintenance_from": "0",
"maintenance_status": "0",
"maintenance_type": "0",
"maintenanceid": "0",
"name": "zabbix-agent-cw9bn",
"proxy_address": "",
"proxy_hostid": "0",
"snmp_available": "0",
"snmp_disable_until": "0",
"snmp_error": "",
"snmp_errors_from": "0",
"status": "0",
"templateid": "0",
"tls_accept": "1",
"tls_connect": "1",
"tls_issuer": "",
"tls_psk": "",
"tls_psk_identity": "",
"tls_subject": ""
}
],
"lastchange": "1589189295",
"manual_close": "1",
"opdata": "",
"priority": "3",
"recovery_expression": "",
"recovery_mode": "0",
"state": "0",
"status": "0",
"templateid": "16198",
"triggerid": "16817",
"type": "0",
"url": "",
"value": "1"
},
{
"comments": "Per CPU load average is too high. Your system may be slow to respond.",
"correlation_mode": "0",
"correlation_tag": "",
"description": "Load average is too high (per CPU load over 1.5 for 5m)",
"error": "",
"expression": "{19277}/{19278}>{$LOAD_AVG_PER_CPU.MAX.WARN}\r\nand {19279}>0\r\nand {19280}>0",
"flags": "0",
"hosts": [
{
"auto_compress": "1",
"available": "1",
"description": "",
"disable_until": "0",
"error": "",
"errors_from": "0",
"flags": "0",
"host": "zabbix-agent-96x6t",
"hostid": "10084",
"inventory_mode": "-1",
"ipmi_authtype": "-1",
"ipmi_available": "0",
"ipmi_disable_until": "0",
"ipmi_error": "",
"ipmi_errors_from": "0",
"ipmi_password": "",
"ipmi_privilege": "2",
"ipmi_username": "",
"jmx_available": "0",
"jmx_disable_until": "0",
"jmx_error": "",
"jmx_errors_from": "0",
"lastaccess": "0",
"maintenance_from": "0",
"maintenance_status": "0",
"maintenance_type": "0",
"maintenanceid": "0",
"name": "zabbix-agent-96x6t",
"proxy_address": "",
"proxy_hostid": "0",
"snmp_available": "0",
"snmp_disable_until": "0",
"snmp_error": "",
"snmp_errors_from": "0",
"status": "0",
"templateid": "0",
"tls_accept": "1",
"tls_connect": "1",
"tls_issuer": "",
"tls_psk": "",
"tls_psk_identity": "",
"tls_subject": ""
}
],
"lastchange": "1589282230",
"manual_close": "0",
"opdata": "Load averages(1m 5m 15m): ({ITEM.LASTVALUE1} {ITEM.LASTVALUE3} {ITEM.LASTVALUE4}), # of CPUs: {ITEM.LASTVALUE2}",
"priority": "3",
"recovery_expression": "",
"recovery_mode": "0",
"state": "0",
"status": "0",
"templateid": "16564",
"triggerid": "16565",
"type": "0",
"url": "",
"value": "1"
}
]
}
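To make the payload above easier to reason about, here is a minimal sketch of how such a response could be requested and reduced to the fields that matter for alert mapping. It uses only the standard library rather than PyZabbix; `build_trigger_get_request` and `summarize_triggers` are hypothetical helper names, the `auth` field in the request body is an assumption about how the token is passed in this API version, and the sample is trimmed to the fields the code reads.

```python
from datetime import datetime, timezone


def build_trigger_get_request(auth_token, request_id=1):
    """Build a JSON-RPC 2.0 body for trigger.get with the params used above."""
    return {
        "jsonrpc": "2.0",
        "method": "trigger.get",
        "params": {"output": "extend", "selectHosts": "extend"},
        "auth": auth_token,  # assumption: token is sent in the request body
        "id": request_id,
    }


def summarize_triggers(response):
    """Reduce a trigger.get response to a flat list of alert-like records."""
    alerts = []
    for trigger in response.get("result", []):
        alerts.append({
            "trigger_id": trigger["triggerid"],
            "name": trigger["description"],
            # The API returns numbers as strings, so convert explicitly.
            "priority": int(trigger["priority"]),
            "in_problem": trigger["value"] == "1",
            "last_change": datetime.fromtimestamp(
                int(trigger["lastchange"]), tz=timezone.utc
            ),
            "hosts": [host["host"] for host in trigger.get("hosts", [])],
        })
    return alerts


# Trimmed copy of the two triggers from the payload above.
SAMPLE_RESPONSE = {
    "id": 1,
    "jsonrpc": "2.0",
    "result": [
        {
            "description": "Zabbix agent is not available (for 3m)",
            "hosts": [{"host": "zabbix-agent-cw9bn", "hostid": "10318"}],
            "lastchange": "1589189295",
            "priority": "3",
            "triggerid": "16817",
            "value": "1",
        },
        {
            "description": "Load average is too high (per CPU load over 1.5 for 5m)",
            "hosts": [{"host": "zabbix-agent-96x6t", "hostid": "10084"}],
            "lastchange": "1589282230",
            "priority": "3",
            "triggerid": "16565",
            "value": "1",
        },
    ],
}

request_body = build_trigger_get_request("hypothetical-token")
alerts = summarize_triggers(SAMPLE_RESPONSE)
for alert in alerts:
    print(alert["trigger_id"], alert["hosts"], alert["name"])
```

All numeric fields arrive as strings in the payload, so any mapping layer would probably need explicit conversions like the ones shown here.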
Zabbix is an open-source monitoring tool for diverse IT components, including networks, servers, virtual machines, and cloud services. It could be evaluated as a potential source of relevant telemetry data: triggers (alerts).
Prior to attempting implementation, we should collect some initial facts about Zabbix operation:
After collecting the above information, we will decide whether Zabbix can be integrated into OpenRCA and what elements must be taken into account in the development effort.