lesovsky / zabbix-extensions

Zabbix additional monitoring modules
BSD 3-Clause "New" or "Revised" License

Iostat parse issue in iostat.conf? #88

Closed gglybin closed 2 years ago

gglybin commented 2 years ago

Hello,

I've set up the iostat script, config, and template to monitor I/O utilization with Zabbix, but something looks wrong with the parse command defined in iostat.conf. The utilization field shows the wrong number (0.09 instead of 83.71):

# cat /tmp/iostat-cron.out | grep -i sdd
sdd               0.37     5.81 5843.25 3000.61 155660.68 60781.83    48.95    25.60    2.89    1.50    5.61   0.09  83.71

# grep -w sdd /tmp/iostat-cron.out | awk 'BEGIN {n=split("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm util", arr);}{print "{"}{for(i=1;i<=n;++i){printf("\t\"%s\":\"%.2f\"", arr[i], $i); if(i<=n){printf(",\n");}}}{print "\n}"}'
{
        "rrqm/s":"0.00",
        "wrqm/s":"0.37",
        "r/s":"5.81",
        "w/s":"5843.25",
        "rkB/s":"3000.61",
        "wkB/s":"155660.68",
        "avgrq-sz":"60781.83",
        "avgqu-sz":"48.95",
        "await":"25.60",
        "r_await":"2.89",
        "w_await":"1.50",
        "svctm":"5.61",
        "util":"0.09",

}

Can you please have a look? Maybe it's my mistake?

gglybin commented 2 years ago

# cat /var/lib/iostat/iostat-collect.sh
++++++++++++++
#!/bin/bash
# Description:  Script for iostat monitoring
# Author:       Epikhin Mikhail michael@nomanlab.org
# Revision 1:   Lesovsky A.V. lesovsky@gmail.com
# Revision 2:   Sherstuk M.Y. maxim.sherstuk@gmail.com

SECONDS=$2
TOFILE=$1
IOSTAT="/usr/bin/iostat"

# be portable regarding number format
LC_ALL=C ; export LC_ALL

[[ $# -lt 2 ]] && { echo "FATAL: some parameters not specified"; exit 1; }

DISK=$($IOSTAT -xyd "$SECONDS" 1 | awk 'BEGIN {check=0;} {if(check==1 && $1!=""){print $0}if($1~"^Device"){check=1}}' | tr '\n' '|')
echo "$DISK" | sed 's/|/\n/g' > "$TOFILE"
++++++++++++++

# cat /etc/zabbix/zabbix_agentd.d/iostat.conf
++++++++++++++
# Disk statistics via iostat (sysstat)
UserParameter=iostat.discovery, iostat -d | awk 'BEGIN {check=0;count=0;array[0]=0;} {if(check==1 && $1 != ""){array[count]=$1;count=count+1;}if($1~"^Device"){check=1;}} END {printf("{\n\t\"data\":[\n");for(i=0;i<count;++i){printf("\t\t{\n\t\t\t\"{#HARDDISK}\":\"%s\"}", array[i]); if(i+1<count){printf(",\n");}} printf("]}\n");}'
UserParameter=iostat.summary[*], grep -w $1 /tmp/iostat-cron.out | tail -n 1 | awk 'BEGIN {n=split("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm util", arr);}{print "{"}{for(i=1;i<=n;++i){printf("\t\"%s\":\"%.2f\"", arr[i], $i); if(i<n){printf(",\n");}}}{print "\n}"}'
++++++++++++++

gglybin commented 2 years ago

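It looks like the device name is the cause of my problem: it is the first field of each line, so every value is read one column too early. Here are two workarounds I've used.

Solution 1: Add a placeholder entry to the field list, as below, so that the last name lines up with the utilization value.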

# iostat -V
+++++
sysstat version 10.1.5
(C) Sebastien Godard (sysstat <at> orange.fr)
+++++

# iostat -xyd sdb
+++++
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     1.10    0.02    3.15     0.68    21.90    14.25     0.00    0.52    1.00    0.52   0.49   0.16
+++++

# cat /tmp/iostat-cron.out  | grep -i sdb
+++++
sdb               0.00     1.10    0.02    3.15     0.68    21.90    14.25     0.00    0.52    1.00    0.52   0.49   0.16
+++++

# grep -w sdb /tmp/iostat-cron.out | tail -n 1 | awk 'BEGIN {n=split("patch rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm util", arr);}{print "{"}{for(i=1;i<=n;++i){printf("\t\"%s\":\"%.2f\"", arr[i], $i); if(i<n){printf(",\n");}}}{print "\n}"}'
+++++
{
        "patch":"0.00",
        "rrqm/s":"0.00",
        "wrqm/s":"1.10",
        "r/s":"0.02",
        "w/s":"3.15",
        "rkB/s":"0.68",
        "wkB/s":"21.90",
        "avgrq-sz":"14.25",
        "avgqu-sz":"0.00",
        "await":"0.52",
        "r_await":"1.00",
        "w_await":"0.52",
        "svctm":"0.49",
        "util":"0.16"
}
+++++

Solution 2: Add a sed step, as below, to strip the device name from the line before awk sees it.

# grep -w sdb /tmp/iostat-cron.out | tail -n 1 | sed 's/[^ ]* //' | awk 'BEGIN {n=split("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm util", arr);}{print "{"}{for(i=1;i<=n;++i){printf("\t\"%s\":\"%.2f\"", arr[i], $i); if(i<n){printf(",\n");}}}{print "\n}"}'
{
        "rrqm/s":"0.00",
        "wrqm/s":"1.10",
        "r/s":"0.02",
        "w/s":"3.15",
        "rkB/s":"0.68",
        "wkB/s":"21.90",
        "avgrq-sz":"14.25",
        "avgqu-sz":"0.00",
        "await":"0.52",
        "r_await":"1.00",
        "w_await":"0.52",
        "svctm":"0.49",
        "util":"0.16"
}

I'm not a specialist in awk/sed; I just googled this and am not sure it's good enough.

Please advise if I'm wrong.

Thanks.
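
Both workarounds address the same mismatch: the awk loop pairs name i with field i, but field 1 of each line is the device name. A minimal sketch of a third variant, not tested anywhere in this thread, keeps the name list unchanged and reads field i+1 instead. The function name `iostat_line_to_json` and the inline sample line are only for illustration; the column list is the sysstat 10.1.5 one quoted above.

```shell
# Sample line in the sysstat 10.1.5 layout shown above; field 1 is the device name.
line='sdb               0.00     1.10    0.02    3.15     0.68    21.90    14.25     0.00    0.52    1.00    0.52   0.49   0.16'

# iostat_line_to_json is a hypothetical helper name, not part of the repository.
iostat_line_to_json() {
    awk 'BEGIN {
             n = split("rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm util", arr)
         }
         {
             print "{"
             for (i = 1; i <= n; ++i) {
                 # $(i + 1) skips the device name held in field 1
                 printf("\t\"%s\":\"%.2f\"", arr[i], $(i + 1))
                 if (i < n) printf(",\n")
             }
             print "\n}"
         }'
}

printf '%s\n' "$line" | iostat_line_to_json
```

This still assumes exactly 13 value columns in this order, so it shares the fragility of the original command when the iostat version changes.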

stephankn commented 2 years ago

I am not using the iostat template myself. Just to be certain that I understand this bug report correctly: the problem is that the values are "shifted" by one column?

A while ago I merged in some patches, and even then I disliked the way the values are collected. I think JSON output would be much more suitable, but that would require a larger rework of the scripts and template.

I might have some time later, so I will have a closer look. I'm going to set up a test system with Ubuntu 20.04 LTS. Are you using a specific distribution release that ships unusually old or new software versions?

stephankn commented 2 years ago

The problem looks worse than I first thought.

iostat in your version 10.1.5 outputs 13 values per device. The Ubuntu 20.04 release, with version 12.2.0, outputs many more:

root@ce52b0397de6:/# iostat -V
sysstat version 12.2.0
(C) Sebastien Godard (sysstat <at> orange.fr)
root@ce52b0397de6:/# iostat -xyd md2
Linux 5.4.0-88-generic (ce52b0397de6)   10/24/21        _x86_64_        (12 CPU)

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
md2              9.72    153.66     0.00   0.00    0.00    15.81   27.32    568.93     0.00   0.00    0.00    20.82    0.51   1556.09     0.00   0.00    0.00  3033.16    0.00   0.00

And the man page indicates that more fields might be added in future versions. So simply parsing the human-readable output does not seem reasonable.

I think I will rework the template to work with iostat JSON output as awk parsing has no future.
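
To illustrate why fixed column positions break across sysstat versions, here is a rough sketch of a text parser that takes the column names from the `Device` header row instead of a hard-coded list. It is not the approach adopted in this thread (that is the JSON output), and `parse_iostat_device` is a hypothetical helper name; the sample input is a shortened four-column excerpt of the 12.2.0 output above.

```shell
# parse_iostat_device is a hypothetical helper: it reads `iostat -x` text on stdin
# and prints name=value pairs for one device, using the header row for the names.
parse_iostat_device() {
    awk -v dev="$1" '
        $1 ~ /^Device/ {
            # remember the column names from the header row
            for (i = 2; i <= NF; ++i) name[i] = $i
            ncol = NF
            next
        }
        ncol && $1 == dev {
            # pair every value with the name found in the same column
            for (i = 2; i <= ncol; ++i) printf("%s=%s\n", name[i], $i)
        }'
}

# Shortened sample in the sysstat 12.2.0 layout (only four of the columns).
printf '%s\n' \
    'Device            r/s     rkB/s  aqu-sz  %util' \
    'md2              9.72    153.66    0.00   0.00' | parse_iostat_device md2
```

Even with the names read from the header, the item keys on the Zabbix side would still differ between versions (`avgqu-sz` versus `aqu-sz`, for example), which is one more argument for the structured output.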

stephankn commented 2 years ago

@gglybin I believe my recent change fixes the iostat template. I was only able to test it on a reduced demo system, where the item await was not available; I am not certain why. Would you be able to test the new template?

Updating is simple. On the server side, replace the iostat.conf file and adjust the crontab; the collector script is no longer needed. On the Zabbix side, import the new template and let Zabbix update the existing entries. The template now requires Zabbix 5.0 as a minimum. With support for 4.0 ending at the end of October, I think this is OK.

If this works, I could update one or two other things with the script to optimize the load and use default values for some items.

gglybin commented 2 years ago

@stephankn Thanks for your time. I'm not sure the new cron entry will work correctly with the iostat JSON option. While iostat is collecting its report for the 59-second interval, the output file looks like this:

# cat /tmp/iostat-cron.out
+++++
{"sysstat": {
        "hosts": [
                {
                        "nodename": "server-name",
                        "sysname": "Linux",
                        "release": "5.4.0-66-generic",
                        "machine": "x86_64",
                        "number-of-cpus": 2,
                        "date": "10/25/2021",
                        "statistics": [
+++++

So I guess the actual values are stored for just a second, until cron starts the next iostat run.

If I change the crontab as below:

+++++
# crontab -l
* * * * * /usr/bin/iostat -xyd -o JSON 40 1 > /tmp/iostat-cron.out
+++++

I can view the metrics during a 20-second window, until the next run begins:

# date && cat /tmp/iostat-cron.out
+++++
Mon Oct 25 13:45:43 MSK 2021
{"sysstat": {
        "hosts": [
                {
                        "nodename": "server-name",
                        "sysname": "Linux",
                        "release": "5.4.0-66-generic",
                        "machine": "x86_64",
                        "number-of-cpus": 2,
                        "date": "10/25/2021",
                        "statistics": [
                                {
                                        "disk": [
                                                {"disk_device": "vda", "r/s": 0.00, "w/s": 10.67, "rkB/s": 0.00, "wkB/s": 30.10, "rrqm/s": 0.00, "wrqm/s": 0.92, "rrqm": 0.00, "wrqm": 7.97, "r_await": 0.00, "w_await": 0.44, "aqu-sz": 0.00, "rareq-sz": 0.00, "wareq-sz": 2.82,  "svctm": 1.70, "util": 1.82}
                                        ]
                                }
                        ]
                }
        ]
}}
+++++

# date && cat /tmp/iostat-cron.out
+++++
Mon Oct 25 13:45:58 MSK 2021
{"sysstat": {
        "hosts": [
                {
                        "nodename": "server-name",
                        "sysname": "Linux",
                        "release": "5.4.0-66-generic",
                        "machine": "x86_64",
                        "number-of-cpus": 2,
                        "date": "10/25/2021",
                        "statistics": [
                                {
                                        "disk": [
                                                {"disk_device": "vda", "r/s": 0.00, "w/s": 10.67, "rkB/s": 0.00, "wkB/s": 30.10, "rrqm/s": 0.00, "wrqm/s": 0.92, "rrqm": 0.00, "wrqm": 7.97, "r_await": 0.00, "w_await": 0.44, "aqu-sz": 0.00, "rareq-sz": 0.00, "wareq-sz": 2.82,  "svctm": 1.70, "util": 1.82}
                                        ]
                                }
                        ]
                }
        ]
}}
+++++

# date && cat /tmp/iostat-cron.out
+++++
Mon Oct 25 13:46:03 MSK 2021
{"sysstat": {
        "hosts": [
                {
                        "nodename": "server-name",
                        "sysname": "Linux",
                        "release": "5.4.0-66-generic",
                        "machine": "x86_64",
                        "number-of-cpus": 2,
                        "date": "10/25/2021",
                        "statistics": [
+++++

Do you have any thoughts on how to work around this? I'm afraid Zabbix will not be able to read the metric values from the output file because of it.

Thanks.

stephankn commented 2 years ago

You are probably right. As I said, I only have a limited demo system, and it did not run iostat from cron, so that part was untested.

Thinking about it again, the file is most likely overwritten as soon as cron starts the command, because the shell sets up the redirect before iostat runs. iostat then blocks for the configured sampling interval, writes out the data at the end, and the redirect closes the file. The intention is different: the file should only be written after iostat returns. That is probably why the original script author captured the output in a variable.

I'm reopening the issue. Apart from the broken cron entry, is the template working for you?
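
One common way to get that behaviour, offered here as a sketch under assumptions and not as what the repository does, is to send the output to a temporary file and rename it over the target once the command has finished. Readers then see either the previous complete file or the new complete file. `write_atomically` is a hypothetical helper name, and the 0644 mode is an assumption so that an agent running as a different user can read the result.

```shell
# write_atomically is a hypothetical helper: it runs a command, stores its
# output in a temporary file next to the target, and renames it into place.
write_atomically() {
    target=$1
    shift
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        chmod 644 "$tmp"          # mktemp creates the file with mode 0600
        mv -f "$tmp" "$target"    # a rename on the same filesystem swaps the file in one step
    else
        rm -f "$tmp"              # keep the previous target if the command failed
        return 1
    fi
}

# Usage sketch (commented out because it blocks for the whole interval):
# write_atomically /tmp/iostat-cron.out /usr/bin/iostat -xyd -o JSON 59 1
```

Capturing the output in a shell variable and echoing it afterwards, as the original collector script does, narrows the window in a similar way, because the target file is only opened once iostat has returned.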

gglybin commented 2 years ago

You are absolutely right. I've changed the script as below:

# cat /usr/libexec/zabbix-extensions/scripts/iostat-collect.sh
+++++
#!/bin/bash
# Description:  Script for iostat monitoring
# Author:       Epikhin Mikhail michael@nomanlab.org
# Revision 1:   Lesovsky A.V. lesovsky@gmail.com
# Revision 2:   Sherstuk M.Y. maxim.sherstuk@gmail.com

SECONDS=$2
TOFILE=$1
IOSTAT="/usr/bin/iostat"

# be portable regarding number format
LC_ALL=C ; export LC_ALL

[[ $# -lt 2 ]] && { echo "FATAL: some parameters not specified"; exit 1; }

#DISK=$($IOSTAT -xyd "$SECONDS" 1 | awk 'BEGIN {check=0;} {if(check==1 && $1!=""){print $0}if($1~"^Device"){check=1}}' | tr '\n' '|')
#echo "$DISK" | sed 's/|/\n/g' > "$TOFILE"

DISK=$($IOSTAT -xyd -o JSON "$SECONDS" 1)
echo "$DISK" > "$TOFILE"
+++++

I'm going to check whether the template is working fine and will let you know ASAP.

Thanks.

gglybin commented 2 years ago

Here is my final setup:

$ crontab -l
++++++++++++
# Collect data iops for zabbix:
* * * * * /usr/libexec/zabbix-extensions/scripts/iostat-collect.sh /tmp/iostat-cron.out 59 >/dev/null 2>&1
++++++++++++

$ cat /usr/libexec/zabbix-extensions/scripts/iostat-collect.sh
++++++++++++
#!/bin/bash
# Description:  Script for iostat monitoring
# Author:       Epikhin Mikhail michael@nomanlab.org
# Revision 1:   Lesovsky A.V. lesovsky@gmail.com
# Revision 2:   Sherstuk M.Y. maxim.sherstuk@gmail.com

SECONDS=$2
TOFILE=$1
IOSTAT="/usr/bin/iostat"

# be portable regarding number format
LC_ALL=C ; export LC_ALL

[[ $# -lt 2 ]] && { echo "FATAL: some parameters not specified"; exit 1; }

##
##DISK=$($IOSTAT -xyd "$SECONDS" 1 | awk 'BEGIN {check=0;} {if(check==1 && $1!=""){print $0}if($1~"^Device"){check=1}}' | tr '\n' '|')
##echo "$DISK" | sed 's/|/\n/g' > "$TOFILE"
##

##
## Ref. to https://github.com/lesovsky/zabbix-extensions/issues/88
##

DISK=$($IOSTAT -xyd -o JSON "$SECONDS" 1)
echo "$DISK" > "$TOFILE"
++++++++++++

$ cat /etc/zabbix/zabbix_agentd.conf.d/iostat.conf
++++++++++++
# Disk statistics via iostat (sysstat)
UserParameter=iostat.discovery, iostat -d | awk 'BEGIN {check=0;count=0;array[0]=0;} {if(check==1 && $1 != ""){array[count]=$1;count=count+1;}if($1~"^Device"){check=1;}} END {printf("{\n\t\"data\":[\n");for(i=0;i<count;++i){printf("\t\t{\n\t\t\t\"{#HARDDISK}\":\"%s\"}", array[i]); if(i+1<count){printf(",\n");}} printf("]}\n");}'
UserParameter=iostat.summary[*], grep -w $1 /tmp/iostat-cron.out
++++++++++++

The template is working fine, but because I'm on iostat version 11.6.1 I had to correct some item prototypes (JSON keys). It's not a big problem, and I suppose there is no way around it.

Stephan, thanks for your help once again.

gglybin commented 2 years ago

There is an issue when I have multiple devices on the server. I'll take a look at it tomorrow and let you know.

stephankn commented 2 years ago

I checked it with discovery seeing four different devices. The template can handle this. It makes unnecessary requests, but apart from that it should work.

stephankn commented 2 years ago

@gglybin The helper script is no longer needed; the two commands can also run directly in cron, and I adjusted the crontab sample in the repository. Your script does the same, so you could keep it. The iostat.conf file needs some adjustment, though: please do not grep for anything. Zabbix needs the full JSON output, because the dependent items later fetch their values using a JSONPath expression.

So please also update iostat.conf (don't forget to restart the agent) and update the template, which contains the search expression.

All files in https://github.com/lesovsky/zabbix-extensions/tree/master/files/iostat

stephankn commented 2 years ago

If you update, you will get two more improvements. One optimizes performance: only a single agent call is made to fetch all statistics, which is a massive improvement if you monitor multiple devices. The other adds timestamp monitoring of the stats collector, to ensure you are monitoring live data.
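
How the repository implements the timestamp check is not shown in this thread. As a rough illustration of the idea only, the age of the collector's output file can be derived from its modification time; `file_age_seconds` is a hypothetical helper, and `stat -c %Y` is the GNU coreutils form, so this sketch assumes a Linux host.

```shell
# file_age_seconds is a hypothetical helper: it prints how many seconds ago
# the given file was last modified.
file_age_seconds() {
    now=$(date +%s)
    # GNU stat: %Y is the modification time in seconds since the epoch
    mtime=$(stat -c %Y "$1") || return 1
    echo $((now - mtime))
}

# Usage sketch (commented out because the file may not exist on every host):
# file_age_seconds /tmp/iostat-cron.out
```

With a collector that runs once a minute, an age well above 60 seconds would suggest that the cron job has stopped and the values being read are stale.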

gglybin commented 2 years ago

Everything is working great! Thanks a lot for your help.