munin-monitoring / munin

Main repository for munin master / node / plugins
http://munin-monitoring.org
Other
1.99k stars 474 forks source link

After buster->bullseye upgrade, munin is misplacing some data from munin-async #1429

Open rincebrain opened 3 years ago

rincebrain commented 3 years ago

Describe the bug I have a munin server, formerly running 2.0.67-1~bpo10+1 out of buster-backports, and a number of clients, including the machine that is also running the server, running Raspbian or Debian of formerly buster and now bullseye or sid (and therefore 2.0.67-[something]). They're all using munin-asyncd and retrieval via config lines of the form (no additional lines than these per host):

[avalon]
    address ssh://munin-async@10.0.0.3
    use_node_name yes

and then .ssh/authorized_keys of

no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty,no-user-rc,from="10.0.0.2",command="/usr/share/munin/munin-async --spoolfetch" ssh-ed25519 [key] munin@[server]

Since upgrading from Debian buster->bullseye on the munin server (and thus from 2.0.67-1~bpo10+1 to 2.0.67-3), a bunch of plugins on that machine have started having large gaps in their reported results - but only on that machine, not on any of the other clients.

For example, here's the rendering of the uptime plugin for the past day/week - the leftmost gap is from when I shut things down to buster->bullseye upgrade, the rest were while the server was up. (The "users" plugin, for example, also has these gaps, the vmstat plugin, a couple of ZFS usage plugins, the SMART per-disk plugins...)

day week

I don't pretend to understand how rrdtool functions well, but here's what a sample from one of those gaps looks like in the munin-asyncd spool and rrdtool dump on the DB:

 # egrep '(<cf>|2021-09-27 13)' /tmp/uptime
                <cf>AVERAGE</cf>
                        <!-- 2021-09-27 13:00:00 EDT / 1632762000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:05:00 EDT / 1632762300 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:10:00 EDT / 1632762600 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:15:00 EDT / 1632762900 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:20:00 EDT / 1632763200 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:25:00 EDT / 1632763500 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:30:00 EDT / 1632763800 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:35:00 EDT / 1632764100 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:40:00 EDT / 1632764400 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:45:00 EDT / 1632764700 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:50:00 EDT / 1632765000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:55:00 EDT / 1632765300 --> <row><v>NaN</v></row>
                <cf>MIN</cf>
                        <!-- 2021-09-27 13:00:00 EDT / 1632762000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:05:00 EDT / 1632762300 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:10:00 EDT / 1632762600 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:15:00 EDT / 1632762900 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:20:00 EDT / 1632763200 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:25:00 EDT / 1632763500 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:30:00 EDT / 1632763800 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:35:00 EDT / 1632764100 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:40:00 EDT / 1632764400 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:45:00 EDT / 1632764700 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:50:00 EDT / 1632765000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:55:00 EDT / 1632765300 --> <row><v>NaN</v></row>
                <cf>MAX</cf>
                        <!-- 2021-09-27 13:00:00 EDT / 1632762000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:05:00 EDT / 1632762300 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:10:00 EDT / 1632762600 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:15:00 EDT / 1632762900 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:20:00 EDT / 1632763200 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:25:00 EDT / 1632763500 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:30:00 EDT / 1632763800 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:35:00 EDT / 1632764100 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:40:00 EDT / 1632764400 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:45:00 EDT / 1632764700 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:50:00 EDT / 1632765000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:55:00 EDT / 1632765300 --> <row><v>NaN</v></row>
                <cf>AVERAGE</cf>
                        <!-- 2021-09-27 13:00:00 EDT / 1632762000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:30:00 EDT / 1632763800 --> <row><v>NaN</v></row>
                <cf>MIN</cf>
                        <!-- 2021-09-27 13:00:00 EDT / 1632762000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:30:00 EDT / 1632763800 --> <row><v>NaN</v></row>
                <cf>MAX</cf>
                        <!-- 2021-09-27 13:00:00 EDT / 1632762000 --> <row><v>NaN</v></row>
                        <!-- 2021-09-27 13:30:00 EDT / 1632763800 --> <row><v>NaN</v></row>
# egrep '(uptime.value|timestamp)' /var/lib/munin-async/munin-daemon.uptime.1632700800.86400
timestamp 1632762525
uptime.value 2.84
timestamp 1632762825
uptime.value 2.85
timestamp 1632763125
uptime.value 2.85
timestamp 1632763425
uptime.value 2.86
timestamp 1632763725
uptime.value 2.86
timestamp 1632764025
uptime.value 2.86
timestamp 1632764325
uptime.value 2.87
timestamp 1632764625
uptime.value 2.87
timestamp 1632764925
uptime.value 2.87
timestamp 1632765225
uptime.value 2.88
timestamp 1632765525
uptime.value 2.88
timestamp 1632765825
uptime.value 2.88
timestamp 1632766125
uptime.value 2.89
timestamp 1632766425
uptime.value 2.89
timestamp 1632766725
uptime.value 2.89

To Reproduce I genuinely don't know. Since none of my other buster->bullseye upgrades have done this, I've not got any brilliant ideas.

Expected behavior I'd rather expect, even if the data didn't show up in munin initially for some reason, for it to get backfilled from the munin-async spool once it did.

Screenshots & Logs Above.

As I mention below, this really looks like it's timing out somewhere and not finishing reading in the available data, and then not backfilling it later, but I'd expect it to log something about that somewhere if it happened, and...

# grep steamer /var/log/munin/munin-update.log | grep -v zfs_cache_efficiency | egrep '(Exit value|starting work|Munin-update)'
2021/09/27 13:00:02 [INFO] starting work in 3846 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:00:18 [INFO]: Munin-update finished for node steamer;steamer (16.14 sec)
2021/09/27 13:00:28 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:05:02 [INFO] starting work in 31687 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:05:17 [INFO]: Munin-update finished for node steamer;steamer (15.68 sec)
2021/09/27 13:05:27 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:10:02 [INFO] starting work in 70210 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:10:16 [INFO]: Munin-update finished for node steamer;steamer (14.78 sec)
2021/09/27 13:10:26 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:15:02 [INFO] starting work in 102302 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:15:18 [INFO]: Munin-update finished for node steamer;steamer (16.49 sec)
2021/09/27 13:15:28 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:20:02 [INFO] starting work in 122813 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:20:16 [INFO]: Munin-update finished for node steamer;steamer (14.64 sec)
2021/09/27 13:20:26 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:25:02 [INFO] starting work in 33454 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:25:17 [INFO]: Munin-update finished for node steamer;steamer (15.39 sec)
2021/09/27 13:25:27 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:30:02 [INFO] starting work in 69280 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:30:18 [INFO]: Munin-update finished for node steamer;steamer (16.25 sec)
2021/09/27 13:30:28 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:35:02 [INFO] starting work in 99948 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:35:16 [INFO]: Munin-update finished for node steamer;steamer (14.39 sec)
2021/09/27 13:35:26 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:40:02 [INFO] starting work in 129341 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:40:18 [INFO]: Munin-update finished for node steamer;steamer (15.96 sec)
2021/09/27 13:40:28 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:45:02 [INFO] starting work in 39692 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:45:17 [INFO]: Munin-update finished for node steamer;steamer (15.30 sec)
2021/09/27 13:45:27 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:50:02 [INFO] starting work in 62053 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:50:16 [INFO]: Munin-update finished for node steamer;steamer (13.13 sec)
2021/09/27 13:50:26 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 13:55:01 [INFO] starting work in 68420 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 13:55:15 [INFO]: Munin-update finished for node steamer;steamer (13.42 sec)
2021/09/27 13:55:25 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0
2021/09/27 14:00:02 [INFO] starting work in 85847 for steamer (ssh://munin-async@127.0.0.1).
2021/09/27 14:00:17 [INFO]: Munin-update finished for node steamer;steamer (15.21 sec)
2021/09/27 14:00:27 [INFO] Reaping Munin::Master::UpdateWorker<steamer;steamer>.  Exit value/signal: 0/0

Desktop (please complete the following information):

Additional context I look forward to finding out how I've somehow configured munin in a mad way. :)

(My guess, looking at it, is that something is timing out and not retrieving the rest of the data initially, and then not backfilling the absent data later. But I'd expect to see an error somewhere...)

sumpfralle commented 3 years ago

Maybe the spoolfetch timestamp (see line 238 in UpdateWorker.pm) is not stored properly at the moment. I think, it is stored after each plugin ("service"). Instead it should be stored just once after all services are processed (with the smallest timestamp). Or maybe we need to store separate timestamps for each service (now there is just one timestamp for the host).

Just a wild guess ...

rincebrain commented 3 years ago

After some tinkering and trying to convince munin that actually, there really was some data it should replace with the data from munin-asyncd...

And suddenly, somehow, uptime decided I had been up all along. (As did the other plugins.)

rincebrain commented 3 years ago

It would seem munin-async is being naughty and returning less information.

2021/09/28 05:50:06 [WARNING] 7 lines had errors while 27043 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 05:55:06 [WARNING] 6 lines had errors while 10336 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:00:06 [WARNING] 6 lines had errors while 10334 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:05:06 [WARNING] 6 lines had errors while 10334 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:10:05 [WARNING] 6 lines had errors while 10336 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:15:06 [WARNING] 6 lines had errors while 10332 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:20:05 [WARNING] 7 lines had errors while 11092 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:25:06 [WARNING] 6 lines had errors while 10895 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:30:06 [WARNING] 7 lines had errors while 27043 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949

My speculation is that if munin-update tries a --spoolfetch while munin-asyncd is trying to retrieve data while --fork is in use, and a subset of plugins have completed, it will return those and nothing else, and munin-update will conclude "welp that's everything"...

rincebrain commented 3 years ago

Playing with min_update_rate_threshold didn't help, but since adjusting minrate from 300 to 150, it hasn't recurred...so far.

I also have the full back and forth communication to munin-async --spoolfetch from a time when it delivered many fewer results and when it delivered the "full" normal amount of results, if anyone wants to examine all or part of it. Since this is tens of thousands of lines, I wouldn't blame anyone for not wanting to.

spacelama commented 10 months ago

Still happening January 2024 alas for a bookworm server, and most affecting a bullseye node.

I tried updating min_update_rate_threshold to 150 amongst various other things, but it only helped partially, for a while, and not for all munin nodes.

spacelama commented 9 months ago

For those coming back a week later and wondering what the hell they are talking about with min_update_rate_threshold, I appear to be talking about telling systemd to start munin-asyncd with --minrate 150

ExecStart=/usr/share/munin/munin-asyncd --minrate 150

I've ended up going with --fork instead of minrate 150, but more importantly, I discovered /etc/munin/munin.conf hadn't received any timeout_fetch_all_nodes and timeout_fetch_one_node settings since the machine hosting munin was upgraded from a prior version without those settings, and so was defaulting to 240 and 180. I adjusted them to 2400 and 1800 and I think that's fixed my problem.