Open rincebrain opened 3 years ago
Maybe the spoolfetch
timestamp (see line 238 in UpdateWorker.pm) is not stored properly at the moment.
I think, it is stored after each plugin ("service"). Instead it should be stored just once after all services are processed (with the smallest timestamp). Or maybe we need to store separate timestamps for each service (now there is just one timestamp for the host).
Just a wild guess ...
After some tinkering and trying to convince munin that actually, there really was some data it should replace with the data from munin-asyncd...
/var/lib/munin/[host]
with copies from Saturday along with state-[host]-[host].storable
, as well as /var/lib/munin-async
for good measure.And suddenly, somehow, uptime decided I had been up all along. (As did the other plugins.)
It would seem munin-async is being naughty and returning less information.
2021/09/28 05:50:06 [WARNING] 7 lines had errors while 27043 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 05:55:06 [WARNING] 6 lines had errors while 10336 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:00:06 [WARNING] 6 lines had errors while 10334 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:05:06 [WARNING] 6 lines had errors while 10334 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:10:05 [WARNING] 6 lines had errors while 10336 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:15:06 [WARNING] 6 lines had errors while 10332 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:20:05 [WARNING] 7 lines had errors while 11092 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:25:06 [WARNING] 6 lines had errors while 10895 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
2021/09/28 06:30:06 [WARNING] 7 lines had errors while 27043 lines were correct in data from 'config __root__' on steamer/ssh://munin-async@10.0.0.2/4949
My speculation is that if munin-update tries a --spoolfetch while munin-asyncd is trying to retrieve data while --fork is in use, and a subset of plugins have completed, it will return those and nothing else, and munin-update will conclude "welp that's everything"...
Playing with min_update_rate_threshold didn't help, but since adjusting minrate from 300 to 150, it hasn't recurred...so far.
I also have the full back and forth communication to munin-async --spoolfetch from a time when it delivered many fewer results and when it delivered the "full" normal amount of results, if anyone wants to examine all or part of it. Since this is tens of thousands of lines, I wouldn't blame anyone for not wanting to.
Still happening January 2024 alas for a bookworm server, and most affecting a bullseye node.
I tried updating min_update_rate_threshold to 150 amongst various other things, but it only helped partially, for a while, and not for all munin nodes.
For those coming back a week later and wondering what the hell they are talking about with min_update_rate_threshold, I appear to be talking about telling systemd to start munin-asyncd with --minrate 150
ExecStart=/usr/share/munin/munin-asyncd --minrate 150
I've ended up going with --fork instead of minrate 150, but more importantly, I discovered /etc/munin/munin.conf hadn't received any timeout_fetch_all_nodes and timeout_fetch_one_node settings since the machine hosting munin was upgraded from a prior version without those settings, and so was defaulting to 240 and 180. I adjusted them to 2400 and 1800 and I think that's fixed my problem.
Describe the bug I have a munin server, formerly running 2.0.67-1~bpo10+1 out of buster-backports, and a number of clients, including the machine that is also running the server, running Raspbian or Debian of formerly buster and now bullseye or sid (and therefore 2.0.67-[something]). They're all using munin-asyncd and retrieval via config lines of the form (no additional lines than these per host):
and then .ssh/authorized_keys of
Since upgrading from Debian buster->bullseye on the munin server (and thus from 2.0.67-1~bpo10+1 to 2.0.67-3), a bunch of plugins on that machine have started having large gaps in their reported results - but only on that machine, not on any of the other clients.
For example, here's the rendering of the uptime plugin for the past day/week - the leftmost gap is from when I shut things down to buster->bullseye upgrade, the rest were while the server was up. (The "users" plugin, for example, also has these gaps, the vmstat plugin, a couple of ZFS usage plugins, the SMART per-disk plugins...)
day week
I don't pretend to understand how rrdtool functions well, but here's what a sample from one of those gaps looks like in the munin-asyncd spool and rrdtool dump on the DB:
To Reproduce I genuinely don't know. Since none of my other buster->bullseye upgrades have done this, I've not got any brilliant ideas.
Expected behavior I'd rather expect, even if the data didn't show up in munin initially for some reason, for it to get backfilled from the munin-async spool once it did.
Screenshots & Logs Above.
As I mention below, this really looks like it's timing out somewhere and not finishing reading in the available data, and then not backfilling it later, but I'd expect it to log something about that somewhere if it happened, and...
Desktop (please complete the following information):
Additional context I look forward to finding out how I've somehow configured munin in a mad way. :)
(My guess, looking at it, is that something is timing out and not retrieving the rest of the data initially, and then not backfilling the absent data later. But I'd expect to see an error somewhere...)