munin-monitoring / munin

Main repository for munin master / node / plugins
http://munin-monitoring.org

Unreliable behavior of munin master when there are no new spoolfetch data available #634

Open shapirus opened 8 years ago

shapirus commented 8 years ago

When the munin master connects to the target node and munin-async has no new data to return for the spoolfetch request, the graphs for which no new data were produced become excluded from the content generated by munin-httpd, both the HTML pages and the graph images themselves.

Steps to reproduce:

  1. set up munin-master with a node in an asynchronous configuration (munin-async+munin-asyncd)
  2. wait until the spoolfetch dir gets populated with some data gathered from munin-node
  3. remove all generated files from /var/lib/munin on the master server
  4. run munin-cron
  5. visit the munin-httpd page and see that all the graphs are present and displayed correctly
  6. immediately run munin-cron again before the next munin-asyncd run pulls new data from munin-node
  7. visit the munin-httpd page again and see that there are now no graphs whatsoever
  8. run munin-cron after munin-asyncd has pulled some new data and see that the graphs reappear.
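For reference, a rough command-level sketch of the steps above, assuming a Debian-style layout (the spooldir path, the munin user and the munin-cron path are assumptions and may differ on other systems):

```
# 2. wait until munin-asyncd has written something into its spool directory
ls /var/lib/munin-async/                     # assumed default spooldir; check your munin-asyncd config

# 3. wipe the master's generated state, including the datafile (test systems only!)
rm -rf /var/lib/munin/*

# 4./6./8. trigger a master run manually as the munin user
su - munin -s /bin/sh -c /usr/bin/munin-cron
```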

It appears to be a policy question rather than a technical one what we want to do about graphs for which there is no recent data: hide them altogether, or display them with an empty gap where data are missing (yet).

To me it sounds like the best solution would be to always display all graphs, even when their data are stale, until they reach a certain (ideally user-configurable) age without updates.

A curious consequence of the current behavior is a race condition which I encountered while testing a single-node setup: munin master + munin-node with munin-async/spoolfetch. Currently munin-asyncd launches its data-gathering runs on multiples of 5 minutes: 00, 05, 10 etc. At the same time, munin-cron is by default configured with the same crontab schedule: `*/5 * * * *`. Sure enough, munin-update connects to the target node (itself, in my case) and runs munin-async to fetch the spooldir data at the very moment that the same spooldir is being populated by munin-asyncd. This results in an incomplete data set received by munin-update and, consequently, missing graphs. In my case the behavior was curious: everything is fine after the first munin-cron run, but some graphs go missing starting from the second run and never reappear. In my setup I solved this by changing the munin-cron schedule to `1-59/5 * * * *`, so that it runs one minute after munin-asyncd, by which time all spoolfetch data should usually be ready.
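For illustration, here is what the two schedules look like as crontab entries; the command line follows a typical Debian `/etc/cron.d/munin` and is an assumption for other installations:

```
# default: collides with munin-asyncd, which also wakes up at minutes 0, 5, 10, ...
*/5 * * * *     munin if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi

# offset by one minute (1, 6, 11, ...), so the spool is complete before the master reads it
1-59/5 * * * *  munin if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi
```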

shtrom commented 4 years ago

This behaviour is still present on munin 2.0.49-1 on Debian and munin-* 2.0.47 on OpenBSD.

sumpfralle commented 4 years ago

Does anyone have an idea where the code decides not to generate the respective graphs?

shapirus commented 4 years ago

Too bad I don't remember it now; I haven't been using munin since the end of 2016. However, the code must be somewhere in what munin-cron executes shortly before graph generation. It should not be too difficult to trace. I think I had actually found it back then, without even being very familiar with the munin code in general.

shtrom commented 4 years ago

Sounds like it could be due to this: https://github.com/munin-monitoring/munin/issues/1091#issuecomment-424145219

We don't get any config, so we can't build the graph.

sumpfralle commented 4 years ago

Sounds like it could be due to this: #1091 (comment)

I clarified the comment you refer to (it was written by me). It was corrected later in that discussion by niclan. Summary: in theory, old configuration data should be kept even in case of recent connection errors.

But your comment encouraged me to read the initial description of the bug reporter again:

  3. remove all generated files from /var/lib/munin on the master server

If the file /var/lib/munin/datafile is removed on the master, then there is no reference left to the configuration of previously encountered plugins on remote hosts. Thus the graphs cannot be generated, for a good reason.
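For context, the datafile is a plain-text dump of the plugin configuration the master learned during previous runs; a quick way to see whether it still knows about a plugin (hedged: the exact line format may vary between munin versions):

```
# lines look roughly like "group;host:plugin.attribute value"; if a plugin's
# graph_title is gone from here, munin-html/munin-graph have nothing to draw
grep 'graph_title' /var/lib/munin/datafile | head
```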

But I assume that you (@shtrom) encountered the disappearance of graphs under different circumstances. Could you describe these, please?

shtrom commented 4 years ago

I can have a look on my system to check the datafile.

Apart from that, I did see exactly what this issue describes: after switching from a direct remote munin-node to local munin-asyncd, with the remote munin-update running at */5 in cron, some graphs sometimes disappear. They generally reappear a few periods later, and then some others disappear.

As suggested by the OP, I changed the periodicity of munin-cron to 3-59/5 (so it runs every 5 minutes, at minutes 3, 8, 13, ...), and the graphs stopped disappearing.

shtrom commented 4 years ago

Ok, the datafile looks like some sort of “latest snapshot” of the data received. But if the latest spoolfetch did not have any new data for a particular plugin, it would simply not output anything.

So, maybe, the datafile would no longer have any reference to that particular plugin, and the next run of graph generation would skip it?
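To make that concrete, here is a heavily hedged sketch of what a spoolfetch exchange could look like (illustrative only, not verbatim protocol output; the epoch values and the plugin name are made up):

```
# the master (conceptually) asks the node's munin-async for everything
# newer than the last timestamp it has already seen:
spoolfetch 1600000000

# reply (illustrative): only plugins with new samples appear at all
multigraph cpu
graph_title CPU usage
user.value 1600000300:12.3
.
# a plugin whose spool gained no rows since 1600000000 is simply absent here
```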

sumpfralle commented 4 years ago

I assume (I do not use it, so I am just guessing) that the async data collection works as follows:

  1. munin-asyncd gathers data on a node and stores it locally
  2. the master periodically executes munin-async on the node (as part of its data collection): munin-async outputs the locally stored data
  3. the master processes the incoming data precisely as it would process data from a regular munin-node service

Thus I think there can be no source of problems in (1) or (2), since even missing data snippets would simply lead to missing data points. These would be compensated for in (3), since the master keeps its knowledge of old plugin configurations in its datafile, even if they go missing for a few periods.
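To make step (2) concrete, the master usually reaches munin-async over ssh; a hedged munin.conf fragment along the lines of the munin-async documentation (user name, host and path are placeholders and installation-dependent):

```
# on the master, in munin.conf: instead of connecting to munin-node on port 4949,
# run munin-async on the node over ssh and let it replay the spooled data
[node.example.com]
    address ssh://munin-async@node.example.com /usr/share/munin/munin-async --spoolfetch
    use_node_name yes
```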

As far as I can see, access to the datafile is implemented in a robust and safe way at the moment (see ce1c507b23a898e375c6f8dd575fc1af61338f5e and 8d225d0883b9259fc41ffe83cb2dd956b9284c83 for significant fixes in v2.0.48 and v2.0.20).

Thus at the moment (unless more problems with parsing/updating the datafile are hidden), I can hardly imagine how a plugin is graphed successfully but disappears in the next turn due to the lack of new input data.

Regarding the timing collision between the local munin-asyncd execution and the data gathering of the master: here I can imagine problems with non-atomic writes on the remote node. But after taking a quick look at the code (node/_bin/munin-asyncd.in), it looks like the only problem could be partially written plugin data. This could result in a partial plugin data transfer when the master collects it. I assume that the master can handle this gracefully; otherwise I would at least expect warning messages in the log of munin-update.
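As an aside, the usual way to avoid exposing partially written spool files is to write to a temporary file and rename it into place. This is a generic sketch, not a claim about what munin-asyncd currently does; the spool path, the file name and the `collect_plugin_output` helper are made up:

```
# write the new spool chunk somewhere private first ...
tmp=$(mktemp /var/lib/munin-async/.cpu.XXXXXX)
collect_plugin_output cpu > "$tmp"        # hypothetical helper gathering the plugin data

# ... then publish it atomically: readers see either the old or the new file,
# never a half-written one (rename(2) is atomic on the same filesystem)
mv "$tmp" /var/lib/munin-async/cpu.1600000000
```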

Summary: I am a bit lost as to where exactly the issue could be. Access to a host showing this behaviour or (even better) a simple and fairly reliable set of instructions for reproducing the issue would be nice. The original reporter's description is not usable as-is, since its third step removes the datafile, which obviously leads to the problems one would expect.

shtrom commented 4 years ago

AFAIK, your description is correct.

I think a standard munin-async setup as described in the docs should be sufficient. Supplement that by enabling all the plugins suggested by munin-node-configure for good measure, and you should be about right.

If it still doesn't happen, perhaps hack, say, the uptime plugin to add an arbitrarily long delay (maybe 1 or 2 minutes).
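A hedged sketch of such a hack, assuming a shell wrapper around the stock uptime plugin; the plugin paths and the `uptime_slow` name are assumptions and depend on the installation:

```
#!/bin/sh
# /etc/munin/plugins/uptime_slow - wrapper around the stock uptime plugin with an
# artificial delay, to make one plugin's data reliably arrive "late"
case "$1" in
    config)
        /usr/share/munin/plugins/uptime config
        ;;
    *)
        sleep 90                      # arbitrary delay of ~1.5 minutes
        /usr/share/munin/plugins/uptime
        ;;
esac
```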

shtrom commented 4 years ago

Actually, maybe a different way to trigger this:

ssm commented 4 years ago

That's a "since the beginning of time" issue. Running "munin-update" writes the statefile used by the other munin-* master components. If you restrict what munin-update will fetch, that will also reduce the the data available for munin-graph and munin-html.

See also #720

shtrom commented 4 years ago

Right, it does sound like #720 would solve this issue, too.