Open doublehp opened 7 years ago
Same problem with me. Ubuntu 16.04.3 LTS amd64
/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
Can you imagine what could have caused this message?
Does the problem still exist?
If yes: maybe you could hunt the file access down with strace?
Munin version 2.0.33-1
root@leon-03:~ 502# cat /var/log/munin/munin-graph.log | grep slab-reclaim-slab_unreclaimable-g | wc
1472 8096 226320
root@leon-03:~ 503# cat /var/log/munin/munin-graph.log | grep slab-reclaim-slab_unreclaimable-g | tail
'DEF:i165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:MIN' \
'DEF:g165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:AVERAGE' \
2018/02/21 10:13:48 [RRD ERROR] Unable to graph /var/cache/munin/www/doublehp.org/leon-03.doublehp.org/slab/reclaim-week.png : opening '/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
'DEF:a165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:MAX' \
'DEF:i165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:MIN' \
'DEF:g165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:AVERAGE' \
2018/02/21 10:13:48 [RRD ERROR] Unable to graph /var/cache/munin/www/doublehp.org/leon-03.doublehp.org/slab/reclaim-year.png : opening '/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
'DEF:a165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:MAX' \
'DEF:i165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:MIN' \
'DEF:g165c718cebb0b56=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:42:AVERAGE' \
Cause: a broken dep tree. This bug sounds to me like https://github.com/munin-monitoring/munin/issues/828
The big difference for me is that 829 is seen in an official plugin that was installed by Debian (so it should happen for 100% of munin users who use cron-graph; it would be interesting to check whether the issue also occurs with CGI), while 828 happens with home-made plugins. The root cause may be the same. Fixing one may fix 99% of the other. So I would start by fixing 828, and then hope that it also fixes 829.
No clue how to use strace inside a system service. But this bug is very easy to repro: on any VM-capable host, install a fresh Debian and munin, enable this plugin, and you get the bug (about 45 min for all steps).
Maybe this is a stupid question: does /var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd exist?
Here on my system the file has the following name:
/var/lib/munin/foo/bar-slab-reclaim-slab_unreclaimable-g.rrd
The difference is a dash instead of a slash between the hostname and the graph file name. Do you have any idea where this could come from?
No clue how to use strace inside a system service.
Since you are using the cron-based execution, you could prefix the munin call in the crontab with strace -o /tmp/munin-strace.log -f ... . Afterwards you can take a look at the log file and see if it was really the proper path that was checked.
(this check is not important, if the above rrd file does not exist)
ls -lha /var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd
ls: cannot access '/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
find /var/lib/munin/doublehp.org/leon-03.doublehp.org/ | wc
find: ‘/var/lib/munin/doublehp.org/leon-03.doublehp.org/’: No such file or directory
0 0 0
# find /var/lib/munin/doublehp.org/ | wc
4231 4231 359302
# find /var/lib/munin/doublehp.org/ | grep -i unreclaimable
/var/lib/munin/doublehp.org/leon-01.doublehp.org-slab-reclaim-slab_unreclaimable-g.rrd
/var/lib/munin/doublehp.org/leon-03.doublehp.org-slab-reclaim-slab_unreclaimable-g.rrd
as I said, the dep tree manager is heavily broken when we talk about multigraph. Working on bug 828 will give you many tips.
I don't mind breaking munin for a few slots, but editing the cron file and waiting 20 min for execution really bores me. Please give me a ready-to-use command. In particular, we can not strace scripts, and thus I don't want to strace /usr/bin/munin-cron. And stracing python is probably "pointless" ... so please be explicit and specific about the debug command you want.
Your command output seems to indicate that the update process indeed uses the path /var/lib/munin/doublehp.org/leon-01.doublehp.org-slab-reclaim-slab_unreclaimable-g.rrd, while the graph process uses /var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd:
update: foo-bar.rrd
graph: foo/bar.rrd
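To make the dash-versus-slash difference concrete, here is a minimal shell sketch. The helper `dash_variant` is hypothetical and only mimics the two naming patterns seen in this thread; it is not taken from Munin's code.

```shell
#!/bin/sh
# Hypothetical helper (not part of Munin): given the path the graph process
# asks for (<dbdir>/<group>/<host>/<file>), print the path the file has on
# disk in this thread (<dbdir>/<group>/<host>-<file>).
dash_variant() {
    path=$1
    # ${path%/*} is everything before the last slash, ${path##*/} the file name.
    printf '%s-%s\n' "${path%/*}" "${path##*/}"
}

wanted=/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd
on_disk=$(dash_variant "$wanted")
echo "graph wants: $wanted"
echo "check for:   $on_disk"
```

On an affected system one would expect `ls -l "$on_disk"` to succeed while `ls -l "$wanted"` fails, which matches the find output above.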
as I said, the dep tree manager is heavily broken when we talk about multigraph. Working on bug 828 will give you many tips.
I have no idea right now. Thus if you have a feeling for the problem: please share it with us or explore it further. I am interested in a solution.
In particular, we can not strace scripts [..]
I am not sure, if you meant the technical aspect or its feasibility: you can easily use strace with scripts.
In the case above it would have been interesting to see the output of grep slab-reclaim-slab_unreclaimable-g.rrd for the strace log file. But my confusion regarding the existence of the rrd file on your system was already solved with your last comment.
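For reference, a possible way to do this without editing the crontab is sketched below. The manual invocation in the comment is an assumption (it presumes sudo is available and that one manual run as the munin user is acceptable); the helper name `failed_rrd_opens` is made up for this example.

```shell
#!/bin/sh
# Assumed manual invocation (not run by this snippet): trace one munin-cron
# run as the munin user, follow child processes (-f) and write the trace to
# a file (-o), as suggested above for the crontab entry.
#   sudo -u munin strace -f -o /tmp/munin-strace.log /usr/bin/munin-cron

# Hypothetical helper: list the .rrd paths whose open/stat attempt failed
# with ENOENT in an strace log file.
failed_rrd_opens() {
    grep -E '\.rrd".*= -1 ENOENT' "$1" \
        | sed -E 's/^[^"]*"([^"]+\.rrd)".*/\1/' \
        | sort -u
}
```

After a traced run, `failed_rrd_opens /tmp/munin-strace.log` would print the paths that were actually requested but missing, which can then be compared with the files present under /var/lib/munin.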
I believe that stracing a python script is pointless; there should be verbose python flags instead; but give me any command, and I will report the output.
[..] but give me any command, and I will report the output.
As I said: I have no idea, what could be related to this issue. Since you expressed a certain understanding of the problem, I would be happy, if you could investigate it. I certainly do not know more about it than you - thus I am not in a better position to fix it.
/var/log/munin/munin-update.log
2018/03/06 18:15:12 [WARNING] 4 lines had errors while 1071 lines were correct in data from 'config meminfo' on host-01.doublehp.org/host-01/4949
2018/03/06 18:15:26 [WARNING] 4 lines had errors while 1071 lines were correct in data from 'config meminfo' on host-02.doublehp.org/host-02/4949
4 lines had errors while 1071 lines were correct in data from 'config meminfo' on host-01.doublehp.org/host-01/4949
Here in my munin-update.log I see the relevant broken lines just above this kind of log message. Can you share these?
Nope; I only see this warning alone; it's surrounded by INFO lines from other plugins and hosts. I am using cat munin-update.log | grep -i meminfo -B5 -A5 and I don't see any related lines nearby.
OK - I see.
The related messages were logged with DEBUG level - I just raised this to INFO: 34bf169ea61077772cb6a0d979acc3bb1d15a247
These are the parse messages:
[CONFIG multigraph meminfo] Service is now meminfo_phisical.mmap
[DEBUG] Protocol exception: unrecognized line 'direct_map_2m.info ' from meminfo on www/www/4949.
[DEBUG] Protocol exception: unrecognized line 'direct_map_4k.info ' from meminfo on www/www/4949.
[CONFIG multigraph meminfo] Service is now meminfo_virtual
[DEBUG] Protocol exception: unrecognized line 'mlocked.info ' from meminfo on www/www/4949.
[DEBUG] Protocol exception: unrecognized line 'shmem.info ' from meminfo on www/www/4949.
The underlying issue was already fixed in 4ce6dd69d90667dfdc3ac19ae2556cec8b093910 (on master). We will backport this to stable-2.0.
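For anyone who wants to check a plugin's config output for lines of this kind, here is a small sketch. It assumes that the rejected lines are the ones with an attribute name but no value (as 'direct_map_2m.info ' above suggests); the helper name `empty_value_lines` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read plugin config output on stdin and print, with
# line numbers, every line of the form "<field>.<attribute>" that has no
# value after it (only optional trailing whitespace).
empty_value_lines() {
    grep -nE '^[A-Za-z0-9_]+\.[A-Za-z_]+[[:space:]]*$'
}

# Assumed usage on the node, where munin-run executes a plugin the way
# the node would:
#   munin-run meminfo config | empty_value_lines
```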
Anyway: this seems to be unrelated to the original issue?
The debug messages should be a separate issue from the filename-path-construction problem at the top here.
@doublehp : the server side config for the node in question is relevant for how the path is constructed. Are you able to provide that this long after?
The server conf has not changed much since then. But I am not going to paste it all.
The most important change I did recently was to move from cron to CGI. The rest was minor changes like ... adding criticals around.
# cat /etc/munin/munin.conf | grep -v -e "^#" -e "\.warning" -e "\.critical"
includedir /etc/munin/munin-conf.d
graph_strategy cgi
max_graph_jobs 6
munin_cgi_graph_jobs 6
html_strategy cgi
contact.syslog.command logger -p user.crit -t "*** Munin-Alert"
[...] # previous hosts
[leon-03.doublehp.org]
address leon-03
use_node_name yes
#exim_mailstats.graph_period minute
exim_mailstats.graph_period hour
apt_all.graph_category debian
loggrep_exim.graph_period hour
loggrep.graph_category loggrep
loggrep.graph_period minute
loggrep.graph_order total count kernel cron munin muninalert unifi mongod initandlisten clientcursormon datafilesync conn exim zma zmb zmc zmd zms zmw zmpkg completed in out conv upsd upslost upsmon bash bashhist sudo info warn missing err error crit fatal
loggrep_messages.graph_category loggrep
loggrep_messages.graph_period minute
loggrep_messages.graph_order total count kernel cron munin muninalert unifi mongod initandlisten clientcursormon datafilesync conn exim zma zmb zmc zmd zms zmw zmpkg completed in out conv upsd upslost upsmon bash bashhist sudo info warn missing err error crit fatal mysql mysqld_safe
[...] # about 200 lines of more loggrep customisation
loggrep_zmweb.graph_period minute
loggrep_zmweb.graph_order total count kernel cron munin muninalert unifi mongod initandlisten clientcursormon datafilesync conn exim zma zmb zmc zmd zms zmw zmpkg completed in out conv upsd upslost upsmon bash bashhist sudo info war warn missing err error crit fatal
[...] # other hosts
Because of the switch from cron to CGI, errors show up less often in the node logs; but I still run a wget -r daily to surface them at least once a day. I don't see any more permanent critical states in loggrep; but if I visit the pages and zoom manually on 6AM, I can see if the curve goes above 0; or I can check the MAX value for the ERROR field (in the loggrep graph).
I don't understand why loggrep on munin's internal logs is not configured by default; munin is a monitoring tool, and the first thing it should monitor is itself ... for sanity. If you don't have that, you have to manually inspect every single graph for proper generation.
Anyway, the graph still does not work any better than last year, so the bug keeps occurring with the above conf.
A quick check:
# cat /var/log/munin/munin-cgi-graph.log | grep reclaim-week.png
2018/03/08 06:53:41 [RRD ERROR] Unable to graph /var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-01.doublehp.org/slab/reclaim-week.png : opening '/var/lib/munin/doublehp.org/leon-01.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
2018/03/08 06:53:41 [RRD ERROR] rrdtool 'graph' '/var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-01.doublehp.org/slab/reclaim-week.png' \
2018/03/08 06:53:41 [WARNING] Could not draw graph "/var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-01.doublehp.org/slab/reclaim-week.png": /var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-01.doublehp.org/slab/reclaim-week.png
2018/03/08 06:53:41 [RRD ERROR] Unable to graph /var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-03.doublehp.org/slab/reclaim-week.png : opening '/var/lib/munin/doublehp.org/leon-03.doublehp.org/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
2018/03/08 06:53:41 [RRD ERROR] rrdtool 'graph' '/var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-03.doublehp.org/slab/reclaim-week.png' \
2018/03/08 06:53:41 [WARNING] Could not draw graph "/var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-03.doublehp.org/slab/reclaim-week.png": /var/lib/munin/cgi-tmp/munin-cgi-graph/doublehp.org/leon-03.doublehp.org/slab/reclaim-week.png
As said above, now, it happens only once, when I do the daily wget -r.
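A quick way to summarise such a log is sketched below. It only assumes the "[RRD ERROR] Unable to graph ... : opening '...'" line format quoted above; the helper name `missing_rrd_summary` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: from a Munin graph log, print each distinct .rrd path
# that could not be opened, prefixed by the number of error lines naming it.
missing_rrd_summary() {
    grep -F '[RRD ERROR] Unable to graph' "$1" \
        | sed -n "s/.*opening '\([^']*\.rrd\)'.*/\1/p" \
        | awk '{ count[$0]++ } END { for (p in count) print count[p], p }' \
        | sort -k2
}
```

For example, `missing_rrd_summary /var/log/munin/munin-cgi-graph.log` would list one line per missing file instead of one line per failed PNG.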
Hello! Have the same problem. Centos 7 + munin (cgi mode).
I would see this problem with old versions of the meminfo plugin, which had also used an abundance of arbitrary categories. It would cause munin-html to render not only those new categories on the left, but also invent new meminfo-related items on the main index page underneath each server, which in turn just led to broken pages.
In the meantime I just happened to drop all elements of meminfo other than appinfo and appgroupinfo, and those two don't happen to cause the bad HTML/graph links to be generated.
The configuration I used to reproduce was just a /etc/munin/plugin-conf.d/local-meminfo file that contained something mundane like:
[meminfo]
env.enabled_graphs (meminfo_phisical|appinfo|appgroupinfo)
env.applications_group apache:webservers;postgres:databases
env.applications (apache|postgres)
Had the same issue on Raspbian/Buster. The *.rrd files exist, but they are not located in the expected directory and have a different filename.
I created the expected directory and two symbolic links.
Example in my case:
mkdir -p /var/lib/munin/localdomain/localhost.localdomain
cd /var/lib/munin/localdomain/localhost.localdomain
ln -s ../localhost.localdomain-slab-reclaim-slab_reclaimable-g.rrd ./slab-reclaim-slab_reclaimable-g.rrd
ln -s ../localhost.localdomain-slab-reclaim-slab_unreclaimable-g.rrd ./slab-reclaim-slab_unreclaimable-g.rrd
chown -R munin:munin /var/lib/munin/localdomain/localhost.localdomain
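The same workaround can be sketched as a small function that handles every slab file of a host at once. This is only an illustration of the commands above: the function name `link_slab_rrds` and its two parameters are made up, and the chown step is left as a comment because it needs the munin user and root.

```shell
#!/bin/sh
# Hypothetical helper: for every <groupdir>/<host>-slab-*.rrd file, create
# <groupdir>/<host>/slab-*.rrd as a relative symlink pointing back to it.
link_slab_rrds() {
    groupdir=$1   # e.g. /var/lib/munin/localdomain
    host=$2       # e.g. localhost.localdomain
    mkdir -p "$groupdir/$host" || return 1
    for src in "$groupdir/$host"-slab-*.rrd; do
        [ -e "$src" ] || continue      # the glob matched nothing
        name=${src##*/}                # <host>-slab-....rrd
        # Strip the "<host>-" prefix to get the name the graph process wants.
        ln -sf "../$name" "$groupdir/$host/${name#"$host"-}"
    done
    # As in the commands above, ownership may need fixing afterwards:
    #   chown -R munin:munin "$groupdir/$host"
}
```

Usage for the case above would be `link_slab_rrds /var/lib/munin/localdomain localhost.localdomain`. Note that this only hides the path mismatch; it does not fix its cause.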
Hello from 2022. Seems like this bug is still here. I got Munin 2.0.49 on Debian 10 and Munin shows no graphs. munin-graph.log shows
2022/10/25 03:25:14 [WARNING] Could not draw graph "/var/cache/munin/www/localdomain/localhost.localdomain/slab/reclaim-week.png": /var/cache/munin/www/localdomain/localhost.localdomain/slab/reclaim-week.png
2022/10/25 03:25:14 [RRD ERROR] Unable to graph /var/cache/munin/www/localdomain/localhost.localdomain/slab/reclaim-day.png : opening '/var/lib/munin/localdomain/localhost.localdomain/slab-reclaim-slab_unreclaimable-g.rrd': No such file or directory
while that rrd file actually exists as /var/lib/munin/localdomain/localhost.localdomain-slab-reclaim-slab_unreclaimable-g.rrd.
Any news on that?
I came across the old ticket (https://web.archive.org/web/20171006032217/http://munin-monitoring.org/ticket/1224) but that didn't help.
I linked the plugin from /etc to /usr; using default configuration. The error is repeated for all 4 png. For each node using it.
Munin version 2.0.33-1
munin-graph.log
Config:
Fetch: