Closed hugalafutro closed 3 years ago
I face this issue as well.
Same here.
Yesterday I also installed a fresh LibreNMS docker environment with docker-compose. No errors, and the installation validation is fine (same as for @hugalafutro). I am sure something is missing.
Example of a device with no data:
Hi
I have the same issue with application graphs (NTP Server, NTP Client) being impacted. This is in a non-Docker environment on Ubuntu. Happy to add specific details if that helps.
Kind regards
@nathanaelpearson That is strange, as I run both docker and normal installs (different machines) on Ubuntu 20 and this issue only happens on docker for me. The normal install is pretty much flawless.
After updating the docker install to 1.68 the graphs now seem to load correctly elsewhere apart from in the Top Devices widgets in dashboard (strangely enough, not for Top Devices - Poller).
English is not my native language and I'm having trouble describing exactly what happens, so I made a small video (I do not think this is the same issue as @Minocula has; your issue seems to be that the docker can't DNS-resolve the hosts. I had to add some hosts to /etc/hosts on the docker machine myself, funnily enough including the docker machine itself, which might be indicative of some underlying network issue, but I digress): youtube link
In the 1.66 docker install the same would happen on any screen with many detail graphs, such as the smart application or ports, but the missing graphs would be blank, not red; on reloading the page they'd load and a random set of others would become blank. This is no longer happening to me.
As previously stated if any logs are useful I'll supply them, however there are no errors reported in any of the containers while this happens as far as I can see.
Hi, this is what I'm getting, but only for apps:
When I click on a mini-graph I get the following:
From your description I wonder if these are two different issues with the same symptoms.
Kind regards
@nathanaelpearson I think your issue is with the application, not with graph drawing. Are you sure the data is getting to nms and the app works? In your first picture the graph drawing works, as it draws the axis and numbers, but the app either didn't write any data into the RRD or the PHP responsible for interpreting it has trouble (recently, of the apps I use, the pi-hole, apcupsd and unbound apps had similar issues).
The red mini-graphs then make sense, as there is no long-term data to create the graph. IIRC, this is exactly how it behaves when autodetecting/manually enabling an app, before it populates its RRD with data from a polling cycle, which is a state you're perpetually locked in if there is trouble getting the data from the app on the host into the RRD.
@hugalafutro
After updating the docker install to 1.68 the graphs now seem to load correctly elsewhere apart from in the Top Devices widgets in dashboard (strangely enough, not for Top Devices - Poller).
I think this has been fixed in 1.68 through librenms/librenms#12152
That's my PR (2nd ever, lol), but that was only for the unbound application. The youtube video I posted was taken from the docker install updated to 1.68, where the unbound app started drawing the queries graph, but the graphs in the dashboard still randomly load as red. The same goes for the mini-graphs in @nathanaelpearson's picture (the 6h, 24h, 48h, One Week, Two Week graphs).
@hugalafutro Ok I watched your video.
I had to add some hosts to /etc/hosts on the docker machine myself, funnily enough the docker machine itself too which might be indicative of some underlying network issue, but I digress):
I wonder which host you have added to your hosts file to make it work? Maybe rrdcached should be embedded in the main Docker image to avoid this kind of issue :thinking:. This is the only difference I can see compared to a classic installation.
nms is the host on which docker runs, and the LibreNMS installed in docker on it would not SNMP-discover the nms host until I added 127.0.1.1 nms to /etc/hosts. (No idea why that particular address, but that made it work; I believe that was the IP of nms.lan when I pinged it from inside the docker container.)
so now my /etc/hosts looks like:
127.0.0.1 localhost
127.0.1.1 nms
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
which makes nms pingable from every machine on the network.
I can't reboot the host atm to test, but IIRC adding nms to the 127.0.0.1 line worked too, though it caused another issue I can't remember.
I'm not really sure whether it's related as I'm not all that technologically proficient ;)
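The hosts-file edit described above can be scripted. This is a hedged sketch only (the name nms and the address 127.0.1.1 come from this thread); it operates on a temporary copy so it is safe to run anywhere, rather than touching the real /etc/hosts:

```shell
# Sketch: add "127.0.1.1 nms" to a hosts-style file if it is missing.
# Works on a temp copy; point it at /etc/hosts only deliberately.
hosts=$(mktemp)
printf '127.0.0.1 localhost\n' > "$hosts"
grep -q 'nms' "$hosts" || echo '127.0.1.1 nms' >> "$hosts"
grep 'nms' "$hosts"   # shows the entry that was added
rm -f "$hosts"
```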
@crazy-max In my own docker image I have integrated rrdcached into the main LibreNMS container and used the socket connection; there all is good. Since I try to use the official docker everywhere, I changed a test installation to this one, and here I also get those "red blocks". As far as I have seen, it is because of the network connection to rrdcached; something is not working optimally there, maybe some flushing? At least I think this is an rrdcached issue, but it can be fixed if used with sockets. Maybe we could share the socket and try to use this connection?
Hi, thanks for all the updates. I've checked other app polling and this only seems to affect the NTP server and NTP client graphs for me. The issue remains after the last update.
Kind regards
I've tested sharing the rrdcached socket. It felt slightly better, but I still get the red blocks. I can only get rid of them if I disable rrdcached in the docker setup in librenms.env
Disabling rrdcached in the .env file worked here too. I assume it might be an issue for large nms installs, but with my fewer than 20 devices I notice no difference in speed, and the damn red blocks are finally gone for the first time since I converted to the docker install.
Thanks for the workaround!
I just tried the latest image, removed the old rrdcached variables in librenms.env, added the new one, and rebuilt the container, and the random red blocks on graphs are still there :(
Disabling rrdcached by commenting out the RRDCACHED_SERVER=rrdcached:42217 line in librenms.env still seems to be the only way to get rid of them.
As someone who only monitors his home network with <20 devices, do I really need rrdcached? Perhaps a placebo effect, but I feel the whole thing is snappier without rrdcached.
EDIT:
It seems I spoke too soon. The red blocks returned after some time in 1.69, even with the changes outlined above and the rrdcached container commented out of docker-compose.yml. Fixed by pulling librenms/librenms:1.68 while keeping rrdcached disabled. Wouldn't that indicate the issue is elsewhere than in rrdcached, though?
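The workaround described in this thread amounts to commenting out one line in librenms.env. A minimal sketch (the file name and the RRDCACHED_SERVER value are from this thread; the TZ line and the temp file are illustrative so the example can run standalone):

```shell
# Sketch: comment out the RRDCACHED_SERVER line in a librenms.env-style file.
# TZ=UTC is a made-up placeholder variable to show other lines are untouched.
env_file=$(mktemp)
printf 'TZ=UTC\nRRDCACHED_SERVER=rrdcached:42217\n' > "$env_file"
sed -i 's/^RRDCACHED_SERVER=/#&/' "$env_file"
cat "$env_file"
rm -f "$env_file"
```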
Are you guys absolutely sure you aren't exhausting the php-fpm workers?
Check inside the librenms container with docker exec -it librenms bash and look in /opt/librenms/config.d for rrdcached.php. The second I deleted that file, graphs worked perfectly.
There is an error in commit 84083b8, so the rrdcached.php config file will never be empty. See my comment there.
@setiseta @CameronMunroe @hugalafutro
I can only get rid of them if I disable rrdcached in the docker setup in librenms.env
Check inside of the librenms container with docker exec -it librenms bash and look inside of cd /opt/librenms/config.d for rrdcached.php. The second I deleted that file, graphs work perfectly.
Disabling rrdcached in the .env file worked here too.
Removing the RRDCACHED_SERVER env var from librenms.env solved the issue for me as well. Any idea what could disrupt graphs in the LibreNMS PHP code, @murrant, if a remote rrdcached server is enabled?
Ok, it looks like we don't need the rrdcached service anymore; RRD data is populated directly by LibreNMS. But I wonder why, @murrant?
@crazy-max It was never required, just for better performance in bigger environments.
@setiseta Ok, thanks for the info. So it looks like an issue with a remote rrdcached server through the LibreNMS implementation.
@crazy-max It was put in place because of performance.
The question is: are you putting the BASE_OPTIONS variable in place, as per the documentation? When I looked at your docker image for rrdcached I didn't see it set.
BASE_OPTIONS="-B -F -R"
https://docs.librenms.org/Extensions/RRDCached/#rrdcached-installation-ubuntu-16
@crazy-max you only have one rrdcached instance total correct? (rrdcached does not play nicely with multiple instances)
@murrant Yes only one through this container.
@crazy-max took a look at the config
@murrant
Why is it not listening on a network socket?
Because it can be used in a cluster, we need a network address. The rrdcached service exposes port 42217 and is named rrdcached in our stack, so we define the rrdcached server as rrdcached:42217 in the config file.
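For context, a sketch of how that wiring might look in a compose file. This is an illustrative fragment, not the project's actual docker-compose.yml; the image names and layout are assumptions, while the service name rrdcached, port 42217, and the RRDCACHED_SERVER variable come from this thread:

```yaml
# Illustrative fragment only; not the official stack definition.
services:
  rrdcached:
    image: crazymax/rrdcached    # assumption: image name for illustration
    ports:
      - "42217"                  # rrdcached listening port used in this thread
  librenms:
    image: librenms/librenms
    environment:
      - RRDCACHED_SERVER=rrdcached:42217   # points LibreNMS at the sidecar
```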
Is JITTER supposed to be delay? It should have -z before it in the command.
Yes, it is defined here with the -z flag if set.
@crazy-max I see the jitter now, but rrdcached is not listening on 42217
-l /var/run/rrdcached/rrdcached.sock
https://github.com/crazy-max/docker-rrdcached/blob/6594a91d480c702daff56dc2b781a23ee5099c5c/rootfs/etc/cont-init.d/04-svc-main.sh#L24
Haven't really had time to play with this lately, but I finally found some yesterday, spun up a new container without rrdcached, and all seems well (even with a 1-minute polling interval)!
For myself this issue could be closed, but I'll leave that up to you as the issue presumably still exists while using rrdcached.
Hi all,
I now have the same issue with graphs intermittently displaying "Error Drawing Graph". I can sit there hitting F5 and watch as different graphs load. << This context is to ensure we are talking about the same issue.
I am testing on a fresh docker-compose build. Here is my testing process:
1) Build/run the stack using docker-compose
2) Add devices to the stack and wait for some RRD files to be generated, confirmed by the logs
3) Hover over the graphs: do they load? Do they load on refresh?
For each test I spin up a new MariaDB to keep testing consistent.
I can confirm graphs load correctly in 1.68, 1.69 and 1.70.1 but seem to regress in 21.1.0 and the graphs are intermittent in loading in this version.
It's worth noting the red boxes are back in 1.70.1, but the main graphs do load correctly.
@crazy-max , @murrant any ideas why this issue may have resurfaced?
Yes, it's the same. The red boxes changed since the last update to the text "Error Drawing Graph". It seems there is something not ok with the rrdcached connection. First I thought it was the TCP socket of rrdcached, because I earlier used a unix socket on my own container, and there it worked. But even when I shared the unix socket with the different containers, the red boxes appeared randomly. So in my opinion it is not the LibreNMS base, and not a docker concern, since it runs fine on my own container (but with rrdcached in the same container as the web). So it seems it is this docker setup with the separated containers, etc., but I've not found any hints to keep on searching. Maybe @crazy-max or @murrant can give some hints on how/where to search to resolve this.
On my installation it also did not work with the older versions; if rrdcached was configured, the red boxes appeared.
If you see "Error Drawing Graph", click on the graph, then click "show command". It will show the error.
I can't see the error, am I on the wrong page?
I think it's only a random single request which fails; if I refresh, the graph is ok. But on the dashboard with a lot of graphs there is often one graph that isn't ok.
And I think the rrdtool command and its output are processed in a separate request, or is it the same one?
I feel like the rrd command is working and returning the data, and there is something we are missing here in the way the data is being presented.
To be clear my RRD command does not fail, it returns the data I expect but the graph is displaying "Error Drawing Graph" like @setiseta mentions above.
I'll keep digging as I have this issue in production now because of the CVE patch in 21.1.0.
For awareness I have this issue with a libre stack running RRD 1.5.5 on a dedicated remote host and with my local docker dev container setup using RRD 1.7.2.
Perhaps the temporary file is missing? (note that it is deleted immediately after being served)
Yeah I started looking at that last night, I think it may be a resurfacing of https://github.com/librenms/docker/issues/51 which is also mentioned here - https://community.librenms.org/t/rrdtool-race-condition/12061
As in the forum post above I am unable to find the PR that actually fixed (or workaround) the rrd race condition issue and I wonder if https://github.com/librenms/librenms/pull/11865 has had a negative impact on the race condition which would only impact remote RRD installs as per the previous diagnosis by @dennypage.
I presume all you new people with this issue are using rrdcached? Ever since I turned it off (see https://github.com/librenms/docker/issues/124#issuecomment-730311049), my install has been displaying all graphs with no errors whatsoever, and I can refresh any page with any number of graphs and they never load as red blocks.
@hugalafutro Yes this issue is specifically about the rrdcached sidecar container. The main example does not use the rrdcached sidecar container.
@hugalafutro Since it was the default with rrdcached, I think all are using rrdcached. The default changed about two months ago to not include rrdcached.
@setiseta @crazy-max I see, that explains it. I just got spooked that the issue had returned because of all the replies. I'll keep an eye on the thread nonetheless so I can turn rrdcached back on when it gets resolved. Although I only monitor ~15 devices, I'd like the install running "as nature intended" without disabling sidecars.
@hugalafutro Sure, I will try to find some time to fix the RRDCached image based on @murrant comment https://github.com/librenms/docker/issues/124#issuecomment-730084131
I highly doubt https://github.com/librenms/librenms/pull/11865 has an impact; basically it fixed the images so they aren't all red and you can actually read the text that was previously red on a red background.
@murrant
I see the jitter now, but rrdcached is not listening on 42217
-l /var/run/rrdcached/rrdcached.sock
https://github.com/crazy-max/docker-rrdcached/blob/6594a91d480c702daff56dc2b781a23ee5099c5c/rootfs/etc/cont-init.d/04-svc-main.sh#L24
I have made some tests and the RRDCached daemon is actually listening on port 42217 through the -L flag, so I don't think that's the issue here.
$ docker-compose exec rrdcached netstat -aptn
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:42217 0.0.0.0:* LISTEN -
...
I have some time this week to look into this one again.
Any suggestions what may have changed after 1.66 that may have introduced this bug? Happy to try and track down any hunches.
Currently testing: tag 1.66 works as expected; 21.3.0 graphs intermittently fail with "Error Drawing Graph". The RRD data appears to be returned but the graphs still display the error.
Okay, today I started printing the output of the rrdtool_graph command in includes/html/graphs/graph.inc.php when there is a "bad" graph drawing. On "bad" requests, the value returned from the RRD command is not consistent with a successful request.
Good Output:
1593x344 OK u:0.06 s:0.02 r:0.09
Bad Output:
1617799200 OK u:0.00 s:0.00 r:0.00
If I manually take the same rrdtool graph command and run it, I get output similar to the "Good Output" (1593x344).
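To make the difference concrete, here is a small hedged sketch that classifies the two output shapes quoted above. The pattern (a healthy render reports its size as "WIDTHxHEIGHT OK ...", the bad case reports only a timestamp) is taken from this thread; the helper function name is hypothetical:

```shell
# Hypothetical helper: classify rrdtool graph status lines as seen above.
# A healthy render reports its size ("1593x344 OK ..."); the anomalous
# case reports only a timestamp ("1617799200 OK ...").
classify_rrd_output() {
  case "$1" in
    *[0-9]x[0-9]*" OK "*) echo "good"  ;;   # size present before "OK"
    *" OK "*)             echo "bad"   ;;   # "OK" but no size, e.g. a timestamp
    *)                    echo "error" ;;   # rrdtool reported a failure
  esac
}
classify_rrd_output "1593x344 OK u:0.06 s:0.02 r:0.09"    # prints "good"
classify_rrd_output "1617799200 OK u:0.00 s:0.00 r:0.00"  # prints "bad"
```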
This issue has been mentioned on LibreNMS Community. There might be relevant details there:
https://community.librenms.org/t/docker-stats-application-not-drawing-graphs/15329/7
Should be fixed with librenms/librenms#12746
Behaviour
Graphs load as red block randomly
Steps to reproduce this issue
Expected behaviour
The graphs should all load
Actual behaviour
The graphs randomly do not load
Configuration
docker --version: Docker version 19.03.12, build 48a66213fe
docker-compose --version: docker-compose version 1.25.0, build unknown
uname -a: Linux nms 5.4.0-47-generic #51-Ubuntu SMP Fri Sep 4 19:50:52 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Docker info
Logs
Validate:
There is also a discussion with some screenshots @ https://community.librenms.org/t/problems-with-graphs-red-only-missing-error-drawing-graph/10279