librenms / docker

LibreNMS Docker image

Graphs randomly fail to load with RRDCached sidecar container #124

Closed hugalafutro closed 3 years ago

hugalafutro commented 4 years ago

Behaviour

Graphs randomly load as red blocks

Steps to reproduce this issue

  1. use docker librenms install
  2. refresh a page with any number of graphs (for me the most obvious culprits are the "Top Devices" widgets on the dashboard, but it happens everywhere)
  3. observe that on every refresh a random number of the graphs fail to load and show as red squares, while some of the previously failed ones load and display correct data

Expected behaviour

The graphs should all load

Actual behaviour

The graphs randomly do not load

Configuration

Docker info

Client:
 Debug Mode: false

Server:
 Containers: 21
  Running: 20
  Paused: 0
  Stopped: 1
 Images: 19
 Server Version: 19.03.12
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-47-generic
 Operating System: Ubuntu 20.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.661GiB
 Name: nms
 ID: ZYME:SMME:SOUR:BI62:XXOT:LR2H:THH4:FOAJ:IGJC:QXQZ:ZNW3:7RQE
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Logs

Nothing in any of the librenms container logs indicates an error.

Validate:

====================================
Component | Version
--------- | -------
LibreNMS  | 1.66
DB Schema | 2020_08_28_212054_drop_uptime_column_outages (173)
PHP       | 7.3.21
Python    | 3.8.5
MySQL     | 10.4.14-MariaDB-1:10.4.14+maria~focal
RRDTool   | 1.7.2
SNMP      | NET-SNMP 5.8
====================================

[OK]    Installed from the official Docker image; no Composer required
[OK]    Database connection successful
[OK]    Database schema correct
[WARN]  IPv6 is disabled on your server, you will not be able to add IPv6 devices.
[WARN]  Updates are managed through the official Docker image

There is also a discussion with some screenshots @ https://community.librenms.org/t/problems-with-graphs-red-only-missing-error-drawing-graph/10279

Munzy commented 4 years ago

I face this issue as well.

rigocalin commented 4 years ago

Same here.

Minocula commented 4 years ago

Yesterday I also installed a fresh librenms docker environment with docker compose. No errors. Installation validation is fine (same as for @hugalafutro). I am sure something is missing.

[screenshot: librenms poller status]

Example of a device with no data: [screenshot: librenms no graph data example]

nathanaelpearson commented 3 years ago

Hi

I have the same issue with application graphs (NTP Server, NTP Client) being impacted. This is in a non-Docker environment on Ubuntu. Happy to add specific details if that helps.

Kind regards

hugalafutro commented 3 years ago

@nathanaelpearson that is strange as I run both docker and normal install (different machines) on ubuntu 20 and this issue only happens on docker for me. The normal install is pretty much flawless.

After updating the docker install to 1.68 the graphs now seem to load correctly elsewhere apart from in the Top Devices widgets in dashboard (strangely enough, not for Top Devices - Poller).

English is not my native language - I'm having trouble describing exactly what happens so I made a small video (i.e. I do not think this is the same issue as @Minocula has - your issue seems to be the docker can't dns resolve the hosts, I had to add some hosts to /etc/hosts on the docker machine myself, funnily enough the docker machine itself too which might be indicative of some underlying network issue, but I digress): youtube link

In the 1.66 docker install the same thing would happen on any screen with many detail graphs, such as the SMART application or ports, but the missing graphs would be blank, not red; on reloading the page they'd load and a random set of others would become blank. This no longer happens for me.

As previously stated if any logs are useful I'll supply them, however there are no errors reported in any of the containers while this happens as far as I can see.

nathanaelpearson commented 3 years ago

Hi, this is what I'm getting, but only for apps:

[screenshot]

When I click on a mini-graph I get the following: [screenshot]

From your description I wonder if these are two different issues with the same symptoms.

Kind regards

hugalafutro commented 3 years ago

@nathanaelpearson I think your issue is with the application, not with graph drawing. Are you sure the data is getting to nms and the app works? In your first picture the graph drawing works, as it draws the axes and numbers, but either the app didn't write any data into the RRD or the PHP responsible for interpreting it has trouble (of the apps I use, pi-hole, apcupsd and unbound recently had similar issues).

The red mini-graphs then make sense, as there is no long-term data to create the graph. IIRC this is exactly how it behaves after autodetecting or manually enabling an app but before a polling cycle has populated its RRD with data, which is a state you're perpetually locked in if there is trouble getting the data from the app on the host into the RRD.

crazy-max commented 3 years ago

@hugalafutro

> After updating the docker install to 1.68 the graphs now seem to load correctly elsewhere apart from in the Top Devices widgets in dashboard (strangely enough, not for Top Devices - Poller).

I think this has been fixed in 1.68 through librenms/librenms#12152

hugalafutro commented 3 years ago

That's my PR (2nd ever lol), but that was only for the unbound application. The YouTube video I posted was taken from the docker install updated to 1.68, where the unbound app started drawing the queries graph, but the graphs on the dashboard still randomly load as red. So do the mini graphs as in @nathanaelpearson's picture (the 6h 24h 48h One Week Two Week graphs).

crazy-max commented 3 years ago

@hugalafutro Ok I watched your video.

> I had to add some hosts to /etc/hosts on the docker machine myself, funnily enough the docker machine itself too which might be indicative of some underlying network issue, but I digress):

Wonder what host you have added in your hosts file to make it work? Maybe rrdcached should be embedded in the main Docker image to avoid this kind of issue :thinking:. This is the only difference I can see compared to a classic installation.

hugalafutro commented 3 years ago

nms is the host on which docker runs. The librenms installed in docker on it would not SNMP-discover the nms host until I added 127.0.1.1 nms to /etc/hosts (no idea why that particular address, but it made it work; I believe that was the IP of nms.lan when I pinged it from inside the docker container).

so now my /etc/hosts looks like:

127.0.0.1 localhost
127.0.1.1 nms

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

which makes nms pingable from every machine on the network.

I can't reboot the host atm and test, but iirc adding nms to the 127.0.0.1 line worked too, though it caused another issue I can't remember.

I'm not really sure whether it's related as I'm not all that technologically proficient ;)
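To see which address a name maps to in a hosts file like the one above, the lookup can be sketched in a few lines of Python. This is only an illustration of how such a file is read (first matching entry wins, text after `#` is a comment); the function name `lookup_in_hosts` is made up here. Keep in mind that /etc/hosts only affects name lookups on the machine that holds the file.

```python
from typing import Optional


def lookup_in_hosts(hosts_text: str, name: str) -> Optional[str]:
    """Return the first address mapped to `name` in hosts-file text, or None."""
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into address + names.
        entry = line.split("#", 1)[0].split()
        if len(entry) >= 2 and name in entry[1:]:
            return entry[0]
    return None


# Excerpt of the hosts file quoted in this comment.
HOSTS = """\
127.0.0.1 localhost
127.0.1.1 nms

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
"""

print(lookup_in_hosts(HOSTS, "nms"))
```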

setiseta commented 3 years ago

@crazy-max In my own docker image I have integrated rrdcached into the main librenms container and use the socket connection. There everything works. Since I try to use the official docker images everywhere, I switched a test installation to this one, and here I also get those "red blocks". As far as I can see it is because of the network connection to rrdcached: something is not working optimally, maybe some flushing? At least I think this is an rrdcached issue, but it could be fixed by using sockets. Maybe we could share the socket and try to use that connection?

nathanaelpearson commented 3 years ago

@hugalafutro

> After updating the docker install to 1.68 the graphs now seem to load correctly elsewhere apart from in the Top Devices widgets in dashboard (strangely enough, not for Top Devices - Poller).
>
> I think this has been fixed in 1.68 through librenms/librenms#12152

Hi, thanks for all the updates. I've checked other app polling and this only seems to affect the NTP server and NTP client graphs for me. That issue remains after the last update.

Kind regards

setiseta commented 3 years ago

I've tested to share the rrdcached socket. It felt slightly better, but I also get the red blocks. I can only get rid of them if I disable rrdcached in the docker setup in librenms.env

hugalafutro commented 3 years ago

> I've tested to share the rrdcached socket. It felt slightly better, but I also get the red blocks. I can only get rid of them if I disable rrdcached in the docker setup in librenms.env

Disabling rrdcached in the .env file worked here too. I assume it might be an issue for large nms installs, but with my fewer than 20 devices I notice no difference in speed, and the damn red blocks are finally gone for the first time since I converted to the docker install.

Thanks for the workaround!

hugalafutro commented 3 years ago

I just tried the latest image, removed the old rrdcached variables in librenms.env and added the new one, rebuilt the container and the random red block on graphs are still there :(

Disabling rrdcached by commenting out the RRDCACHED_SERVER=rrdcached:42217 line in librenms.env seems still the only way to get rid of them.

As someone who only monitors his home network with <20 devices do I really need rrdcached? Perhaps a placebo effect, but I feel the whole thing is snappier without rrdcached.

EDIT: Seems I spoke too soon. The red blocks returned after some time in 1.69 even with the changes outlined above and rrdcached container commented out from the docker-compose.yml. Fixed by pulling librenms/librenms:1.68 while keeping the rrdcached disabled. Wouldn't that indicate the issue is elsewhere than in rrdcached though?
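The workaround described above (commenting out the RRDCACHED_SERVER line in librenms.env) can be scripted. A minimal Python sketch, assuming a plain KEY=VALUE env file; the helper name `disable_env_var` and the `OTHER_SETTING` line are made up for illustration and are not part of the image:

```python
def disable_env_var(env_text: str, name: str) -> str:
    """Comment out every active `NAME=...` line in env-file text."""
    out = []
    for line in env_text.splitlines():
        # Lines that are already commented start with '#', so they never match.
        if line.lstrip().startswith(name + "="):
            out.append("# " + line)
        else:
            out.append(line)
    # Preserve the trailing newline if the input had one.
    return "\n".join(out) + ("\n" if env_text.endswith("\n") else "")


# OTHER_SETTING is a hypothetical placeholder; the second line is from this thread.
sample = "OTHER_SETTING=1\nRRDCACHED_SERVER=rrdcached:42217\n"
print(disable_env_var(sample, "RRDCACHED_SERVER"))
```

The container presumably still has to be recreated afterwards for the change to be picked up, as was done above.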

murrant commented 3 years ago

Are you guys absolutely sure you aren't exhausting the php-fpm workers?

CameronMunroe commented 3 years ago

> I just tried the latest image, removed the old rrdcached variables in librenms.env and added the new one, rebuilt the container and the random red block on graphs are still there :(
>
> Disabling rrdcached by commenting out the RRDCACHED_SERVER=rrdcached:42217 line in librenms.env seems still the only way to get rid of them.
>
> As someone who only monitors his home network with <20 devices do I really need rrdcached? Perhaps a placebo effect, but I feel the whole thing is snappier without rrdcached.
>
> EDIT: Seems I spoke too soon. The red blocks returned after some time in 1.69 even with the changes outlined above and rrdcached container commented out from the docker-compose.yml. Fixed by pulling librenms/librenms:1.68 while keeping the rrdcached disabled. Wouldn't that indicate the issue is elsewhere than in rrdcached though?

Check inside of the librenms container with docker exec -it librenms bash and look inside of cd /opt/librenms/config.d for rrdcached.php. The second I deleted that file, graphs work perfectly.

setiseta commented 3 years ago

There is an error in commit 84083b8, so the rrdcached.php config file will never be empty. See the comment there.

crazy-max commented 3 years ago

@setiseta @CameronMunroe @hugalafutro

> I can only get rid of them if I disable rrdcached in the docker setup in librenms.env

> Check inside of the librenms container with docker exec -it librenms bash and look inside of cd /opt/librenms/config.d for rrdcached.php. The second I deleted that file, graphs work perfectly.

> Disabling rrdcached in the .env file worked here too.

Removing the RRDCACHED_SERVER env var from librenms.env solved the issue for me as well. Any idea what could disrupt graphs in LibreNMS PHP code @murrant if a remote rrdcached server is enabled?

crazy-max commented 3 years ago

Ok it looks like we don't need the rrdcached service anymore. rrd data are populated directly with LibreNMS but I wonder why @murrant?

setiseta commented 3 years ago

@crazy-max it was never required, just for better performance on bigger environments

crazy-max commented 3 years ago

@setiseta Ok, thanks for the info. So it looks like an issue with how the LibreNMS implementation handles a remote rrdcached server.

CameronMunroe commented 3 years ago

@crazy-max It was put in place because of performance.

The question is whether you are setting the BASE_OPTIONS variable, as per the documentation. When I looked at your docker image for rrdcached I didn't see it set.

BASE_OPTIONS="-B -F -R"

https://docs.librenms.org/Extensions/RRDCached/#rrdcached-installation-ubuntu-16

crazy-max commented 3 years ago

@CameronMunroe Yes, see https://github.com/crazy-max/docker-rrdcached/blob/6594a91d480c702daff56dc2b781a23ee5099c5c/rootfs/etc/cont-init.d/04-svc-main.sh#L18-L34

murrant commented 3 years ago

@crazy-max you only have one rrdcached instance total correct? (rrdcached does not play nicely with multiple instances)

crazy-max commented 3 years ago

@murrant Yes only one through this container.

murrant commented 3 years ago

@crazy-max took a look at the config

  1. Why is it not listening on a network socket?
  2. Is JITTER supposed to be delay? It should have -z before it in the command.

crazy-max commented 3 years ago

@murrant

> Why is it not listening on a network socket?

Because it can be used on a cluster so we need a network address. The rrdcached service exposes port 42217 and is named rrdcached in our stack so we define the rrdcached server with rrdcached:42217 in the config file.

> Is JITTER supposed to be delay? It should have -z before it in the command.

Yes defined here with -z flag if filled.

murrant commented 3 years ago

@crazy-max I see the jitter now, but rrdcached is not listening on 42217 -l /var/run/rrdcached/rrdcached.sock https://github.com/crazy-max/docker-rrdcached/blob/6594a91d480c702daff56dc2b781a23ee5099c5c/rootfs/etc/cont-init.d/04-svc-main.sh#L24
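One way to check from another container whether rrdcached answers on TCP port 42217 is to send it the STATS command of the rrdcached text protocol, in which each reply starts with a status line whose leading number is negative on error and otherwise gives the count of lines that follow. A minimal Python sketch; the function names are illustrative, and the host name `rrdcached` in the usage comment is assumed to resolve inside the compose network:

```python
import socket
from typing import List, Tuple


def parse_status_line(line: str) -> Tuple[int, str]:
    """Split an rrdcached status line into (code, message)."""
    code, _, message = line.strip().partition(" ")
    return int(code), message


def rrdcached_stats(host: str, port: int, timeout: float = 3.0) -> List[str]:
    """Send STATS to rrdcached over TCP and return the lines that follow."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"STATS\n")
        reader = sock.makefile("r", encoding="ascii", newline="\n")
        first = reader.readline()
        if not first:
            raise ConnectionError("connection closed before any reply")
        code, message = parse_status_line(first)
        if code < 0:
            raise RuntimeError("rrdcached error: " + message)
        # A positive code is the number of lines that follow the status line.
        return [reader.readline().rstrip("\n") for _ in range(code)]


# Usage inside the stack (not executed here):
#   for line in rrdcached_stats("rrdcached", 42217):
#       print(line)
```

A refused connection or a timeout would point at the listener; a clean STATS reply would suggest the cause of the failing graphs lies elsewhere.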

hugalafutro commented 3 years ago

Haven't really had time to play with this lately, but finally found some yesterday, spun up new container without rrdcached and all seems well! (even with 1min polling interval).

[screenshot: nms-graphs-ok]

For myself this issue could be closed, but I'll leave that up to you as the issue presumably still exists while using rrdcached.

Random6554 commented 3 years ago

Hi all,

I now have the same issue, with graphs intermittently failing to load and displaying "Error Drawing Graph". I can sit there hitting F5 and watch as different graphs load. << This context is to ensure we are talking about the same issue.

I am testing on a fresh docker-compose build. Here is my testing process:

  1. Build/run the stack using docker-compose
  2. Add devices to the stack and wait for some RRD files to be generated, confirmed by the logs
  3. Hover over the graphs: do they load? Do they load on refresh?

For each test I spin up a new MariaDB to keep testing consistent.

I can confirm graphs load correctly in 1.68, 1.69 and 1.70.1 but regress in 21.1.0, where they load only intermittently.

It's worth noting the red boxes are back in 1.70.1, but the main graphs do load correctly.

@crazy-max , @murrant any ideas why this issue may have resurfaced?

setiseta commented 3 years ago

Yes, it's the same. Since the last update the red boxes have changed to the text "Error Drawing Graph". It seems something is wrong with the rrdcached connection. At first I thought it was the TCP socket of rrdcached, because I previously used a unix socket in my own container and there it worked. But even when I shared the unix socket between the containers, the red boxes appeared randomly. So in my opinion it is not the librenms base and not a docker concern as such, since it runs fine in my own container (but with rrdcached in the same container as the web). So it seems to be this docker setup with the separated containers, but I've not found any hints on where to keep searching. Maybe @crazy-max or @murrant can give some hints on how / where to search to resolve this.

On my installation it did not work with the older versions either: if rrdcached was configured, the red boxes appeared.

murrant commented 3 years ago

If you see error drawing graph, click on graph. Then click show command. It will show the error.

setiseta commented 3 years ago

[screenshot]

I can't see the error, am I on the wrong page?

setiseta commented 3 years ago

I think it's only a random single request which fails; if I refresh, the graph is ok. But on the dashboard with a lot of graphs there is often one graph that is not ok.

And I think the rrdtool command and the output there are processed in a separate request, or is it the same one?

Random6554 commented 3 years ago

I feel like the rrd command is working and returning the data and there is something we are missing here with the way the data is being presented.

To be clear my RRD command does not fail, it returns the data I expect but the graph is displaying "Error Drawing Graph" like @setiseta mentions above.

I'll keep digging as I have this issue in production now because of the CVE patch in 21.1.0.

For awareness I have this issue with a libre stack running RRD 1.5.5 on a dedicated remote host and with my local docker dev container setup using RRD 1.7.2.

murrant commented 3 years ago

Perhaps the temporary file is missing? (note that it is deleted immediately after being served)

Random6554 commented 3 years ago

Yeah I started looking at that last night, I think it may be a resurfacing of https://github.com/librenms/docker/issues/51 which is also mentioned here - https://community.librenms.org/t/rrdtool-race-condition/12061

As in the forum post above, I am unable to find the PR that actually fixed (or worked around) the rrd race condition issue, and I wonder if https://github.com/librenms/librenms/pull/11865 has had a negative impact on the race condition, which would only affect remote RRD installs as per the previous diagnosis by @dennypage.

hugalafutro commented 3 years ago

I presume all you new people with this issue are using rrdcached? Because ever since I turned it off my install has been displaying all graphs with no errors whatsoever since https://github.com/librenms/docker/issues/124#issuecomment-730311049 and I can refresh any page with any number of graphs and they never load as red blocks.

crazy-max commented 3 years ago

@hugalafutro Yes this issue is specifically about the rrdcached sidecar container. The main example does not use the rrdcached sidecar container.

setiseta commented 3 years ago

@hugalafutro Since rrdcached was the default, I think everyone is using it. The default changed about 2 months ago to not include rrdcached.

hugalafutro commented 3 years ago

@setiseta @crazy-max I see, that explains it. I just got spooked that the issue had returned because of all the replies. I'll keep an eye on the thread nonetheless so I can turn rrdcached back on when it gets resolved. Although I only monitor ~15 devices, I'd like the install running "as nature intended" without disabling sidecars.

crazy-max commented 3 years ago

@hugalafutro Sure, I will try to find some time to fix the RRDCached image based on @murrant comment https://github.com/librenms/docker/issues/124#issuecomment-730084131

murrant commented 3 years ago

I highly doubt https://github.com/librenms/librenms/pull/11865 has an impact. Basically it fixed the images so they aren't all red and you can actually read the text that was red on a red background before.

crazy-max commented 3 years ago

@murrant

> I see the jitter now, but rrdcached is not listening on 42217 -l /var/run/rrdcached/rrdcached.sock https://github.com/crazy-max/docker-rrdcached/blob/6594a91d480c702daff56dc2b781a23ee5099c5c/rootfs/etc/cont-init.d/04-svc-main.sh#L24

I ran some tests and the RRDCached daemon is actually listening on port 42217 through the -L flag, so I don't think that's the issue here.

$ docker-compose exec rrdcached netstat -aptn
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:42217           0.0.0.0:*               LISTEN      -
...

Random6554 commented 3 years ago

I have some time this week to look into this one again.

Any suggestions what may have changed after 1.66 that may have introduced this bug? Happy to try and track down any hunches.

Currently testing: tag 1.66 works as expected, while in 21.3.0 graphs intermittently fail with "Error Drawing Graph". The RRD data appears to be returned, but the graphs still display the error.

Random6554 commented 3 years ago

Okay, today I started printing the output of the rrdtool_graph command when there is a "bad graph drawing" in includes/html/graphs/graph.inc.php. On "bad" requests the value returned from the RRD command does not match that of a successful request.

Good Output: 1593x344 OK u:0.06 s:0.02 r:0.09

Bad Output: 1617799200 OK u:0.00 s:0.00 r:0.00

If I manually take the same rrdtool graph command and run it I get a similar output to the "Good Output" 1593x344.
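The difference between the two outputs above can be checked mechanically: the good reply starts with WIDTHxHEIGHT, the bad one with a bare number. A minimal Python sketch, assuming only the two sample strings from this comment; the name `graph_size` is made up for illustration. The leading number of the bad reply would also be a plausible Unix timestamp, which might suggest the reply belongs to a different command than the graph request, but that reading is a guess and not something confirmed in this thread.

```python
import re
from datetime import datetime, timezone

# A successful graph reply starts with "<width>x<height> OK".
GRAPH_OK = re.compile(r"^(\d+)x(\d+)\s+OK\b")


def graph_size(output: str):
    """Return (width, height) if the output starts with 'WxH OK', else None."""
    match = GRAPH_OK.match(output.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


good = "1593x344 OK u:0.06 s:0.02 r:0.09"
bad = "1617799200 OK u:0.00 s:0.00 r:0.00"

print(graph_size(good))  # (1593, 344)
print(graph_size(bad))   # None

# The leading number of the bad reply, read as seconds since the Unix epoch.
first_token = int(bad.split()[0])
print(datetime.fromtimestamp(first_token, tz=timezone.utc).isoformat())
```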

librenms-bot commented 3 years ago

This issue has been mentioned on LibreNMS Community. There might be relevant details there:

https://community.librenms.org/t/docker-stats-application-not-drawing-graphs/15329/7

crazy-max commented 3 years ago

Should be fixed with librenms/librenms#12746