netdata / netdata

Architected for speed. Automated for easy. Monitoring and troubleshooting, transformed!
https://www.netdata.cloud
GNU General Public License v3.0

Having trouble with headless collectors #5364

Closed: eric-b-hymowitz closed this issue 5 years ago

eric-b-hymowitz commented 5 years ago
Question summary

My headless slave node is not appearing on my master netdata dashboard menu.

OS / Environment

RHEL

I've just installed netdata for testing as a replacement (or augment) for nagios. I have it installed on one machine and it's great.

However, I'm trying to install netdata on a second machine ("cougar"), with the intent of using the first machine ("rolls-royce") as my sole dashboard/viewing host.

I believe I have followed the directions correctly from https://docs.netdata.cloud/streaming/ for setting up a "headless collector", where "cougar" is my "slave" instance and "rolls-royce" is my "master" instance.

I also figured out that I need to have my own "registry" because there is no direct Internet access from either host.

cougar netdata.conf

[global]
    memory mode = none

[web]
    mode = none

[registry]
    # enabled = no
    registry to announce = http://rolls-royce:19999

cougar stream.conf

[stream]
    enabled = yes
    destination = rolls-royce:19999
    api key = 9447dae1-0830-4edd-9e70-1cd125844b65
    timeout seconds = 60
    default port = 19999

rolls-royce netdata.conf

[registry]
    enabled = yes
    registry to announce = http://rolls-royce:19999

rolls-royce stream.conf

[stream]
    enabled = no

[9447dae1-0830-4edd-9e70-1cd125844b65]
    enabled = yes
    allow from = *
    default history = 3600
    default memory mode = save
    health enabled by default = auto
    multiple connections = allow
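
The two stream.conf files only pair up if the `api key` on cougar names a section on rolls-royce that has `enabled = yes`. A minimal Python sketch of that cross-check follows. The function name `api_key_accepted` is hypothetical, and the sketch uses the standard `configparser` module rather than netdata's own parser, so it only approximates how netdata reads these files.

```python
import configparser


def _parse(conf_text: str) -> configparser.ConfigParser:
    # Strip indentation so configparser never mistakes an indented
    # option for a continuation line of the previous value.
    flat = "\n".join(line.strip() for line in conf_text.splitlines())
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(flat)
    return parser


def api_key_accepted(slave_conf: str, master_conf: str) -> bool:
    """Return True if the slave streams with a key the master has enabled."""
    slave = _parse(slave_conf)
    master = _parse(master_conf)
    # The slave must have streaming switched on at all.
    if not slave.getboolean("stream", "enabled", fallback=False):
        return False
    key = slave.get("stream", "api key", fallback="")
    # The master must carry a section named after the key, and enable it.
    return master.has_section(key) and master.getboolean(
        key, "enabled", fallback=False
    )
```

Passing the text of cougar's stream.conf and rolls-royce's stream.conf as the two arguments should return `True` for the configuration shown here.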

And I think I see data being collected in the logs, and cache files being created.

However, I cannot figure out how to view my "cougar" data from my "rolls-royce" dashboard.

The documentation refers to a "my-netdata" menu. I don't have a "my-netdata" menu. I have a menu entitled "rolls-royce", with only a single entry for "rolls-royce http://rolls-royce:19999/" but no entry for "cougar".

Can anybody help me figure out what I am missing?

cakrit commented 5 years ago

Hi @eric-b-hymowitz,

Your configuration looks fine. First, please check the cougar netdata error.log for any errors related to streaming. The main question is whether cougar can actually establish an HTTP connection to rolls-royce, which you can test with a simple curl request (you could just request the URL http://rolls-royce:19999/), or try to bring up the rolls-royce web UI itself from within cougar. The problem could be in hostname resolution, routing or a firewall.

If the connection is established properly and you get an access denied error, then you will want to look at the master's access lists. Specifically, you will want to check the following two settings:
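
If curl is not available on cougar, the same reachability test can be sketched in Python. This is a minimal example that only checks whether a TCP connection can be opened; the function name `can_connect` is made up for illustration, and a successful connection does not prove that netdata will accept the stream.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the hostname and connects, so it
        # exercises name resolution, routing and firewall rules at once.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts.
        return False
```

Running `can_connect("rolls-royce", 19999)` on cougar should return `True` when the master's port is reachable.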

[web]
        # allow connections from = ...
        # allow streaming from = ...

eric-b-hymowitz commented 5 years ago

Thanks for following up.

I can definitely connect on the port. My cougar error.log seems normal:

2019-02-11 19:11:28: netdata INFO  : MAIN : Host 'cougar' (at registry as 'cougar') with guid '8365354a-2e10-11e9-ab26-000c29120d5e' initialized, os 'linux', timezone 'GMT', tags '', program_name 'netdata', program_version 'v1.12.0-17-g30f7324', update every 1, memory mode none, history entries 3996, streaming enabled (to 'rolls-royce:19999' with api key '9447dae1-0830-4edd-9e70-1cd125844b65'), health disabled, cache_dir '/opt/netdata/var/cache/netdata', varlib_dir '/opt/netdata/var/lib/netdata', health_log '/opt/netdata/var/lib/netdata/health/health-log.db', alarms default handler '/opt/netdata/usr/libexec/netdata/plugins.d/alarm-notify.sh', alarms default recipient 'root'
2019-02-11 19:11:29: netdata ERROR : PLUGIN[tc] : STREAM cougar [send]: not ready - discarding collected metrics.
2019-02-11 19:11:29: netdata INFO  : STREAM_SENDER[cougar] : thread created with task id 7445
2019-02-11 19:11:29: netdata INFO  : STREAM_SENDER[cougar] : STREAM cougar [send]: thread created (task id 7445)
2019-02-11 19:11:34: netdata INFO  : STREAM_SENDER[cougar] : STREAM cougar [send to rolls-royce:19999]: connecting...
2019-02-11 19:11:34: netdata INFO  : STREAM_SENDER[cougar] : STREAM cougar [send to rolls-royce:19999]: initializing communication...
2019-02-11 19:11:34: netdata INFO  : STREAM_SENDER[cougar] : STREAM cougar [send to rolls-royce:19999]: waiting response from remote netdata...
2019-02-11 19:11:34: netdata INFO  : STREAM_SENDER[cougar] : STREAM cougar [send to rolls-royce:19999]: established communication - ready to send metrics...
2019-02-11 19:11:34: netdata INFO  : PLUGIN[cgroups] : STREAM cougar [send]: sending metrics...
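
ERROR entries are easy to miss among the INFO lines in a log like this. A rough Python sketch for pulling streaming-related errors out of error.log follows, assuming the line layout seen in this excerpt (timestamp, `netdata`, level, thread, message). The regular expression and the function name `stream_errors` are illustrative and not part of netdata.

```python
import re

# Layout assumed from the excerpt above:
#   <date> <time>: netdata <LEVEL> : <THREAD> : <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): netdata "
    r"(?P<level>\w+)\s*: (?P<thread>.+?) : (?P<msg>.*)$"
)


def stream_errors(lines):
    """Return (timestamp, thread, message) for ERROR lines mentioning STREAM."""
    found = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and m.group("level") == "ERROR" and "STREAM" in m.group("msg"):
            found.append((m.group("ts"), m.group("thread"), m.group("msg")))
    return found
```

Calling `stream_errors(open(path))` with the path to error.log would list only the matching entries; the path depends on the installation.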

My rolls-royce error.log also seems normal:

2019-02-11 19:04:59: netdata INFO  : WEB_SERVER[static5] : clients wants to STREAM metrics.
2019-02-11 19:04:59: netdata INFO  : STREAM_RECEIVER[cougar,[localhost]:37282] : thread created with task id 3797
2019-02-11 19:04:59: netdata INFO  : STREAM_RECEIVER[cougar,[localhost]:37282] : STREAM cougar [localhost]:37282: receive thread created (task id 3797)
2019-02-11 19:04:59: netdata INFO  : STREAM_RECEIVER[cougar,[localhost]:37282] : Host 'cougar' (at registry as 'cougar') with guid '8365354a-2e10-11e9-ab26-000c29120d5e' initialized, os 'linux', timezone 'GMT', tags '', program_name 'netdata', program_version 'v1.12.0-17-g30f7324', update every 1, memory mode save, history entries 3996, streaming disabled (to '' with api key ''), health enabled, cache_dir '/opt/netdata/var/cache/netdata/8365354a-2e10-11e9-ab26-000c29120d5e', varlib_dir '/opt/netdata/var/lib/netdata/8365354a-2e10-11e9-ab26-000c29120d5e', health_log '/opt/netdata/var/lib/netdata/8365354a-2e10-11e9-ab26-000c29120d5e/health/health-log.db', alarms default handler '/opt/netdata/usr/libexec/netdata/plugins.d/alarm-notify.sh', alarms default recipient 'root'
2019-02-11 19:04:59: netdata INFO  : STREAM_RECEIVER[cougar,[localhost]:37282] : STREAM cougar [receive from [localhost]:37282]: initializing communication...
2019-02-11 19:04:59: netdata INFO  : STREAM_RECEIVER[cougar,[localhost]:37282] : Postponing health checks for 60 seconds, on host 'cougar', because it was just connected.
2019-02-11 19:04:59: netdata INFO  : STREAM_RECEIVER[cougar,[localhost]:37282] : STREAM cougar [receive from [localhost]:37282]: receiving metrics...

The master's [web] section is entirely commented-out defaults, such as:

[web]
    # default port = 19999
    # bind to = *
    # allow connections from = localhost *
    # allow streaming from = *

This might mean something. I just discovered that I should be able to view http://rolls-royce:19999/host/cougar

When I go to that page, I see unformatted text data -- as if the JS and/or CSS isn't running correctly. I don't know if that's related or not. I can't seem to post a screen-shot ("Something went really wrong, and we can't process that file") but I'll keep trying.

--EbH

cakrit commented 5 years ago

On the right hand side that shows all the chart categories, do you see a link for 'cougar'? See the following screenshot:

[Screenshot: screenshot from 2019-02-12 14-10-07]

The menu shows my localhost.localdomain under 'databases streamed to this host'. This is how the VM qemu fedora29 you see on the right side is accessible from my master. I replicated your configuration precisely, down to the use of the local registry and the headless part. So you shouldn't have to enter the URL manually at all. Can you do a Ctrl-Shift-R and show me a similar screenshot from rolls-royce? In the meantime, I'll look into how that link is added.

cakrit commented 5 years ago

http://rolls-royce:19999/api/v1/info should show cougar under mirrored_hosts. Can you paste that section here?

eric-b-hymowitz commented 5 years ago

I do not have a link on the right side for "cougar" that matches your "qemu fedora29".

I also do not have a "databases streamed to this host" section under my menu, just the "My nodes" section.

[Bad screenshot removed]

http://rolls-royce:19999/api/v1/info tells me this:

{
    "version": "v1.12.0-17-g30f7324",
    "uid": "c974dfc0-2e08-11e9-b6c1-001018afde44",
    "mirrored_hosts": [
        "rolls-royce",
        "cougar"
    ],
    "alarms": {
        "normal": 180,
        "warning": 0,
        "critical": 0
    }
}
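
The same check can be done programmatically. Below is a minimal Python sketch that takes the body of an `/api/v1/info` response, in the shape pasted above, and reports whether a given host is listed under `mirrored_hosts`. The function name `is_mirrored` is hypothetical.

```python
import json


def is_mirrored(info_json: str, hostname: str) -> bool:
    """Return True if hostname appears in the response's mirrored_hosts list."""
    info = json.loads(info_json)
    # Treat a missing key as an empty list instead of raising KeyError.
    return hostname in info.get("mirrored_hosts", [])
```

For the response above, `is_mirrored(body, "cougar")` returns `True`, which matches the conclusion that the master is receiving cougar's stream.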

eric-b-hymowitz commented 5 years ago

It didn't occur to me to ask this, but can I safely assume that all data flow is over TCP port 19999 from cougar to rolls-royce? My environment is pretty heavily loaded with firewalls and access restrictions, so if there is behind-the-scenes UDP activity or something else I don't expect, that could be causing my problems.

cakrit commented 5 years ago

Ok, the issue has nothing to do with streaming. There's a problem with the UI here. I don't see any charts for rolls-royce, and your page should be full of them. Do you have any strange security settings in your browser for JavaScript? Try using the Chrome console to see what's wrong.

eric-b-hymowitz commented 5 years ago

That was my fault. I took a bad screenshot. Here is a new one. The graphs are fine.

[Screenshot: netdata3]

cakrit commented 5 years ago

I detected a bug that affects what you see in the menu. It happened to me too. It does NOT affect the right-hand side though, and I still don't get why cougar doesn't appear there. I'll do some more digging.

cakrit commented 5 years ago

Going a bit blindly here, based on past issues. Can you do a cat /var/lib/netdata/registry/netdata.public.unique.id on both machines to ensure that you have different machine guids? Will look for more ideas.

cakrit commented 5 years ago

Also, can you do this on the master so we can ensure that the metrics are being collected? ls -l /var/cache/netdata/ | grep cougar

eric-b-hymowitz commented 5 years ago

cougar /var/lib/netdata/registry/netdata.public.unique.id 8365354a-2e10-11e9-ab26-000c29120d5e

rolls-royce /var/lib/netdata/registry/netdata.public.unique.id c974dfc0-2e08-11e9-b6c1-001018afde44

/var/cache/netdata

Nothing with the name "cougar". However, I do have cougar's uuid

/var/cache/netdata/8365354a-2e10-11e9-ab26-000c29120d5e

which is filled with subdirectories such as

cpu.cpu0
disk.dm_0
ipv4.packets
netdata.requests
system.cpu
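
As the listing shows, the master's cache directory for cougar is named after cougar's machine GUID, not its hostname, so grepping for "cougar" finds nothing. A small Python sketch that lists GUID-named subdirectories of a cache directory follows. The function name `guid_dirs` is made up, and the cache path varies by installation (the logs above show /opt/netdata/var/cache/netdata).

```python
import os
import uuid


def guid_dirs(cache_dir: str) -> list:
    """Return the names of subdirectories that look like machine GUIDs."""
    names = []
    for entry in os.scandir(cache_dir):
        if not entry.is_dir():
            continue
        try:
            canonical = str(uuid.UUID(entry.name))
        except ValueError:
            # Not a UUID at all, e.g. a chart directory such as "system.cpu".
            continue
        # Accept only the hyphenated form used in the paths above.
        if canonical == entry.name.lower():
            names.append(entry.name)
    return sorted(names)
```

Each name returned can then be compared with the contents of netdata.public.unique.id on the streaming hosts.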

(By the way, thanks again for all of your help. I appreciate it.)

cakrit commented 5 years ago

Ok, so I went through the code, and my setup is not identical, since my slave is on the same host (just a VM). On rolls-royce, please apply the change shown in https://github.com/netdata/netdata/pull/5371/files; you will probably find main.js under /usr/share/netdata/web/. Then do a hard refresh on the rolls-royce UI. This should show cougar on your menu, under 'Databases streamed to this agent'. After you see it and click it, let's see if you still have an issue.

eric-b-hymowitz commented 5 years ago

That solved it. Now I have the "Databases streamed to this agent" section on the menu, with both nodes listed. Thank you very much.

tctovsli commented 5 years ago

I had a similar issue, even with the patch mentioned in pull request 5371. I solved it by enabling the registry, even though the documentation at https://docs.netdata.cloud/streaming/ never mentions this. Is this necessary, or should the hosts appear automatically?

cakrit commented 5 years ago

Please open a new issue, because it obviously can't be the same root cause. In that issue, provide your master and slave configuration just as in the OP here, and ensure that you have connectivity and are not receiving errors in the logs. Using the global registry is not required, but you do need to have a registry, even if it's the master that serves it.