sensu / sensu-go

Simple. Scalable. Multi-cloud monitoring.
https://sensu.io
MIT License

Trying to understand adding new cluster member #1890

Closed treydock closed 4 years ago

treydock commented 6 years ago

This may be premature, given how new the sensuctl changes supporting adding and removing members are. I created a two-member cluster using backend.yml, starting everything from scratch.

backend1 bits:

listen-client-urls: "http://0.0.0.0:2379"
listen-peer-urls: "http://0.0.0.0:2380"
initial-cluster: "backend1=http://192.168.52.10:2380,backend2=http://192.168.52.11:2380"
initial-advertise-peer-urls: "http://192.168.52.10:2380"
initial-cluster-state: "new"
#initial-cluster-token: ""
name: "backend1"

backend2 bits:

listen-client-urls: "http://0.0.0.0:2379"
listen-peer-urls: "http://0.0.0.0:2380"
initial-cluster: "backend1=http://192.168.52.10:2380,backend2=http://192.168.52.11:2380"
initial-advertise-peer-urls: "http://192.168.52.11:2380"
initial-cluster-state: "new"
#initial-cluster-token: ""
name: "backend2"

Member list looks good:

[root@sensu-backend ~]# sensuctl cluster member-list
         ID            Name             Peer URLs               Client URLs
 ────────────────── ────────── ─────────────────────────── ─────────────────────
  c7723b754183fa17   backend1   http://192.168.52.10:2380   http://0.0.0.0:2379
  e9aba17aa9930cc0   backend2   http://192.168.52.11:2380   http://0.0.0.0:2379

Where I'm running into trouble is adding a 3rd member to the cluster.

I tried two approaches in backend.yml.

Approach one - match existing cluster:

listen-client-urls: "http://0.0.0.0:2379"
listen-peer-urls: "http://0.0.0.0:2380"
initial-cluster: "backend1=http://192.168.52.10:2380,backend2=http://192.168.52.11:2380,backend3=http://192.168.52.12:2380"
initial-advertise-peer-urls: "http://192.168.52.12:2380"
initial-cluster-state: "new"
#initial-cluster-token: ""
name: "backend3"

Approach two - no initial settings:

listen-client-urls: "http://0.0.0.0:2379"
listen-peer-urls: "http://0.0.0.0:2380"
#initial-cluster: "backend1=http://192.168.52.10:2380,backend2=http://192.168.52.11:2380,backend3=http://192.168.52.12:2380"
#initial-advertise-peer-urls: "http://192.168.52.12:2380"
#initial-cluster-state: "new"
#initial-cluster-token: ""
#name: "backend3"

In both cases I added the member:

[root@sensu-backend ~]# sensuctl cluster member-add backend3 http://192.168.52.12:2380
added member 912d8b61de69d3f to cluster

ETCD_NAME="backend3"
ETCD_INITIAL_CLUSTER="backend3=http://192.168.52.12:2380,backend1=http://192.168.52.10:2380,backend2=http://192.168.52.11:2380"
ETCD_INITIAL_CLUSTER_STATE="existing"
[root@sensu-backend ~]# sensuctl cluster member-list
         ID            Name             Peer URLs               Client URLs
 ────────────────── ────────── ─────────────────────────── ─────────────────────
  912d8b61de69d3f               http://192.168.52.12:2380
  c7723b754183fa17   backend1   http://192.168.52.10:2380   http://0.0.0.0:2379
  e9aba17aa9930cc0   backend2   http://192.168.52.11:2380   http://0.0.0.0:2379

What happens is that the logs on backend3 contain this:

Aug  1 00:05:44 localhost sensu-backend: {"component":"etcd","level":"error","msg":"request cluster ID mismatch (got a1b0272bc1fc0b49 want 3b0efc7b379f89be)","pkg":"rafthttp","time}

I also notice that the member-list output seems incomplete, lacking name and client URLs for the new member.

I'm unable to identify with sensuctl where those mismatched values come from, likely because the cluster ID is not reported correctly with --format json, which is covered in #1887.

I'm working with @ghoneycutt to automate this with Puppet, and I wanted to evaluate supporting member add and remove in Puppet code.


echlebek commented 6 years ago

After the cluster has booted, new members need to be added with initial-cluster-state: existing. You must apply the configuration printed by the sensuctl cluster member-add tool before launching the new member. This is crucial, as the member ID is computed by hashing the member configuration.
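Concretely, a sketch of what the new member's backend.yml might look like, reusing the peer URLs from the configs above for illustration (substitute the exact values printed by sensuctl cluster member-add for your cluster):

```yaml
# Sketch only: copy the initial-cluster and initial-cluster-state values
# printed by `sensuctl cluster member-add` before the first start.
listen-client-urls: "http://0.0.0.0:2379"
listen-peer-urls: "http://0.0.0.0:2380"
initial-cluster: "backend3=http://192.168.52.12:2380,backend1=http://192.168.52.10:2380,backend2=http://192.168.52.11:2380"
initial-advertise-peer-urls: "http://192.168.52.12:2380"
initial-cluster-state: "existing"  # not "new" -- the cluster already exists
name: "backend3"
```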

We haven't released documentation on this yet, but it is similar to adding a new member to an etcd cluster.

There are some etcd docs you might find useful here: https://coreos.com/etcd/docs/latest/etcd-live-cluster-reconfiguration.html

treydock commented 6 years ago

@echlebek Maybe I'm doing something wrong, but I'm now attempting to add a backend to a single-node cluster, using the Docker images that are used to test the Puppet code.

I configure backend1:

---
listen-client-urls: http://0.0.0.0:2379
listen-peer-urls: http://0.0.0.0:2380
initial-cluster: backend1=http://172.17.0.2:2380
initial-advertise-peer-urls: http://172.17.0.2:2380
initial-cluster-state: new
name: backend1
[root@sensu_backend1 /]# sensuctl cluster member-list
         ID            Name           Peer URLs              Client URLs
 ────────────────── ────────── ──────────────────────── ─────────────────────
  8e7d2048a91042b2   backend1   http://172.17.0.2:2380   http://0.0.0.0:2379

I then add the new member before starting sensu-backend on it (using Puppet; the command used is shown in the output):

Debug: Executing: '/usr/bin/sensuctl cluster member-add backend2 http://172.17.0.3:2380'
Info: Cluster member-add backend2: added member 1708047987ff93 to cluster

Info: Cluster member-add backend2:

Info: Cluster member-add backend2: ETCD_NAME="backend2"

Info: Cluster member-add backend2: ETCD_INITIAL_CLUSTER="backend2=http://172.17.0.3:2380,backend1=http://172.17.0.2:2380"

Info: Cluster member-add backend2: ETCD_INITIAL_CLUSTER_STATE="existing"

Notice: /Stage[main]/Main/Sensu_cluster_member[backend2]/ensure: created

Now sensuctl commands hang and have to be timed out:

[root@sensu_backend1 /]# timeout 30 sensuctl cluster member-list
[root@sensu_backend1 /]# echo $?
124

It's not until I configure and start sensu-backend on the new member that sensuctl commands stop hanging:

[root@sensu_backend1 /]# timeout 30 sensuctl cluster member-list
         ID            Name           Peer URLs              Client URLs
 ────────────────── ────────── ──────────────────────── ─────────────────────
  1708047987ff93     backend2   http://172.17.0.3:2380   http://0.0.0.0:2379
  8e7d2048a91042b2   backend1   http://172.17.0.2:2380   http://0.0.0.0:2379
echlebek commented 6 years ago

Interesting, I haven't observed that behaviour with member-list before. However, my tests have been with a three-node cluster to start with, and then adding and removing members.

My previous tests with member-add have been adding a member to a cluster with two members, bringing it to three. In those cases, you could execute member-list before starting the new member without issue. The ID would simply not show for the member that wasn't started yet.

I'll try to reproduce your bug when I get back to work on Tuesday. Thanks for reporting this!

treydock commented 6 years ago

If etcd bases cluster availability and health on quorum, then my guess is that adding a new member to a single-node cluster puts the system in a state where quorum is lost. I'm unable to reproduce the problem using a setup similar to the one you described, going from a 2-node to a 3-node cluster.
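For reference, etcd serves requests only while a majority of the *configured* member set is up (quorum = floor(n/2) + 1), and member-add grows the configured set immediately. A quick sketch of the arithmetic, nothing beyond integer division:

```shell
# Quorum is a majority of configured members: n/2 + 1 (integer division).
# Growing a 1-member cluster to 2 raises quorum to 2, so until the new
# member actually starts, no requests can be served and sensuctl hangs.
for n in 1 2 3; do
  echo "members=$n quorum=$(( n / 2 + 1 ))"
done
# prints:
# members=1 quorum=1
# members=2 quorum=2
# members=3 quorum=2
```

This also suggests why going from 2 to 3 members behaves better: a 3-member configuration still has quorum with only 2 members running.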

I have noticed that establishing a 2-node cluster one node at a time causes some problems. I bring up backend1, then backend2, and the issues occur while only backend1 is running. First, I notice that curl http://127.0.0.1:8080/info fails; I use that request in Puppet to verify sensu-backend is fully booted:

sensu_backend1 12:43:43$ curl http://127.0.0.1:8080/info
    % Total    % Received % Xferd  Average Speed   Time      Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    0     0    0     0    0     0      0        0 --:--:-- --:--:-- --:--:--     0curl: (7) Failed connect to 127.0.0.1:8080; Connection refused

The sensu-backend service is running when curl is attempted. Once sensu-backend has started on backend2, the curl works:

sensu_backend1 12:44:24$ curl http://127.0.0.1:8080/info
{"agentd":true,"apid":true,"dashboardd":true,"eventd":true,"keepalived":true,"message_bus":true,"pipelined":true,"schedulerd":true,"store":true}
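For automation, rather than a single curl that can race the backend's startup, a retry loop is more robust. A minimal sketch (the endpoint URL and attempt count are assumptions; adjust per deployment):

```shell
# Poll a URL until it answers, up to a given number of attempts.
wait_for_backend() {
  url="$1"; attempts="$2"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    curl -fsS "$url" >/dev/null 2>&1 && return 0
    i=$(( i + 1 ))
    sleep 1
  done
  return 1
}

# Usage: block until the local API reports in, or fail after ~30s.
# wait_for_backend http://127.0.0.1:8080/info 30 || exit 1
```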

It also seems sensu-agent on backend1 crashes after being started while only backend1 is running. The service starts and stays running just fine once backend2 is brought online. I presume this is because the etcd cluster is not in a healthy state.

sensu_backend1 12:44:24$ systemctl status sensu-agent -l
  ● sensu-agent.service - The Sensu Agent process.
     Loaded: loaded (/usr/lib/systemd/system/sensu-agent.service; enabled; vendor preset: disabled)
     Active: failed (Result: start-limit) since Sun 2018-08-05 16:43:43 UTC; 41s ago
    Process: 2546 ExecStart=/usr/bin/sensu-agent start (code=exited, status=1/FAILURE)
   Main PID: 2546 (code=exited, status=1/FAILURE)

  Aug 05 16:43:42 sensu_backend1 systemd[1]: Unit sensu-agent.service entered failed state.
  Aug 05 16:43:42 sensu_backend1 systemd[1]: sensu-agent.service failed.
  Aug 05 16:43:43 sensu_backend1 systemd[1]: sensu-agent.service holdoff time over, scheduling restart.
  Aug 05 16:43:43 sensu_backend1 systemd[1]: start request repeated too quickly for sensu-agent.service
  Aug 05 16:43:43 sensu_backend1 systemd[1]: Failed to start The Sensu Agent process..
  Aug 05 16:43:43 sensu_backend1 systemd[1]: Unit sensu-agent.service entered failed state.
  Aug 05 16:43:43 sensu_backend1 systemd[1]: sensu-agent.service failed.

The environment producing the above output is Docker with no syslog, so if the above isn't expected I can try to reproduce it in a more complete environment where syslog logging is installed and functioning.

csoleimani commented 5 years ago

I've been able to add new members to the cluster. It's a little tricky, and I have been doing it manually instead of with Puppet, but I was finally able to get it working. @treydock I'd be willing to do a Zoom call with you to see your process sometime next week during the workday. Feel free to DM me - it's the least I can do since you've jumped on my sensu-puppet repo issues so quickly :)

calebhailey commented 4 years ago

I'm going to go ahead and close this one as resolved via the Sensu Clustering guide: https://docs.sensu.io/sensu-go/latest/guides/clustering/

The original issue here was opened pre-GA, before we had documented the steps to cluster Sensu.

@csoleimani thanks for chiming in to help with this one! Let us know if you need any assistance with automated clustering and we'll be happy to help!