kopeio / etcd-manager

operator for etcd: moved to https://github.com/kubernetes-sigs/etcdadm

etcd-manager can get into state where it doesn't start etcd on some nodes #288

Open adammw opened 4 years ago

adammw commented 4 years ago

We're currently seeing our cluster get into a state where its cluster state knows about all three members, but marks one as unhealthy because it isn't responding to etcd health checks. However, the reason it isn't responding is that the gRPC command to join the cluster has never been initiated, because it already knows the member exists.

Of note is that this host runs two instances of etcd-manager, one for events and one for main Kubernetes objects. Only one of the instances is "broken".

Log excerpt from etcd-manager leader:

I0103 18:44:51.268012   18771 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 18:44:51.818575   18771 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0103 18:44:51.879649   18771 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 18:44:51.879750   18771 hosts.go:181] skipping update of unchanged /etc/hosts
2020-01-03 18:44:55.679341 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-03 18:44:55.679371 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
I0103 18:44:58.933053   18771 controller.go:173] starting controller iteration
I0103 18:44:58.933090   18771 controller.go:269] I am leader with token "[REDACTED]"
2020-01-03 18:45:00.679490 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:00.679521 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
W0103 18:45:03.957167   18771 controller.go:703] health-check unable to reach member 2595344402187300919: error building etcd client for https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002: dial tcp 172.28.196.130:4002: connect: connection refused
I0103 18:45:03.957196   18771 controller.go:276] etcd cluster state: etcdClusterState
  members:
    {"name":"etcd-events-etcd-us-west-2c","peerURLs":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002"],"ID":"2595344402187300919"}
      NOT HEALTHY
    {"name":"etcd-events-etcd-us-west-2b","peerURLs":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002"],"ID":"14454711989398209995"}
    {"name":"etcd-events-etcd-us-west-2a","peerURLs":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002"],"ID":"16707933308235350511"}
  peers:
    etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2a" endpoints:"172.28.192.102:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" > etcd_state:<cluster:<cluster_token:"Ty6K7M5AzR1HeBeARXgqAA" nodes:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" tls_enabled:true > > etcd_version:"3.3.13" > }
    etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2b" endpoints:"172.28.194.230:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" > etcd_state:<cluster:<cluster_token:"Ty6K7M5AzR1HeBeARXgqAA" nodes:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2a" peer_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:3995" tls_enabled:true > nodes:<name:"etcd-events-etcd-us-west-2b" peer_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:3995" tls_enabled:true > > etcd_version:"3.3.13" > }
    etcdClusterPeerInfo{peer=peer{id:"etcd-events-etcd-us-west-2c" endpoints:"172.28.196.130:3997" }, info=cluster_name:"etcd-events" node_configuration:<name:"etcd-events-etcd-us-west-2c" peer_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381" client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002" quarantined_client_urls:"https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:3995" > }
I0103 18:45:03.957341   18771 controller.go:277] etcd cluster members: map[14454711989398209995:{"name":"etcd-events-etcd-us-west-2b","peerURLs":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:4002"],"ID":"14454711989398209995"} 16707933308235350511:{"name":"etcd-events-etcd-us-west-2a","peerURLs":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:4002"],"ID":"16707933308235350511"} 2595344402187300919:{"name":"etcd-events-etcd-us-west-2c","peerURLs":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:2381"],"endpoints":["https://etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:4002"],"ID":"2595344402187300919"}]
I0103 18:45:03.957362   18771 controller.go:615] sending member map to all peers: members:<name:"etcd-events-etcd-us-west-2a" dns:"etcd-events-etcd-us-west-2a.internal.redacted.k8s.local" addresses:"172.28.192.102" > members:<name:"etcd-events-etcd-us-west-2b" dns:"etcd-events-etcd-us-west-2b.internal.redacted.k8s.local" addresses:"172.28.194.230" >
I0103 18:45:03.957569   18771 etcdserver.go:226] updating hosts: map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]]
I0103 18:45:03.957808   18771 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 18:45:04.011465   18771 commands.go:22] not refreshing commands - TTL not hit
I0103 18:45:04.011495   18771 s3fs.go:220] Reading file "s3://zendesk-compute-kops-state-staging/redacted.k8s.local/backups/etcd/events/control/etcd-cluster-created"
I0103 18:45:04.042214   18771 controller.go:369] spec member_count:3 etcd_version:"3.3.13"
I0103 18:45:04.042271   18771 controller.go:494] etcd has unhealthy members, but we already have a slot where we could add another member
I0103 18:45:04.042294   18771 controller.go:531] controller loop complete
2020-01-03 18:45:05.679628 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:05.679658 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
2020-01-03 18:45:10.679762 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_SNAPSHOT")
2020-01-03 18:45:10.679790 W | rafthttp: health check for peer 240483fbaa2c3037 could not connect: dial tcp 172.28.196.130:2381: connect: connection refused (prober "ROUND_TRIPPER_RAFT_MESSAGE")
adammw commented 4 years ago

Logs from the 2c node not coming up:

I0103 17:48:26.432863   13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.433831   13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.434384   13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.434920   13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.435415   13772 main.go:254] Mounting available etcd volumes matching tags [k8s.io/etcd/events k8s.io/role/master=1 kubernetes.io/cluster/redacted.k8s.local=owned]; nameTag=k8s.io/etcd/events
I0103 17:48:26.436472   13772 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 17:48:26.553693   13772 mounter.go:64] Master volume "vol-09606ef630d315dcb" is attached at "/dev/xvdv"
I0103 17:48:26.553725   13772 mounter.go:78] Doing safe-format-and-mount of /dev/xvdv to /mnt/master-vol-09606ef630d315dcb
I0103 17:48:26.553737   13772 volumes.go:233] volume vol-09606ef630d315dcb not mounted at /rootfs/dev/xvdv
I0103 17:48:26.553773   13772 volumes.go:247] found nvme volume "nvme-Amazon_Elastic_Block_Store_vol09606ef630d315dcb" at "/dev/nvme2n1"
I0103 17:48:26.553783   13772 mounter.go:116] Found volume "vol-09606ef630d315dcb" mounted at device "/dev/nvme2n1"
I0103 17:48:26.554367   13772 mounter.go:173] Device already mounted on "/mnt/master-vol-09606ef630d315dcb", verifying it is our device
I0103 17:48:26.554380   13772 mounter.go:185] Found existing mount of "/dev/nvme2n1" at "/mnt/master-vol-09606ef630d315dcb"
I0103 17:48:26.554444   13772 mount_linux.go:164] Detected OS without systemd
I0103 17:48:26.555078   13772 mounter.go:226] matched device "/dev/nvme2n1" and "/dev/nvme2n1" via '\x00'
I0103 17:48:26.555090   13772 mounter.go:86] mounted master volume "vol-09606ef630d315dcb" on /mnt/master-vol-09606ef630d315dcb
I0103 17:48:26.555099   13772 main.go:269] discovered IP address: 172.28.196.130
I0103 17:48:26.555104   13772 main.go:274] Setting data dir to /rootfs/mnt/master-vol-09606ef630d315dcb
I0103 17:48:26.556281   13772 server.go:71] starting GRPC server using TLS, ServerName="etcd-manager-server-etcd-events-etcd-us-west-2c"
I0103 17:48:26.715787   13772 server.go:89] GRPC server listening on "172.28.196.130:3997"
I0103 17:48:26.715965   13772 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 17:48:26.783807   13772 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0103 17:48:26.897586   13772 peers.go:101] found new candidate peer from discovery: etcd-events-etcd-us-west-2a [{172.28.192.102 0} {172.28.192.127 0} {172.28.192.102 0}]
I0103 17:48:26.897625   13772 peers.go:101] found new candidate peer from discovery: etcd-events-etcd-us-west-2b [{172.28.194.230 0} {172.28.194.230 0} {172.28.194.57 0}]
I0103 17:48:26.897633   13772 peers.go:101] found new candidate peer from discovery: etcd-events-etcd-us-west-2c [{172.28.196.130 0} {172.28.196.130 0}]
I0103 17:48:26.897649   13772 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.192.127:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.194.57:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 17:48:26.897732   13772 peers.go:281] connecting to peer "etcd-events-etcd-us-west-2a" with TLS policy, servername="etcd-manager-server-etcd-events-etcd-us-west-2a"
I0103 17:48:26.897757   13772 peers.go:281] connecting to peer "etcd-events-etcd-us-west-2c" with TLS policy, servername="etcd-manager-server-etcd-events-etcd-us-west-2c"
I0103 17:48:26.897734   13772 peers.go:281] connecting to peer "etcd-events-etcd-us-west-2b" with TLS policy, servername="etcd-manager-server-etcd-events-etcd-us-west-2b"
I0103 17:48:27.048261   13772 etcdserver.go:226] updating hosts: map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]]
I0103 17:48:27.048295   13772 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 17:48:28.715924   13772 controller.go:173] starting controller iteration
I0103 17:48:28.715976   13772 controller.go:198] we are not leader

(the last two lines keep printing every few seconds)

adammw commented 4 years ago

cc @justinsb

adammw commented 4 years ago

It seems that the etcd-manager leader is supposed to be telling 2c to start, but it isn't, since 2c is already treated as though etcd has started and is just unhealthy.
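
To make the gap concrete, here is a toy Go model of what the leader's loop appears to be doing. This is not the actual etcd-manager source; the types and the reconcile structure are assumptions for illustration only. The point it shows: a peer whose member entry already exists is never sent the JoinCluster gRPC, and that RPC is the only path that would start etcd on the peer, so the peer stays "NOT HEALTHY" indefinitely.

// Toy model only, not the actual etcd-manager source: the member/peer types and
// the reconcile structure below are assumptions used to illustrate the suspected
// gap in the leader's control loop.
package main

import "fmt"

type memberState struct{ healthy bool }   // as seen in the etcd member list
type peerState struct{ etcdRunning bool } // as reported by the etcd-manager peer

// reconcile models one leader iteration over the discovered peers.
func reconcile(members map[string]memberState, peers map[string]peerState) {
    for name, p := range peers {
        m, registered := members[name]
        switch {
        case !registered:
            // Only unregistered peers are told to join; this is what starts etcd.
            fmt.Printf("%s: sending JoinCluster\n", name)
        case !m.healthy && !p.etcdRunning:
            // The stuck case from this issue: the member is registered but
            // unhealthy, etcd is not running on the peer, and no action is taken.
            fmt.Printf("%s: NOT HEALTHY, but no JoinCluster is sent\n", name)
        default:
            fmt.Printf("%s: healthy\n", name)
        }
    }
}

func main() {
    members := map[string]memberState{
        "etcd-events-etcd-us-west-2a": {healthy: true},
        "etcd-events-etcd-us-west-2b": {healthy: true},
        "etcd-events-etcd-us-west-2c": {healthy: false},
    }
    peers := map[string]peerState{
        "etcd-events-etcd-us-west-2a": {etcdRunning: true},
        "etcd-events-etcd-us-west-2b": {etcdRunning: true},
        "etcd-events-etcd-us-west-2c": {etcdRunning: false},
    }
    reconcile(members, peers)
}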

nixop commented 4 years ago

I faced the same issue. Is there any workaround?

Dimitar-Boychev commented 4 years ago

I have the same issue with etcd-events. I tried scaling the masters down in the affected zone, then scaled them back to 1 (1 in this zone, 3 in total). This recreated the master with empty volumes for etcd-main and etcd-events. etcd-main joined without problems, but etcd-events is stuck and its volume stays empty. My logs look exactly the same: 3 endpoints, one not healthy, and the new etcd-events member does not want to pull the data from the existing cluster.

pkutishch commented 4 years ago

Plus one here, just hit this... any workaround would be appreciated.

Dimitar-Boychev commented 4 years ago

@pkutishch we have found a fix (the member-removal step is sketched in code right after this list):

  1. Scale down the InstanceGroup with the broken etcd.
  2. Exec into a pod with a working etcd and remove the broken member with "etcdctl member remove".
  3. Scale the InstanceGroup back to 1; our node then joined the cluster successfully.
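
For step 2, here is a minimal Go sketch of the same member removal via the etcd clientv3 API (the programmatic equivalent of "etcdctl member remove"). It is illustrative only: the endpoint, certificate paths, and the name suffix used to pick the broken member are placeholders, not values from this cluster.

// Hedged sketch: remove a stale etcd member programmatically, equivalent to
// "etcdctl member remove". Endpoint, cert paths, and the name match below are
// placeholders/assumptions.
package main

import (
    "context"
    "log"
    "strings"
    "time"

    "go.etcd.io/etcd/clientv3"      // v3.4 import path; etcd 3.3 vendored it as github.com/coreos/etcd/clientv3
    "go.etcd.io/etcd/pkg/transport"
)

func main() {
    // TLS material as used by the healthy members (paths are assumptions).
    tlsInfo := transport.TLSInfo{
        CertFile:      "/path/to/etcd-client.crt",
        KeyFile:       "/path/to/etcd-client.key",
        TrustedCAFile: "/path/to/etcd-ca.crt",
    }
    tlsConfig, err := tlsInfo.ClientConfig()
    if err != nil {
        log.Fatal(err)
    }

    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{"https://etcd-events-etcd-us-west-2a.internal.example.k8s.local:4002"},
        DialTimeout: 5 * time.Second,
        TLS:         tlsConfig,
    })
    if err != nil {
        log.Fatal(err)
    }
    defer cli.Close()

    ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
    defer cancel()

    // Find the broken member by name and remove it, so the recreated node is
    // treated as a brand-new member instead of an existing, unhealthy one.
    resp, err := cli.MemberList(ctx)
    if err != nil {
        log.Fatal(err)
    }
    for _, m := range resp.Members {
        if strings.HasSuffix(m.Name, "us-west-2c") {
            if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
                log.Fatal(err)
            }
            log.Printf("removed member %q (%x)", m.Name, m.ID)
        }
    }
}

Removing the member first is what lets the recreated node come back as a brand-new member and resync its data, instead of being treated as an existing-but-unhealthy one.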

I think this is because etcd sees the member as an already existing cluster member and just tries to join it, instead of checking the data and re-adding the node as a new one to resync the data. We have reported this, along with our way to fix it, in https://github.com/kubernetes/kops/issues/9264

pkutishch commented 4 years ago

@Dimitar-Boychev Here is the thing. I got it running in manual mode, but the symptoms were as follows:

  1. etcd-manager was running and not producing any errors for the events instance.
  2. Terminating the master node didn't help either (it's a kops cluster).

As a solution I started etcd manually in the etcd-manager container. Honestly I expected the etcd process to throw an error during startup, but it started fine and brought the node into the cluster. When etcd-manager restarted it, though, it didn't work, even with attempts to upgrade the version.

Interestingly, this happened on only one node out of three.

ex3cv commented 4 years ago

We hit this bug a few days ago. Due to the incident we had to delete the VM and lost all data stored on the attached volumes (for the events and main etcd instances). Since then we cannot make the newly provisioned master node join the cluster.

Terminating the failed master doesn't work for us. And to start etcd manually you have to have all the pieces in place; in our case we don't have all the necessary keys and certs. Does anyone know how we can recreate them?

Dimitar-Boychev commented 4 years ago

@pkutishch I think that by restarting etcd-manager you put yourself in the same situation as before. Here is what I think happens:

  1. You delete the etcd node from the cluster.
  2. You run it manually inside the etcd-manager container and it starts fine, because the new etcd member is not known to the cluster and is considered a new one... meaning the cluster provisions the data to it and registers it as a member.
  3. Then you think "OK, I fixed it" and restart it via etcd-manager... etcd-manager starts the pods, but the hostname is the same, so when the cluster sees the pods it recognizes them as already part of the cluster and does not sync the data to them.

If I am right and the above steps are what you did, it's probably important to really follow the same steps I listed earlier to fix the problem.

All of this is just based on our easy way to recreate this problem: go onto one etcd node and delete all the data in the etcd data directory :) After a restart, the cluster sees the joining etcd node, recognizes the hostname as an existing member, and just tries to join it without checking whether the joiner has the data or not...