Open adammw opened 4 years ago
Logs from the 2c node that is not coming up:
I0103 17:48:26.432863 13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.433831 13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.434384 13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.434920 13772 volumes.go:85] AWS API Request: ec2metadata/GetMetadata
I0103 17:48:26.435415 13772 main.go:254] Mounting available etcd volumes matching tags [k8s.io/etcd/events k8s.io/role/master=1 kubernetes.io/cluster/redacted.k8s.local=owned]; nameTag=k8s.io/etcd/events
I0103 17:48:26.436472 13772 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 17:48:26.553693 13772 mounter.go:64] Master volume "vol-09606ef630d315dcb" is attached at "/dev/xvdv"
I0103 17:48:26.553725 13772 mounter.go:78] Doing safe-format-and-mount of /dev/xvdv to /mnt/master-vol-09606ef630d315dcb
I0103 17:48:26.553737 13772 volumes.go:233] volume vol-09606ef630d315dcb not mounted at /rootfs/dev/xvdv
I0103 17:48:26.553773 13772 volumes.go:247] found nvme volume "nvme-Amazon_Elastic_Block_Store_vol09606ef630d315dcb" at "/dev/nvme2n1"
I0103 17:48:26.553783 13772 mounter.go:116] Found volume "vol-09606ef630d315dcb" mounted at device "/dev/nvme2n1"
I0103 17:48:26.554367 13772 mounter.go:173] Device already mounted on "/mnt/master-vol-09606ef630d315dcb", verifying it is our device
I0103 17:48:26.554380 13772 mounter.go:185] Found existing mount of "/dev/nvme2n1" at "/mnt/master-vol-09606ef630d315dcb"
I0103 17:48:26.554444 13772 mount_linux.go:164] Detected OS without systemd
I0103 17:48:26.555078 13772 mounter.go:226] matched device "/dev/nvme2n1" and "/dev/nvme2n1" via '\x00'
I0103 17:48:26.555090 13772 mounter.go:86] mounted master volume "vol-09606ef630d315dcb" on /mnt/master-vol-09606ef630d315dcb
I0103 17:48:26.555099 13772 main.go:269] discovered IP address: 172.28.196.130
I0103 17:48:26.555104 13772 main.go:274] Setting data dir to /rootfs/mnt/master-vol-09606ef630d315dcb
I0103 17:48:26.556281 13772 server.go:71] starting GRPC server using TLS, ServerName="etcd-manager-server-etcd-events-etcd-us-west-2c"
I0103 17:48:26.715787 13772 server.go:89] GRPC server listening on "172.28.196.130:3997"
I0103 17:48:26.715965 13772 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0103 17:48:26.783807 13772 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0103 17:48:26.897586 13772 peers.go:101] found new candidate peer from discovery: etcd-events-etcd-us-west-2a [{172.28.192.102 0} {172.28.192.127 0} {172.28.192.102 0}]
I0103 17:48:26.897625 13772 peers.go:101] found new candidate peer from discovery: etcd-events-etcd-us-west-2b [{172.28.194.230 0} {172.28.194.230 0} {172.28.194.57 0}]
I0103 17:48:26.897633 13772 peers.go:101] found new candidate peer from discovery: etcd-events-etcd-us-west-2c [{172.28.196.130 0} {172.28.196.130 0}]
I0103 17:48:26.897649 13772 hosts.go:84] hosts update: primary=map[], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.192.127:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.194.57:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 17:48:26.897732 13772 peers.go:281] connecting to peer "etcd-events-etcd-us-west-2a" with TLS policy, servername="etcd-manager-server-etcd-events-etcd-us-west-2a"
I0103 17:48:26.897757 13772 peers.go:281] connecting to peer "etcd-events-etcd-us-west-2c" with TLS policy, servername="etcd-manager-server-etcd-events-etcd-us-west-2c"
I0103 17:48:26.897734 13772 peers.go:281] connecting to peer "etcd-events-etcd-us-west-2b" with TLS policy, servername="etcd-manager-server-etcd-events-etcd-us-west-2b"
I0103 17:48:27.048261 13772 etcdserver.go:226] updating hosts: map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]]
I0103 17:48:27.048295 13772 hosts.go:84] hosts update: primary=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local]], fallbacks=map[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local:[172.28.192.102 172.28.192.127 172.28.192.102] etcd-events-etcd-us-west-2b.internal.redacted.k8s.local:[172.28.194.230 172.28.194.230 172.28.194.57] etcd-events-etcd-us-west-2c.internal.redacted.k8s.local:[172.28.196.130 172.28.196.130]], final=map[172.28.192.102:[etcd-events-etcd-us-west-2a.internal.redacted.k8s.local] 172.28.194.230:[etcd-events-etcd-us-west-2b.internal.redacted.k8s.local] 172.28.196.130:[etcd-events-etcd-us-west-2c.internal.redacted.k8s.local etcd-events-etcd-us-west-2c.internal.redacted.k8s.local]]
I0103 17:48:28.715924 13772 controller.go:173] starting controller iteration
I0103 17:48:28.715976 13772 controller.go:198] we are not leader
(the last two lines keep repeating every few seconds)
cc @justinsb
It seems that the etcd-manager leader is supposed to be telling 2c to start, but it's not since 2c thinks that etcd has already started but is just unhealthy.
I faced the same issue. Is there any workaround?
I have the same issue with etcd-events. I tried scaling the masters down in the affected zone, then back up to 1 (1 in this zone, 3 in total). This recreated the master with empty volumes for etcd-main and etcd-events. etcd-main joined without problems, but etcd-events got stuck and its volume stayed empty. My logs look exactly the same: 3 endpoints, one not healthy, and the new etcd-events does not want to get the data from the existing cluster.
Plus one here, just hit this... any workaround would be appreciated.
@pkutishch we have found a fix: 1) Scale down the InstanceGroup with the broken etcd. 2) Exec inside a pod with a working etcd and remove the broken member with "etcdctl member remove". 3) Scale the InstanceGroup back to 1; our node then joined the cluster successfully.
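For anyone who wants the concrete commands, here is a rough sketch of those three steps. The instance group name, pod name, etcd endpoint, and certificate paths below are placeholders (not taken from this issue) and will differ per cluster, so adjust them before running anything:

```sh
# 1) Scale the instance group that owns the broken etcd member down to 0.
#    "master-us-west-2c" is a placeholder instance group name.
kops edit instancegroup master-us-west-2c    # set minSize and maxSize to 0
kops update cluster --yes

# 2) From a healthy etcd member, remove the broken member.
#    Pod name, endpoint, and certificate paths are illustrative.
kubectl -n kube-system exec -it etcd-manager-events-<healthy-master> -- sh

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4002 \
  --cacert=/path/to/etcd-ca.crt \
  --cert=/path/to/etcd-client.crt \
  --key=/path/to/etcd-client.key \
  member list                        # note the ID of the broken member

ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4002 \
  --cacert=/path/to/etcd-ca.crt \
  --cert=/path/to/etcd-client.crt \
  --key=/path/to/etcd-client.key \
  member remove <BROKEN_MEMBER_ID>

# 3) Scale the instance group back to 1; the recreated node should rejoin
#    and resync from the existing members.
kops edit instancegroup master-us-west-2c    # set minSize and maxSize back to 1
kops update cluster --yes
```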
I think this is because etcd sees the member as an already existing cluster member and just tries to join it, instead of checking the data and rejoining the node as a new one to resync the data. We have reported this, along with our way of fixing it, in https://github.com/kubernetes/kops/issues/9264
@Dimitar-Boychev Here is the thing. I got it running in manual mode, but the symptoms are as follows:
As a solution I started etcd manually in the etcd-manager container. Honestly, I expected the etcd process to throw an error during startup, but it started fine and brought the node into the cluster. However, restarting etcd-manager didn't work, even with attempts to upgrade the version.
Interestingly, this happened on only one node out of the three.
We hit this bug a few days ago. Due to the incident we had to delete the VM and lost all data stored on the attached volumes (for both the events and main etcd instances). Since then we cannot get the newly provisioned master node to join the cluster.
Terminating the failed master doesn't work for us. And to start etcd manually you need to have all the pieces in place, but in our case we don't have all the necessary keys and certs. Does anyone know how we can recreate them?
@pkutishch I think that by restarting etcd-manager you put yourself in the same situation as before. Here is what I think happens: 1) You delete the etcd node from the cluster. 2) You run it manually inside the etcd-manager container and it starts fine, as the new etcd member is not known to the cluster and is considered a new one... meaning the cluster provisions the data to the new one and registers it as a member. 3) Then you think "OK, I fixed it" and restart it via etcd-manager... etcd-manager starts the pods, but the hostname is the same, so when the cluster sees the pods it recognizes them as already part of the cluster and does not sync the data with them.
If I am right and the above steps are what you did, it's probably important to follow exactly the same steps to fix the problem:
All of this is just based on the easy way we found to recreate this problem: just go onto 1 etcd node and delete all the data in the etcd data directory :) After a restart, the cluster sees the joining etcd node, recognizes the hostname as an existing member, and just tries to join it without checking whether the joiner has the data or not...
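A quick way to see the state described above (a sketch only; the endpoint and certificate paths are placeholders, not from this issue): after the data directory is wiped, the node still shows up in the member list on the healthy members, which is why the cluster treats it as an existing member rather than a new joiner that needs the data.

```sh
# Run from any healthy etcd member: the wiped node is still registered,
# so the cluster "recognizes the hostname" and never re-syncs its data.
# Endpoint and certificate paths are placeholders; adjust per cluster.
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:4002 \
  --cacert=/path/to/etcd-ca.crt \
  --cert=/path/to/etcd-client.crt \
  --key=/path/to/etcd-client.key \
  member list -w table
```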
We are currently seeing our cluster in a state where its cluster state knows about all three members, but it marks one as unhealthy because it's not responding to etcd checks. However, the reason it's not responding is that the gRPC command to join the cluster hasn't been initiated, because it already knows the member exists.
Of note is that this host runs two instances of etcd-manager, one for events and one for main Kubernetes objects. Only one of the instances is "broken".
Log excerpt from etcd-manager leader: