kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

Unable to create cluster with more than 1 etcd #1206

Closed. mludvig closed this issue 6 years ago

mludvig commented 6 years ago

Hi, I've been trying for a few hours to create a cluster with 3 etcd instances but always got a timeout. It looks like the ASG for Etcd0 is created first, and its instance keeps trying to connect to the other two etcd instances, but they do not yet exist, so the initialisation times out. If the Etcd1 and Etcd2 ASGs were created in parallel it would probably work, as the instances would start up simultaneously and could connect to each other.

I had the same results with both .etcd.memberIdentityProvider == eip and eni: in both cases etcd0 tried to connect to the other, not-yet-existing nodes (over EIP or ENI respectively) and timed out.

I'm using a pre-existing VPC with existing subnets: 3x private with NAT and 3x DMZ with public IPs enabled by default. I tried placing the etcd nodes in both the private and the DMZ subnets, and both failed when more than one node was requested.

iherbmatt commented 6 years ago

Hi Everyone,

I'm seeing this now:

member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
cluster is healthy

It appears etcd is healthy, and I'm seeing this in the controller logs as well. However, I'm now having trouble getting the controllers to come up. I'm going to try to build it again and see what happens.
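For reference, health output in that format is what etcd's v2 CLI prints for a member-by-member health check; a minimal sketch of the command (endpoints and certificate paths are placeholders, not necessarily the exact ones kube-aws uses):

etcdctl \
  --endpoints=https://etcd0.example.compute.amazonaws.com:2379 \
  --ca-file=/etc/ssl/certs/etcd-trusted-ca.pem \
  --cert-file=/etc/ssl/certs/etcd-client.pem \
  --key-file=/etc/ssl/certs/etcd-client-key.pem \
  cluster-health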

luck02 commented 6 years ago

I applied this:

commit 65722a891eca5e8a5ff9538e2837d7bbeb84390f (HEAD -> unbounce-v0.9.8, tag: v0.9.8-hotfix6, origin/unbounce-v0.9.8, v0.9.8-hotfix6, v0.9.8-hotfix5)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Thu Apr 5 08:40:40 2018 -0700

    trying mumoshis fix

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index e85ca23c..b8a56949 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,7 +140,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/sleep 60
+        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure

Cluster came up, I'm happy :D
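For anyone reading along: ExecStartPre commands must exit 0 before ExecStart runs, and systemctl is-active exits non-zero while the queried unit is still starting (or has failed), so this change makes etcdadm-reconfigure fail fast and get retried (given the RestartSec=5 in the unit above) until the volume has actually been formatted, instead of sleeping a fixed 60 seconds and hoping. A quick way to see the exit-code behaviour on a node, using the unit name from the patch:

systemctl is-active format-etcd2-volume.service
echo $?   # 0 only once the unit is active; non-zero while it is still starting or has failed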

iherbmatt commented 6 years ago

I wonder if it has something to do with the fact that I'm using 0.9.9 instead of 0.9.8. The etcd cluster comes up fine, but my controllers don't come online even though they are built.

Here is the output I'm seeing loop in journalctl from the controllers:

output.txt

mumoshu commented 6 years ago

@iherbmatt Hi! Kubelet seems fine to me. Can you share the full output from journalctl, rather than kubelet's log only?

mumoshu commented 6 years ago

@davidmccormick

Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be a part of the etcd cluster?

Partially yes, and partially no? I guess you may be confusing two things. Generally there are two major categories of failure: transient and permanent failures of etcd node(s).

A transient failure is when the underlying EC2 instance fails due to an AWS infrastructure issue. In this case, the ASG just recreates the EC2 instance to resolve the issue. Suppose you have a 3-node etcd cluster: you may notice that you have 3 ASGs in total, each matching one etcd node (= member). We also have a pool of EIP+EBS pairs from which each etcd member borrows its identity and data directory.

A permanent failure is when, for example, the EBS volume serving the etcd data directory is corrupted, so that you have to recover the etcd member from an etcd snapshot (not an EBS snapshot).
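For anyone unfamiliar with what recovering from an etcd snapshot involves, a rough sketch with the etcd v3 CLI (paths are illustrative; in kube-aws this is driven by etcdadm and the automated options below rather than by hand):

# take a snapshot from a healthy member
ETCDCTL_API=3 etcdctl snapshot save /var/run/coreos/etcdadm/snapshots/etcd.db
# restore it into a fresh data directory on the replacement member
ETCDCTL_API=3 etcdctl snapshot restore /var/run/coreos/etcdadm/snapshots/etcd.db \
  --data-dir /var/lib/etcd2/restored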

etcd.disasterRecovery.automated and etcd.snapshot.automated are for the latter case, and AFAICS we have no simpler way to do that. Just marking every etcd-member type to simple results in losing

That being said,

Isn't having a service reconfigure the type of etcd service a lot of added complexity?

Definitely. I'm open to ideas for setting type to notify statically while somehow still allowing us to cover these use-cases:

  1. Rolling updates of etcd nodes, postponed and rolled back when the new member fails to join the existing cluster
    • kube-aws as of today achieves this by setting a cfn DependsOn from the previous to the next etcd ASG (= node); see the sketch after this list
  2. Initial bootstrap of the etcd cluster
    • DependsOn requires us to provision etcd ASGs one by one, so we have to set type to simple for the first N/2 etcd ASGs.
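A minimal sketch of the DependsOn chain referred to in item 1 (resource names, timeouts, and omitted properties are illustrative, not the actual kube-aws template):

Resources:
  Etcd0:
    Type: AWS::AutoScaling::AutoScalingGroup
    CreationPolicy:
      ResourceSignal:
        Count: 1          # cfn-signal sent by the instance once its etcd member is up
        Timeout: PT15M
    Properties:           # launch configuration, subnets, etc. omitted
      MinSize: 1
      MaxSize: 1
  Etcd1:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: Etcd0      # not even created until Etcd0 has signalled success
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT15M
    Properties:
      MinSize: 1
      MaxSize: 1

With such a chain, Etcd0 has to come up entirely on its own before Etcd1 exists, which is exactly why a fresh bootstrap cannot rely on all members finding each other.
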
mumoshu commented 6 years ago

@davidmccormick

What might make more sense is to deploy all 3 (n) at once when you perform a fresh cluster install but only roll in one-by-one when upgrading

Good point! This is what I gave up on when I first implemented the H/A etcd about a year ago. It may be time to consider alternative implementations or possible enhancements.

  • I'm not all that familiar with cloud-formation but I think I might have seen the controllers behaving in this way?

Did you mean kube-aws controller nodes? Then yes, controller nodes behave that way: there's a single multi-AZ ASG managing the desired number of controller EC2 instances.

Implementation-wise, we can't do the same for etcd nodes though. We have to give each etcd node a stable network identity plus an EBS volume, and an EBS volume is tied to a single AZ. What if we had one 3-AZ ASG and 3 EBS volumes, each tied to a separate AZ, for 3 etcd nodes, and then one of the AZs failed? The ASG would try to launch a replacement etcd node in one of the 2 remaining AZs, where the EBS volume holding the original etcd data doesn't exist! In that sense, I believe we have to stick with the 1-etcd-ASG-per-AZ pattern.
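To illustrate that constraint (hypothetical names, not the real template): each member's data volume is an AZ-scoped EBS volume, so its ASG is pinned to subnets in that one AZ only:

Etcd0DataVolume:
  Type: AWS::EC2::Volume
  Properties:
    AvailabilityZone: us-west-2a      # an EBS volume lives in exactly one AZ
    Size: 30
Etcd0:
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier:
      - subnet-0123456789abcdef0      # a subnet in us-west-2a only, so a replacement
                                      # instance can still attach Etcd0DataVolume
    MinSize: 1
    MaxSize: 1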

But anyway,

This way quorum can be achieved before the cfn-signal is sent. In a fresh install I would personally also bring up the controllers and nodes without waiting too.

This should be discussed further. How about just omitting DependsOn on the etcd ASGs for the initial bootstrap via kube-aws up, and then adding DependsOn on the subsequent kube-aws update run? Would just adding DependsOn actually result in a rolling update of the etcd ASGs?

iherbmatt commented 6 years ago

@mumoshu I was really excited to see my etcd nodes build successfully. I even logged in and saw they were all healthy, but then I saw the same CloudFormation timeouts on the controllers. I will redact some identifying data from the journalctl log and attach it. Thank you in advance for your time :)

mumoshu commented 6 years ago

@iherbmatt Thanks!

If I could ask for more, sharing your cluster.yaml would also help! I know cluster bootstrapping shouldn't be such an exciting and hard thing to do, but there are certainly many failure cases, and some can be pinpointed just by looking at your cluster.yaml.

iherbmatt commented 6 years ago

@mumoshu Here is the cluster.yaml file. cluster-yaml.txt

iherbmatt commented 6 years ago

@mumoshu Here is the journalctl log from the controllers that would not start up. journalctl-redacted.log

Vince-Cercury commented 6 years ago

For me the issue starts with CoreOS 1688.5.3, released in April. The previous version (1632.3.0, released February 15, 2018) is not affected.

With the patch from @mumoshu, the etcd nodes update fine with CoreOS 1688.5.3. However, the controllers don't, and the stack rolls back.

iherbmatt commented 6 years ago

@mumoshu Any thoughts?

iherbmatt commented 6 years ago

@VinceMD Are you unable to build clusters as well?

mludvig commented 6 years ago

@iherbmatt I had the same problem while testing the proposed fix because I changed the cluster name in cluster.yml but the certificates were still for the old name. That led to exactly the same issue you observe: after the etcd nodes were created, the controllers failed to build. Removing credentials/ and recreating the certs fixed it.
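A sketch of that recovery (exact subcommands can vary between kube-aws versions; keep a backup of the old assets just in case):

cd my-cluster/                    # the directory containing cluster.yaml
mv credentials credentials.bak    # keep the old certs around
kube-aws render credentials       # regenerate TLS assets for the new cluster name
kube-aws validate
kube-aws up                       # or "kube-aws update" for an existing cluster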

Vince-Cercury commented 6 years ago

@iherbmatt Correct, with the latest version of CoreOS. If I use the February release of the AMI, then it's all fine. A colleague of mine is facing the same issue.

iherbmatt commented 6 years ago

I wish that change would work for me.

It just sits there and eventually times out when trying to build the controllers. I even used CoreOS-stable-1632.3.0-hvm (ami-862140e9).

It's been almost 2 weeks that I've been unable to build clusters :(

mumoshu commented 6 years ago

@iherbmatt Sorry for the trouble! Your etcd seems fine, but from the logs I see the Calico installer is complaining.

Perhaps you are hit by the recent regression in master? Would you mind trying kube-aws v0.9.10-rc.3? If it still doesn't work, trying k8s 1.9.3, which is the default in 0.9.10-rc.3, may change something.

iherbmatt commented 6 years ago

Hi @mumoshu. I was able to generate a cluster with 0.9.10-rc.3, but it had to be running Kubernetes 1.9.3, otherwise it wouldn't work. Another issue I have, however, is that I cannot use m5 instances for the etcd nodes. Any reason you can think of that might explain why? Thanks!

Confushion commented 6 years ago

Hi @mumoshu

Seems you were right about etcdadm-reconfigure.service wanting a formatted /var/lib/etcd2. However, your fix seemed not to wait for the service to become active, but to fail when it is not yet active... so the timeouts were still happening, unfortunately.

The patch below fixes this by actually depending on the var-lib-etcd2.mount unit (which is the one it should depend on, and which in turn depends on format-etcd2-volume.service anyway...).

Also the WantedBy line wasn't doing anything useful AFAIK...

Thanks.

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index fc077436..a291fdbf 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -151,6 +151,7 @@ coreos:
         Wants=cfn-etcd-environment.service
         After=cfn-etcd-environment.service
         After=network.target
+        After=var-lib-etcd2.mount

         [Service]
         Type=oneshot
@@ -158,7 +159,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
+        ExecStartPre=/usr/bin/systemctl is-active var-lib-etcd2.mount
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
@@ -167,9 +168,6 @@ coreos:
         {{end -}}
         TimeoutStartSec=120

-        [Install]
-        WantedBy=cfn-etcd-environment.service
-
     - name: etcdadm-update-status.service
       enable: true
       content: |
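For context on the unit name: systemd derives mount unit names from the mount point by escaping "/" to "-", so the mount for /var/lib/etcd2 is var-lib-etcd2.mount, and depending on it transitively covers format-etcd2-volume.service. On a node this can be checked with:

systemd-escape --path --suffix=mount /var/lib/etcd2    # prints var-lib-etcd2.mount
systemctl list-dependencies var-lib-etcd2.mount
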
mumoshu commented 6 years ago

@iherbmatt Ah, sorry for the late reply! The bad news is that m5 (and also c5) instances aren't supported out of the box yet, as mentioned in #1230.

The good news is that there is a patch, composed of scripts and systemd units, that adapts the NVMe devices to look like legacy devices so that they can be successfully consumed by kube-aws. The patch can be found in the issues linked from #1230.

Please don't hesitate to ask me if you still had trouble on anything.
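For the curious, the general shape of that workaround is a udev rule plus a small helper that symlinks each NVMe EBS device back to a legacy /dev/xvd* name, roughly like the following (the rule and the helper path are illustrative only; the actual patch is in the linked issues):

# /etc/udev/rules.d/90-ebs-nvme.rules (illustrative)
KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", PROGRAM="/opt/bin/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"
# /opt/bin/ebs-nvme-mapping is a hypothetical helper that prints the legacy device
# name (e.g. xvdf) read from the NVMe controller's vendor metadata, so /dev/xvdf appears.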

mumoshu commented 6 years ago

@Confushion Certainly - I realized that my patch wasn't complete at all after seeing your work! Thank you so much for that.

Everyone, @Confushion has kindly contributed #1270 to make etcd bootstrapping even more reliable. It is already merged and will be available in v0.9.10-rc.6 or v0.9.10.

davidmccormick commented 6 years ago

Implementation of my previous suggestion to bring the etcd servers up in parallel on a new cluster build. https://github.com/kubernetes-incubator/kube-aws/pull/1357