Hi Everyone,
I'm seeing this now:
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
member [omitted] is healthy: got healthy result from https://[omitted].compute.amazonaws.com:2379
cluster is healthy
It appears etcd is healthy, and I'm seeing this in the controller logs as well. I'm having trouble getting the controllers to come up now, however. I'm going to try to build it again and see what happens.
I applied this:
commit 65722a891eca5e8a5ff9538e2837d7bbeb84390f (HEAD -> unbounce-v0.9.8, tag: v0.9.8-hotfix6, origin/unbounce-v0.9.8, v0.9.8-hotfix6, v0.9.8-hotfix5)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date: Thu Apr 5 08:40:40 2018 -0700
trying mumoshis fix
diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index e85ca23c..b8a56949 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,7 +140,7 @@ coreos:
RestartSec=5
EnvironmentFile=-/etc/etcd-environment
EnvironmentFile=-/var/run/coreos/etcdadm-environment
- ExecStartPre=/usr/bin/sleep 60
+ ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
ExecStart=/opt/bin/etcdadm reconfigure
Cluster came up, I'm happy :D
I wonder if it has something to do with the fact that I'm using 0.9.9 instead of 0.9.8. The etcd cluster comes up fine, but my controllers now don't come online, even though the instances are built.
Here is the output I'm seeing loop in journalctl from the controllers:
@iherbmatt Hi! Kubelet seems fine to me. Can you share the full output from journalctl, rather than kubelet's log only?
@davidmccormick
Isn't the point of the disasterRecovery option that it can recover nodes that have failed to be a part of the etcd cluster?
Partially yes, and partially no? I guess you may be confusing two things. Generally there are two major categories of failure cases: transient and permanent failures of etcd node(s).
A transient failure is when the underlying EC2 instance fails due to an AWS infrastructure issue. In this case, the ASG just recreates the EC2 instance to resolve the issue. Suppose you have a 3-node etcd cluster: you may notice that you now have 3 ASGs in total, each matching one etcd node/member. We also have a pool of EIP+EBS pairs from which each etcd member borrows its identity and datadir.
A permanent failure is when, for example, the EBS volume serving the etcd datadir becomes corrupted, so that you have to recover the etcd member from an etcd snapshot (not an EBS snapshot).
etcd.disasterRecovery.automated and etcd.snapshot.automated are for the latter case. And AFAICS, we have no simpler way to do that. Just marking every etcd-member type to simple results in losing
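For reference, these are the cluster.yaml knobs being discussed - a minimal sketch with illustrative values (only the option names etcd.snapshot.automated, etcd.disasterRecovery.automated, and etcd.memberIdentityProvider appear in this thread; the remaining keys and values are assumptions):

etcd:
  count: 3
  memberIdentityProvider: eip
  snapshot:
    automated: true
  disasterRecovery:
    automated: true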
That being said,
Isn't having a service reconfigure the type of etcd service a lot of added complexity?
Definitely. I'm open to ideas for setting type to notify statically while somehow still covering the use-cases of DependsOn from the previous to the next etcd ASG (= node). DependsOn requires us to provision the etcd ASGs one by one, so we have to set type to simple for the first N/2 etcd ASGs - the first members cannot reach quorum until the later members exist, so a notify-type unit would never report ready and the ASG would never signal success. See the sketch below for the type distinction.
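For clarity, here is the simple vs. notify distinction as a minimal systemd sketch. This is illustrative only - the ExecStart line and the surrounding unit are assumptions, not the actual kube-aws template:

[Service]
# Type=simple: the unit counts as started as soon as the process is forked,
# whether or not the member ever joins a quorum, so cfn-signal/DependsOn
# ordering does not wait for a healthy member.
# Type=notify: the unit only counts as started once etcd reports readiness
# via sd_notify (READY=1), i.e. once the member is actually serving, so
# anything ordered after it waits for a healthy member.
Type=notify
ExecStart=/usr/bin/etcd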
@davidmccormick
What might make more sense is to deploy all 3 (n) at once when you perform a fresh cluster install but only roll in one-by-one when upgrading
Good point! This is what I gave up when I first implemented the H/A etcd about a year ago. It may be time to consider alternative implementations or possible enhancements.
- I'm not all that familiar with cloud-formation but I think I might have seen the controllers behaving in this way?
Did you mean kube-aws controller nodes? Then yes, controller nodes are behaving that way - there's a single multi-AZ ASG managing the desired number of controller EC2 instances.
Implementation-wise, we can't do the same for etcd nodes though. We have to give each etcd node a stable network identity plus an EBS volume, and an EBS volume is tied to a single AZ. What if we had a single 3-AZ ASG for 3 etcd nodes, with 3 EBS volumes each tied to a separate AZ, and then one of the AZs failed? The ASG would try to launch a replacement etcd node in one of the 2 available AZs, where the EBS volume holding the original etcd data doesn't exist! In that sense, I believe we have to get along with the 1-etcd-asg-per-az pattern.
But anyway,
This way quorum can be achieved before the cfn-signal is sent. In a fresh install I would personally also bring up the controllers and nodes without waiting too.
This should be discussed further. How about just omitting DependsOn on the etcd ASGs for the initial bootstrap via kube-aws up, and then adding DependsOn on the subsequent kube-aws update run? Would that actually result in a rolling update of the etcd ASGs via the newly added DependsOn?
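For context, this is the kind of CloudFormation wiring being discussed. A rough sketch only - resource names and properties are made up, not taken from the actual kube-aws stack template:

Resources:
  Etcd0ASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "1"
      MaxSize: "1"
      # launch configuration, subnets, CreationPolicy/cfn-signal omitted
  Etcd1ASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: Etcd0ASG  # only created after Etcd0ASG has completed (i.e. signalled success)
    Properties:
      MinSize: "1"
      MaxSize: "1"
  Etcd2ASG:
    Type: AWS::AutoScaling::AutoScalingGroup
    DependsOn: Etcd1ASG
    Properties:
      MinSize: "1"
      MaxSize: "1"

Dropping the DependsOn attributes would let all three ASGs be created in parallel on kube-aws up, which is what the parallel-bootstrap idea amounts to.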
@mumoshu I was really excited to see my etcd nodes build successfully. I even logged in and saw they were all healthy, but then I saw the same CloudFormation timeouts on the controllers. I will redact some identifying data from the journalctl log and attach it. Thank you for the time in advance :)
@iherbmatt Thanks!
If I could ask for more, sharing your cluster.yaml would also help! I know cluster bootstrapping shouldn't be such an exciting and hard thing to do, but there are certainly many failure cases, some of which can be pinpointed just by looking at your cluster.yaml.
@mumoshu Here is the cluster.yaml file. cluster-yaml.txt
@mumoshu Here is the journalctl log from the controllers that would not start up. journalctl-redacted.log
For me the issue starts with CoreOS 1688.5.3, released in April. The previous version (1632.3.0, released February 15, 2018) is not affected.
With the patch from @mumoshu the etcd nodes get updated fine on CoreOS 1688.5.3. However, the controllers don't, and the stack rolls back.
@mumoshu Any thoughts?
@VinceMD Are you unable to build clusters as well?
@iherbmatt I had the same problem while testing the proposed fix because I changed the cluster name in cluster.yml but the certificates were still for the old name. That led to exactly the same issue that you observe - after creating the etcd nodes the controllers failed to build. Removing credentials/ and recreating the certs fixed it.
@iherbmatt Correct, with the latest version of CoreOS. If I use the Feb release of the AMI, then all is fine. A colleague of mine is also facing the same issue.
I wish that change would work for me.
It just sits there and eventually times out when trying to build the controllers. I even used CoreOS-stable-1632.3.0-hvm (ami-862140e9).
It's been almost 2 weeks that I've been unable to build clusters :(
@iherbmatt Sorry for the trouble! Your etcd seems fine. But from the logs I see the Calico installer is complaining.
Perhaps you are hit by the recent regression in master? Would you mind trying with kube-aws v0.9.10-rc.3? If it still doesn't work, trying k8s 1.9.3, which is the default in 0.9.10-rc.3, may change something.
Hi @mumoshu. I was able to generate a cluster with 0.9.10-rc.3, but it had to be running version 1.9.3, otherwise it wouldn't work. Another issue I have, however, is that I cannot use m5s for the etcd nodes. Any reason you can think of that might explain why? Thanks!
Hi @mumoshu
Seems you were right about etcdadm-reconfigure.service wanting a formatted /var/lib/etcd2.
However, your fix seemed not to wait for that service to be active, but to fail when it is not active yet... So the timeouts were still happening, unfortunately.
The patch below fixes this by actually depending on the var-lib-etcd2.mount unit (which is the one it should depend on, and which in turn depends on format-etcd2-volume.service anyway...). Also, the WantedBy line wasn't doing anything useful AFAIK...
Thanks.
diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index fc077436..a291fdbf 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -151,6 +151,7 @@ coreos:
Wants=cfn-etcd-environment.service
After=cfn-etcd-environment.service
After=network.target
+ After=var-lib-etcd2.mount
[Service]
Type=oneshot
@@ -158,7 +159,7 @@ coreos:
RestartSec=5
EnvironmentFile=-/etc/etcd-environment
EnvironmentFile=-/var/run/coreos/etcdadm-environment
- ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
+ ExecStartPre=/usr/bin/systemctl is-active var-lib-etcd2.mount
ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
ExecStart=/opt/bin/etcdadm reconfigure
@@ -167,9 +168,6 @@ coreos:
{{end -}}
TimeoutStartSec=120
- [Install]
- WantedBy=cfn-etcd-environment.service
-
- name: etcdadm-update-status.service
enable: true
content: |
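For reference, var-lib-etcd2.mount is the systemd mount unit for /var/lib/etcd2 (systemd derives mount unit names from the mount path), so depending on it means the volume is both formatted and actually mounted. A minimal sketch of what such a unit contains - the device and filesystem type here are assumptions, not taken from the actual template:

[Unit]
Requires=format-etcd2-volume.service
After=format-etcd2-volume.service

[Mount]
What=/dev/xvdf
Where=/var/lib/etcd2
Type=ext4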
@iherbmatt Ah, sorry for the late reply! The bad news is that m5 and also c5 instances aren't supported out of the box yet, as mentioned in #1230.
The good news is that there is a patch composed of scripts and systemd units that adapts the NVMe devices to look like legacy devices so that they can be successfully consumed by kube-aws. The patch can be found in the issues linked from #1230.
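For a rough idea of the approach (illustrative only - the helper script name is an assumption, and the real patch lives in the linked issues): a udev rule symlinks each NVMe-attached EBS device back to the /dev/xvdX name kube-aws expects, using a helper that reads the original block-device mapping name from the NVMe controller's vendor data:

# /etc/udev/rules.d/90-ebs-nvme.rules (sketch)
# %k is the kernel device name, %c is the output of PROGRAM
KERNEL=="nvme[0-9]*n[0-9]*", SUBSYSTEM=="block", PROGRAM="/opt/bin/ebs-nvme-mapping /dev/%k", SYMLINK+="%c"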
Please don't hesitate to ask me if you still have trouble with anything.
@Confushion Certainly - I realized that my patch wasn't complete at all after seeing your work! Thank you so much for that.
Everyone, @Confushion has kindly contributed #1270 to make etcd bootstrapping even more reliable. It is already merged and will be available in v0.9.10-rc.6 or v0.9.10.
Implementation of my previous suggestion to bring the etcd servers up in parallel on a new cluster build. https://github.com/kubernetes-incubator/kube-aws/pull/1357
Hi, I've been trying for a few hours to create a cluster with 3 etcd instances but always get a timeout. It looks like the ASG for Etcd0 is created first, and its instance keeps trying to connect to the other two etcd instances, but they do not yet exist, so the initialisation times out. If the Etcd1 and Etcd2 ASGs were created in parallel it would probably work, as the instances would start up simultaneously and could connect to each other.
I had the same results both with .etcd.memberIdentityProvider == eip and with eni - in both cases etcd0 tried to connect to the other not-yet-existing nodes, either over EIP or over ENI. In either case it timed out.
I'm using a pre-existing VPC with existing subnets - 3x private with NAT and 3x DMZ with public IP enabled by default. I tried to put the etcd nodes both in the private subnets and in the DMZ, and both failed when more than 1 node was requested.