kubernetes-retired / kube-aws

[EOL] A command-line tool to declaratively manage Kubernetes clusters on AWS
Apache License 2.0

Unable to create cluster with more than 1 etcd #1206

Closed mludvig closed 6 years ago

mludvig commented 6 years ago

Hi, I've been trying for a few hours to create a cluster with 3 etcd instances but always get a timeout. It looks like the ASG for Etcd0 is created first and its instance keeps trying to connect to the other two etcd instances, but they don't exist yet, so the initialisation times out. If the Etcd1 and Etcd2 ASGs were created in parallel it would probably work, as the instances would start up simultaneously and could connect to each other.

I had the same results both with .etcd.memberIdentityProvider == eip and with eni - in both cases etcd0 tried to connect to the other not-yet-existing nodes, either over EIP or over ENI. In either case it timed out.

I'm using a pre-existing VPC with existing subnets - 3x private with NAT and 3x DMZ with public IPs enabled by default. I tried putting the etcd nodes both in the private and in the DMZ subnets, and both failed when more than 1 node was requested.
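
For reference, a minimal sketch of the kind of etcd configuration involved (field names as used elsewhere in this thread; the subnet names are placeholders, not my actual ones):

etcd:
  count: 3
  memberIdentityProvider: eip   # also tried "eni" - same timeout either way
  subnets:
    - name: PrivateSubnet1
    - name: PrivateSubnet2
    - name: PrivateSubnet3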

steinfletcher commented 6 years ago

Hi, I am also seeing similar behaviour today using both v0.9.8 and v0.9.9.

I have etcd.count: 3 deploying into an existing private subnet. I'm getting this from journalctl on the first etcd node, which is trying to reach the other 2 etcd nodes (which never launch):

Mar 29 19:37:20 ip-x.eu-west-1.compute.internal etcd-wrapper[1467]: 2018-03-29 19:37:20.996256 W | rafthttp: health check for peer b48943dd77f32763 could not connect: dial tcp x.x.x.x:2380: i/o timeout
Mar 29 19:37:20 ip-x.eu-west-1.compute.internal etcd-wrapper[1467]: 2018-03-29 19:37:20.996304 W | rafthttp: health check for peer 62fde287b92dfdf could not connect: dial tcp x.x.x.x:2380: i/o timeout

Looks like the cfn signal is never sent from Etcd0 and the control-plane nested stack fails. From the cfn event log:

Etcd0 | Received 0 SUCCESS signal(s) out of 1. Unable to satisfy 100% MinSuccessfulInstancesPercent requirement

If I set etcd.count: 1 then everything works fine. I am a bit stumped and will continue poking around...

luck02 commented 6 years ago

I'm seeing the same behaviour; we're on v0.9.8...

Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.087296 E | etcdserver: publish error: etcdserver: request timed out
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630920 I | raft: 719e986611adb617 is starting a new election at term 510
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630962 I | raft: 719e986611adb617 became candidate at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630974 I | raft: 719e986611adb617 received MsgVoteResp from 719e986611adb617 at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630984 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to a0d815f4b93422a9 at term 511
Mar 29 21:32:32 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:32.630992 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to f8cabdc7bae4698a at term 511
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430903 I | raft: 719e986611adb617 is starting a new election at term 511
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430945 I | raft: 719e986611adb617 became candidate at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430958 I | raft: 719e986611adb617 received MsgVoteResp from 719e986611adb617 at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430968 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to a0d815f4b93422a9 at term 512
Mar 29 21:32:34 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:34.430977 I | raft: 719e986611adb617 [logterm: 1, index: 3] sent MsgVote request to f8cabdc7bae4698a at term 512
Mar 29 21:32:35 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:35.090013 W | rafthttp: health check for peer a0d815f4b93422a9 could not connect: dial tcp x.y.z.a:2380: i/o timeout
Mar 29 21:32:35 ip-yyyec2.internal etcd-wrapper[1476]: 2018-03-29 21:32:35.091244 W | rafthttp: health check for peer f8cabdc7bae4698a could not connect: dial tcp a.x.y.z:2380: i/o timeout
luck02 commented 6 years ago

It's related to the wait signal. @steinfletcher + @mludvig, try adding this to your cluster.yaml:

waitSignal:
  enabled: false
  maxBatchSize: 1

The relevant template is:

"{{$etcdInstance.LogicalName}}": {
      "Type": "AWS::AutoScaling::AutoScalingGroup",
      "Properties": {
        "HealthCheckGracePeriod": 600,
        "HealthCheckType": "EC2",
        "LaunchConfigurationName": {
          "Ref": "{{$etcdInstance.LaunchConfigurationLogicalName}}"
        },
        "MaxSize": "1",
        "MetricsCollection": [
          {
            "Granularity": "1Minute"
          }
        ],
        "MinSize": "1",
        "Tags": [
          {
            "Key": "kubernetes.io/cluster/{{$.ClusterName}}",
            "PropagateAtLaunch": "true",
            "Value": "true"
          },
          {
            "Key": "Name",
            "PropagateAtLaunch": "true",
            "Value": "{{$.ClusterName}}-{{$.StackName}}-kube-aws-etcd-{{$etcdIndex}}"
          },
          {
            "Key": "kube-aws:role",
            "PropagateAtLaunch": "true",
            "Value": "etcd"
          }
        ],
        "VPCZoneIdentifier": [
          {{$etcdInstance.SubnetRef}}
        ]
      },
      {{if $.WaitSignal.Enabled}}
      "CreationPolicy" : {
        "ResourceSignal" : {
          "Count" : "1",
          "Timeout" : "{{$.Controller.CreateTimeout}}"
        }
      },
      {{end}}
      "UpdatePolicy" : {
        "AutoScalingRollingUpdate" : {
          "MinInstancesInService" : "0",
          "MaxBatchSize" : "1",
          {{if $.WaitSignal.Enabled}}
          "WaitOnResourceSignals" : "true",
          "PauseTime": "{{$.Controller.CreateTimeout}}"
          {{else}}
          "PauseTime": "PT2M"
          {{end}}
        }
      },

I was able to get this working by disabling the signal. The next question is: how did this ever work? Something underlying in the cfn engine must have changed with respect to simultaneous execution.

Here's my etcd log after setting wait to false:

-- Logs begin at Thu 2018-03-29 22:20:06 UTC. --
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started Session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd-logind[780]: New session 1 of user core.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Paths.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Sockets.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Timers.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Basic System.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Reached target Default.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1530]: Startup finished in 23ms.
Mar 29 22:25:21 ip-d.d.d.d.ec2.internal systemd[1]: Started User Manager for UID 500.
Mar 29 22:25:23 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:23.592989 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906323 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:25 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:25.906359 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:28 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:28.593180 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108598 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:31 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:31.108630 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:33 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:33.593363 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310906 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:36 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:36.310939 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:38 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:38.593555 W | rafthttp: health check for peer 596daac612174e37 could not connect: dial tcp d.d.d.d:2380: i/o timeout
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513243 W | etcdserver: failed to reach the peerURL(https://d.d.d.d.compute-1.amazonaws.com:2380) of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.513275 W | etcdserver: cannot get the version of member 596daac612174e37 (Get https://d.d.d.d.compute-1.amazonaws.com:2380/version: dial tcp d.d.d.d:2380: i/o timeout)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901783 I | rafthttp: peer 596daac612174e37 became active
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.901825 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream MsgApp v2 reader)
Mar 29 22:25:41 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:41.902251 I | rafthttp: established a TCP streaming connection with peer 596daac612174e37 (stream Message reader)
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.526774 I | etcdserver: updating the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529138 N | etcdserver/membership: updated the cluster version from 3.0 to 3.2
Mar 29 22:25:45 ip-d.d.d.d.ec2.internal etcd-wrapper[1465]: 2018-03-29 22:25:45.529325 I | etcdserver/api: enabled capabilities for version 3.2
luck02 commented 6 years ago

We've asked our AWS Technical Account Managers to see if the CF team can shed any insight.

The other thing I'm wondering about, though I haven't had a chance to check yet: perhaps the etcd version / image isn't locked down and something changed there? I'll look later this evening when I have time.

steinfletcher commented 6 years ago

Thanks @luck02. "Something underlying in the cfn engine must have changed WRT to simultaneous execution." Yeah I am also suspecting this.

mumoshu commented 6 years ago

Each etcd node has a dedicated ASG which depends on the next etcd node for sequential launch and rolling update, so there should be no simultaneous execution (if that's what you meant).

The first etcd node in your cluster should just start without waiting for any other etcd nodes, as implemented in etcdadm, so in my understanding something like what's reported here shouldn't normally happen.

I have previously troubleshot a case where certain user-provided EC2 tags on etcd nodes confused etcdadm so that it was unable to calculate the correct number of "running etcd nodes", and therefore failed to bootstrap any etcd cluster with more than 1 node.

Can you confirm whether you have bad stackTags in cluster.yaml, and whether omitting them resolves the issue? Thx!

mludvig commented 6 years ago

Hi, thanks for the answer. Nope, I don't have stackTags set:

# AWS Tags for cloudformation stack resources
#stackTags:
#  Name: "Kubernetes"
#  Environment: "Production"
mumoshu commented 6 years ago

@mludvig Thx! Would you mind sharing the result of journalctl -u etcdadm-reconfigure.service from your failing etcd node? A GitHub gist would be nice.

mludvig commented 6 years ago

Here:

ip-10-0-10-151 ~ # journalctl -u etcdadm-reconfigure.service
-- Logs begin at Mon 2018-04-02 08:59:20 UTC, end at Mon 2018-04-02 09:04:37 UTC. --
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal systemd[1]: Starting etcdadm reconfigure runner...
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CACERT="/etc/ssl/certs/etcd-trusted-ca.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CA_FILE="/etc/ssl/certs/etcd-trusted-ca.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CERT="/etc/ssl/certs/etcd-client.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_CERT_FILE="/etc/ssl/certs/etcd-client.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_KEY="/etc/ssl/certs/etcd-client-key.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: declare -x ETCDCTL_KEY_FILE="/etc/ssl/certs/etcd-client-key.pem"
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm ]
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 02 09:00:09 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1376]: pam_unix(sudo:session): session closed for user root
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]:     root : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/bin/[ -w /var/run/coreos/etcdadm/snapshots ]
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]: pam_unix(sudo:session): session opened for user root by (uid=0)
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal sudo[1391]: pam_unix(sudo:session): session closed for user root
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal etcdadm[1370]: panic! etcd data dir "/var/lib/etcd2" does not exist
Apr 02 09:00:10 ip-10-0-10-151.ap-southeast-2.compute.internal systemd[1]: Started etcdadm reconfigure runner.
ip-10-0-10-151 ~ # 

Interestingly /var/lib/etcd2 exists:

ip-10-0-10-151 ~ # find /var/lib/etcd2
/var/lib/etcd2
/var/lib/etcd2/member
/var/lib/etcd2/member/snap
/var/lib/etcd2/member/snap/db
/var/lib/etcd2/member/wal
/var/lib/etcd2/member/wal/0.tmp
/var/lib/etcd2/member/wal/0000000000000000-0000000000000000.wal
/var/lib/etcd2/lost+found
luck02 commented 6 years ago

@mumoshu FWIW, here's our stack tags:

# AWS Tags for cloudformation stack resources
stackTags:
  environment: "{{ stack_env }}"
  project:     "{{ PROJECT_NAME }}"
  owner:       "{{ PROJECT_OWNER }}"

Note that this hasn't changed, and we're using a template to populate it; on execution it would look more like:

# AWS Tags for cloudformation stack resources
stackTags:
  environment: "test"
  project:     "ub-data-infrastructure/cluster"
  owner:       "dataops"

Again, this hasn't changed. I'd be interested in hearing more along the lines of: "I had troubleshooted before that certain user-provided EC2 tags on etcd nodes confused etcadm so that it had been unable to calculate a correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 nodes."

In the meantime, if you'd like to see our etcd logs as requested above, I can provide them as well; I just need to undo the waitSignal change:

waitSignal:
  enabled: false
  maxBatchSize: 1
jcrugzz commented 6 years ago

Yeah, this started happening to me last Friday when I tried to create a new cluster with a basically identical config to a cluster I created a few weeks earlier. Something subtle definitely must have changed. I'm currently afraid to run kube-aws update on any of my clusters, but I need to soon. Can I trust that waitSignal workaround for updating a live prod cluster? Or do I need to think about other options?

I have a hard time thinking it's a stackTags issue in my case, since it was never a problem previously.

How this manifested for me was an "etcdadm-check.service: Failed with result 'exit-code'." error on the first etcd node that tried to come up, preventing anything else from happening.

luck02 commented 6 years ago

@jcrugzz I'm just working my way through some fixes (waitSignal included). I expect to be deploying to our production this evening / tomorrow. I will update with my experiences. I am running into some other issues, but they may be unrelated to this one.

jcrugzz commented 6 years ago

Thanks @luck02 appreciate it!

iherbmatt commented 6 years ago

Hey guys. I disabled the wait signal and it generated all the appropriate machines; however, the masters are no longer healthy. The cluster.yaml file I'm using is one I've been using since 0.9.9 originally came out. Should it work simply by uncommenting waitSignal and setting it to disabled?

luck02 commented 6 years ago

@jcrugzz / everyone else.

I've burned quite a bit of time testing this. I don't think disabling waitSignal is going to be viable. Quite a few of my validation steps start failing randomly. Of course YMMV, but we want to validate that our cluster is healthy at the end, and disabling waitSignal makes that challenging.

I did hear back from our AWS technical account managers. They claim zero changes in the underlying CFN code. They've offered to investigate a failed stack for us, which I'll set up tomorrow morning (PST). I didn't see the etcd container pinned to a specific version, so my next theory is that if the image isn't locked down we could be pulling a different container and seeing drift there (i.e. perhaps it isn't reporting success / failure in the same way, etc.).

I'll continue investigating.

ktateish commented 6 years ago

I'm facing the same problem too, and noticed something: I think the etcdadm-reconfigure unit starts too early during the etcd nodes' boot.

luck02 commented 6 years ago

I just checked and a new version of etcd was released 6 days ago, so presumably it's related.

I'm just cleaning up a semi-related mess and then going to set our etcd version back to what was out a month ago. I'm assuming that's going to solve the issue as well.

I'll report back when I'm done.

The etcd version is set here: https://github.com/kubernetes-incubator/kube-aws/blob/master/core/controlplane/config/templates/cluster.yaml#L648

I'd expect to be testing that this evening / tomorrow.

kylegoch commented 6 years ago

Seeing the exact same behavior as well. We were testing a dev build: we had a known-working cluster.yaml, went to recreate it, and got the same errors as above.

We are using etcd version 3.2.10

Edit: Using @ktateish's patch from above on the userdata made everything work again. Wonder why it broke in the first place.

luck02 commented 6 years ago

@kylegoch so you've pinned your etcd version to v3.2.1, which according to GitHub was built on Jun 23, 2017?

Ok, that's really odd. Something changed... If it wasn't CFN and it wasn't etcd...

I'm going to experiment with pinning my version to something older than last month, just to replicate the issue with a pinned version (previously we weren't pinning the version).

kylegoch commented 6 years ago

We are using 3.2.10 from November. Not sure why that version, but that's what we have always used.

And the cluster.yaml I'm working with right now worked just fine about 10 days ago.

iherbmatt commented 6 years ago

I've been using etcd 3.2.6 and still ran into this issue as well.

mludvig commented 6 years ago

I can confirm that @ktateish's workaround with sleep 60 works for me; I just created a cluster with 3 etcd nodes:

+00:02:57   Controlplane    CREATE_IN_PROGRESS              Etcd0                 
+00:02:57   Controlplane    CREATE_IN_PROGRESS              Etcd0                   "Resource creation Initiated"
+00:06:24   Controlplane    CREATE_IN_PROGRESS              Etcd0                   "Received SUCCESS signal with UniqueId i-0b5da874acdc0e7bb"
+00:06:25   Controlplane    CREATE_COMPLETE                 Etcd0                 
+00:06:30   Controlplane    CREATE_IN_PROGRESS              Etcd1                 
+00:06:31   Controlplane    CREATE_IN_PROGRESS              Etcd1                   "Resource creation Initiated"
+00:09:56   Controlplane    CREATE_IN_PROGRESS              Etcd1                   "Received SUCCESS signal with UniqueId i-0be602b3afcacc247"
+00:09:58   Controlplane    CREATE_COMPLETE                 Etcd1                 
+00:10:02   Controlplane    CREATE_IN_PROGRESS              Etcd2                 
+00:10:03   Controlplane    CREATE_IN_PROGRESS              Etcd2                   "Resource creation Initiated"
+00:12:47   Controlplane    CREATE_IN_PROGRESS              Etcd2                   "Received SUCCESS signal with UniqueId i-097a60f76baa844f7"
+00:12:48   Controlplane    CREATE_COMPLETE                 Etcd2                 
luck02 commented 6 years ago

Now I'm wondering whether the version specified in cluster.yaml is actually effective. I just added this to my cluster.yaml config:

etcd:
  # etc
  version: 3.3.1

but when I log into the etcd from my failed cluster I get:

core@ip-x-y-z-etc ~ $ etcdctl version
etcdctl version: 3.2.15
API version: 3.2

3.2.15 was built in January, and I see it's a failed cluster, so presumably that's the end of the line for this enquiry. I'll do the sleep workaround for now.

iherbmatt commented 6 years ago

Hi Gary,

Are you baking that into a build? Or what would typically be the best way to make this change manually? Should we be doing this manually?

Thank you!


luck02 commented 6 years ago

@iherbmatt it depends on your setup. For us it's a bit complicated, and the easiest way for me to do this is to apply a hotfix to the kube-aws source code and build myself a new hotfix version. But that's because our deployment pipeline doesn't leave the artifacts around for me to jury-rig locally. We do have some pipeline stuff I could jury-rig to apply the fix, but it's really ugly (ansible - lineinfile - regex, etc.).

iherbmatt commented 6 years ago

Your fix for the iops seemed to work well by cherry-picking - not sure if that's what you mean by hotfix. I haven't been able to build clusters in over a week, so I'm desperate and really appreciate your looking into this :)


luck02 commented 6 years ago

That's correct. In this case there's no commit to cherry-pick, but applying a diff amounts to the same thing. I'm running off v0.9.8, so this is the patch I applied:

commit 19ad26bd147ec9882dfb7e67f5aa854a331cf2cd (HEAD -> v0.9.8-hotfix4, tag: v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Wed Apr 4 16:03:06 2018 -0700

    more fixii

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index 2d25d487..c8ae763b 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -166,6 +166,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStart=/opt/bin/etcdadm member_status_set_started
         {{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
         ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer

Mind you, my new stack isn't up yet.

luck02 commented 6 years ago

Goddamn it, I put the 'fix' in the wrong stanza (the update service instead of reconfigure).

I'll try again this eve.

luck02 commented 6 years ago

Applied this:

commit 4d6a8b89431828638a5414a5a73b4404c58514e9 (HEAD -> v0.9.8-hotfix5, tag: v0.9.8-hotfix5, v0.9.8-hotfix4)
Author: Gary Lucas <gary.lucas@unbounce.com>
Date:   Wed Apr 4 16:46:36 2018 -0700

    moved the sleep command

diff --git a/core/controlplane/config/templates/cloud-config-etcd b/core/controlplane/config/templates/cloud-config-etcd
index c8ae763b..e85ca23c 100644
--- a/core/controlplane/config/templates/cloud-config-etcd
+++ b/core/controlplane/config/templates/cloud-config-etcd
@@ -140,6 +140,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/sleep 60
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
@@ -166,7 +167,6 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
-        ExecStartPre=/usr/bin/sleep 60
         ExecStart=/opt/bin/etcdadm member_status_set_started
         {{if .Etcd.Snapshot.IsAutomatedForEtcdVersion .Etcd.Version -}}
         ExecStartPost=/usr/bin/systemctl start etcdadm-save.timer
(END)
iherbmatt commented 6 years ago

I'm currently using 0.9.9. This won't make its way into that version, huh?


luck02 commented 6 years ago

It doesn't sound like kube-aws is doing hotfix releases. Maybe in the future, but yeah, I think for the moment we're either building our own hotfixes or upgrading when the new release comes out.

iherbmatt commented 6 years ago

What would be your best recommendation for implementing this fix on 0.9.9?


mludvig commented 6 years ago

@iherbmatt you can apply the fix to userdata/cloud-config-etcd after you run kube-aws render stack:

kube98 $ kube-aws render stack
kube98 $ patch -p1 < etcd-fix.diff
kube98 $ kube-aws up

No kube-aws source patching needed. That's what I do :)

etcd-fix.diff.txt

luck02 commented 6 years ago

+1 on the above suggestion. My case is a bit more complicated.

mumoshu commented 6 years ago

@luck02

I'd be interested in hearing more along the lines of: "I had troubleshooted before that certain user-provided EC2 tags on etcd nodes confused etcadm so that it had been unable to calculate a correct number of "running etcd nodes", and therefore it failed to bootstrap any etcd cluster with more than 1 nodes."

What I wrote in this issue would help.

In a nutshell, adding a tag named KubernetesCluster to EC2 instances via stackTags in cluster.yaml causes kube-aws to miscalculate how many nodes are required before a quorum is met.

etcdadm decides whether an etcd systemd service should wait until etcd notifies about its readiness or not. The decision is made according to the remaining number of etcd nodes required for quorum. Therefore, the KubernetesCluster tag in stackTags effectively prevented any etcd cluster with more than 2 nodes from ever coming up successfully.
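
To illustrate (a hypothetical sketch based on the description above; only the KubernetesCluster key is the problematic one, the other values are placeholders):

# Hypothetical example - do NOT add a KubernetesCluster tag via stackTags,
# since etcdadm relies on that tag when counting "running etcd nodes".
stackTags:
  KubernetesCluster: "my-cluster"   # this key confuses etcdadm's node counting
  environment: "test"               # ordinary tags like this are fine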

I initially guessed this was the cause of the issues you're seeing, but it turns out not to be, after reading through all your comments.


Anyway, if adding sleep helps stabilize the bootstrapping of etcd clusters, would it also help if you used a larger EC2 instance type for etcd nodes?

In my environment, t2.medium has been the minimum requirement for etcd nodes, but only for dev purposes.
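
If anyone wants to try that, a minimal sketch of the relevant cluster.yaml change (assuming the usual etcd block layout; m5.large is just an example of a larger instance type):

etcd:
  count: 3
  instanceType: m5.large   # example only - something larger than t2.medium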

luck02 commented 6 years ago

Just validated that with sleep 60 in the etcd config my cluster comes up.

Another +1 for that fix. I'll read more of @mumoshu's comment above this evening.

kylegoch commented 6 years ago

@mumoshu we don't have a KubernetesCluster tag; however, we were using t2.medium instances. I'll test again without the sleep, but with a bigger node size.

kylegoch commented 6 years ago

Removing the sleep 60 and switching to m5.large instance types did not solve the issue.

mumoshu commented 6 years ago

@luck02 @kylegoch @ktateish Thank you so much for your reports!

Probably, then, the root cause here is that we're running etcdadm-reconfigure too early, so it fails in the middle of the process. An incomplete etcdadm-reconfigure results in the appropriate systemd unit type (simple or notify) not being set - it must be simple for the first etcd node. Otherwise your first etcd node will never become ready, waiting forever for the other, nonexistent etcd nodes.

And the reconfigure failure seems to be occurring because etcdctl is unable to run against an unformatted /var/lib/etcd2. I believe some etcdctl operation requires access to the etcd data dir.

Not sure why this had been working for months and then turned ill recently though.

Assuming my assumptions are correct, would it also solve your issue if you added ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service instead of sleep 60 to the etcdadm-reconfigure.service unit, like below?

diff --git a/userdata/cloud-config-etcd b/userdata/cloud-config-etcd
index cf306f6..f613337 100644
--- a/userdata/cloud-config-etcd
+++ b/userdata/cloud-config-etcd
@@ -156,6 +156,7 @@ coreos:
         RestartSec=5
         EnvironmentFile=-/etc/etcd-environment
         EnvironmentFile=-/var/run/coreos/etcdadm-environment
+        ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service
         ExecStartPre=/usr/bin/systemctl is-active cfn-etcd-environment.service
         ExecStartPre=/usr/bin/mkdir -p /var/run/coreos/etcdadm/snapshots
         ExecStart=/opt/bin/etcdadm reconfigure
mumoshu commented 6 years ago

@luck02 @kylegoch @ktateish @iherbmatt Just curious, but have you been enabling etcd disaster recovery, like:

etcd:
  snapshot:
    automated: true
  disasterRecovery:
    automated: true

?

If that's the case, what @davidmccormick has seen in #1219 might be caused by the same issue.

kylegoch commented 6 years ago

I had those options set previously (when it was working), and they are still set now in the working cluster I spun up with sleep 60.

luck02 commented 6 years ago

Our etcd config is really simple:

etcd:
  count: {{ etcd_count }}
  subnets:
    - name: PrivateInstance1
    - name: PrivateInstance2
    - name: PrivateInstance3
  securityGroupIds:
    - "{{ bastion_ssh_access_sg_id }}"
  memberIdentityProvider: eip
  dataVolume:
    encrypted: true

So pretty simple and straightforward (other than the templating, but kube-aws never sees that).

ktateish commented 6 years ago

The ExecStartPre=/usr/bin/systemctl is-active format-etcd2-volume.service patch works fine for me. I tried it twice and both clusters successfully came up. Thanks!


The snapshot / disasterRecovery options are not set (i.e. commented out) in my cluster.yaml:

#  snapshot:
#    # Set to true to periodically take an etcd snapshot
#    # Beware that this can be enabled only for etcd 3+
#    # Please carefully test if it works as you've expected when being enabled for your production clusters
#    automated: false
#
#  disasterRecovery:
#    # Set to true to automatically execute a disaster-recovery process whenever etcd node(s) seemed to be broken for a while
#    # Beware that this can be enabled only for etcd 3+
#    # Please carefully test if it works as you've expected when being enabled for your production clusters
#    automated: false
davidmccormick commented 6 years ago

Hi, I made it work by changing the type of the etcd-member service from notify (which is the CoreOS default) to simple. The reason for the blocking, as I see it, is that the cfn-signal service has an ExecStartPre check that 'etcd-member' is in an active state:

[Unit]
Wants=etcd-member.service
After=etcd-member.service

[Service]
Type=simple
Restart=on-failure
RestartSec=10

EnvironmentFile=/var/run/coreos/etcd-node.env
ExecStartPre=/usr/bin/systemctl is-active etcd-member.service
ExecStartPre=/usr/bin/rkt fetch quay.io/coreos/awscli:master
ExecStart=-/opt/bin/cfn-signal

And etcd-member does not send its notification because the cluster is not healthy (there are 2 nodes missing, not yet started), so it is not in an active state. Changing the type of etcd-member to simple by patching the /etc/systemd/system/etcd-member.service.d/20-aws-cluster.conf file seems to resolve this dependency loop (because systemd considers the service 'active' as soon as the rkt container is starting).
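
For anyone who wants to try this, a minimal sketch of that kind of override (the exact contents are my assumption; you can either edit 20-aws-cluster.conf directly as described, or add a later drop-in that overrides the type):

# /etc/systemd/system/etcd-member.service.d/30-type-simple.conf (hypothetical drop-in)
[Service]
# Treat the unit as active as soon as the process starts, instead of waiting for sd_notify
Type=simple

followed by systemctl daemon-reload and a restart of etcd-member.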

mumoshu commented 6 years ago

@ktateish Glad it worked for you! And thanks a lot for providing the initial patch. It really helped me understand what is going on, together with all the reports and the other patch from @luck02, @kylegoch, @iherbmatt and @mludvig!

Still waiting for responses from others to confirm whether my alternative patch works. It would be another proof of my assumption.

Also, I'm happy to accept a PR for my suggested patch if anyone is interested! I'm currently away from my laptop…

mumoshu commented 6 years ago

@davidmccormick Thx for sharing your hard-won fix! I guess your patch works regardless of whether you enable disaster recovery with auto-snapshotting or not?

And it would fix the issue by unblocking the first etcd member, by setting the type to simple by default. However, my original intention in dynamically setting the etcd-member unit type to notify for the second and following nodes was to ensure that those etcd members are actually able to join the cluster.

Just forcing it to simple may result in a successful cfn-signal even though the second etcd member was unable to join the cluster, right?

davidmccormick commented 6 years ago

@mumoshu Hi, yes, this would work regardless of enabling disaster recovery. And yes, forcing it to simple would allow the cfn-signal to proceed in the event that the second node can't join the cluster.

What might make more sense is to deploy all 3 (n) at once when you perform a fresh cluster install, but only roll them in one by one when upgrading - I'm not all that familiar with CloudFormation, but I think I might have seen the controllers behaving this way? That way quorum can be achieved before the cfn-signal is sent. In a fresh install I would personally also bring up the controllers and nodes without waiting.

Also, if etcdadm-check is going to do the reconfiguring of the service type, then it should not itself depend on etcd-member being active (same problem - it can't start until the etcd cluster is up and ready to serve content).

ktateish commented 6 years ago

@mumoshu I'll write a PR.

davidmccormick commented 6 years ago

Isn't having a service reconfigure the type of the etcd service a lot of added complexity? Isn't the point of the disasterRecovery option that it can recover nodes that have failed to become part of the etcd cluster? I would rather it be left as notify, but with all etcd nodes created in parallel initially. What do you think?

ktateish commented 6 years ago

Oh, I missed something. I thought it needed to be fixed at etcdadm-reconfigure startup in addition to your patch, but your patch alone has the same effect in a nicer way. Am I right?