kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

etcd-manager seems to pick up incorrect etcdv3 version #6990

Closed. olemarkus closed this issue 4 years ago.

olemarkus commented 5 years ago

1. What kops version are you running? The command kops version will display this information.

kops 1.12

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.

1.12.7

3. What cloud provider are you using?

AWS

4. What commands did you run? What is the simplest way to reproduce this issue?

First created a cluster with etcd 3.1.11. Then, per the upgrade docs, migrated to 3.2.18 and confirmed the etcd pods were indeed running 3.2.18. Rolled the masters fast, following the guide.
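For anyone reproducing this, one generic way to confirm which version the etcd pods are actually running (a rough sketch, assuming kubectl access to the cluster; not tied to any particular kops label scheme):

```
# Print pod name and image(s) for everything in kube-system, then filter for etcd
# to see the image tags the etcd pods are running.
kubectl -n kube-system get pods \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].image}{"\n"}{end}' \
  | grep etcd
```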

5. What happened after the commands executed?

The masters did not come online. In /var/log/etcd.log I see entries like this:


I0515 10:04:21.595477    3021 etcdserver.go:553] overriding clientURLs with [http://etcd-a.internal.foo.bar:4001] (state had [http://0.0.0.0:4001])
I0515 10:04:21.595491    3021 etcdserver.go:557] overriding quarantinedClientURLs with [http://etcd-a.internal.foo.bar:3994] (state had [http://0.0.0.0:3994])
W0515 10:04:21.595499    3021 pki.go:46] not generating peer keypair as peers-ca not set
W0515 10:04:21.595504    3021 pki.go:84] not generating client keypair as clients-ca not set
W0515 10:04:21.595527    3021 etcdserver.go:92] error running etcd: unknown etcd version v3.1.11: not found in [/opt/etcd-v3.1.11-linux-amd64]

Again, I confirm that the previous version running was 3.2.18. This is also mentioned in older etcd.log files.

The masters come back online when changing from etcd-manager to legacy.

6. What did you expect to happen?

I expected whatever does the version detection to understand that 3.2.18 is in use and migrate to etcd-manager accordingly.

justinsb commented 5 years ago

Hmmm ... sorry about the issue and thanks for reporting. So the etcd pods were running 3.2.18, but somehow we found etcd 3.1.11? Which instructions did you follow for this?

Were you running etcd-backup with 3.1.11? I'm wondering if we picked up an old backup (we shouldn't have)?

I'm going to look at whether we can use 3.1.12 to restore the backup (which is in the image), or maybe even just including 3.1.11 (because it looks like it was recommended for 1.9). But I'm worried that we're trying to restore an old backup, which is a bigger problem...

Another possibility - did you try using etcd-manager in the past, and then switch it off (with this cluster)? Maybe we're finding that...

olemarkus commented 5 years ago

We have not tried to use etcd-manager on this exact cluster before, but we did try etcd-manager on a cluster with the exact same name. That was some time ago and the cluster has since been reprovisioned. So it could be that the kops deletion did not delete the backup?

That being said, we have a lot of clusters running 3.1.11 because of the old recommendation so supporting a direct upgrade path from 3.1.11 would be much appreciated.

justinsb commented 5 years ago

I can't rule out that we are looking at the backup, but we shouldn't be, and I'd be really surprised given that it requires a "restore backup command" to be in place (though maybe that command was left in place from before...). If you're able to get me some more of the logs I might be able to tell.

https://github.com/kopeio/etcd-manager/blob/master/pkg/legacy/adopt.go#L51

That is the code path we're supposed to go down when first introducing etcd-manager - we look at the manifest on-disk (from the EBS volume) and extract the etcd-version from that. If you could SSH to one of the masters and get that manifest (at least the image bit!) from the EBS volume mount, that would be super helpful also.
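For reference, something along these lines should pull that out on a master (a sketch; the mount and manifest paths follow the adopt.go log lines quoted later in this thread):

```
# Print the image lines from the legacy etcd manifests stored on the mounted
# etcd EBS volumes - this is where the adopt code path reads the etcd version from.
sudo grep -H "image:" /mnt/master-vol-*/k8s.io/manifests/*.manifest
```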

I am looking at supporting using etcd 3.1.12 to read backups created with 3.1.11 ( https://github.com/kopeio/etcd-manager/pull/209 ). My real concern there is that this 3.1.11 backup is not what you want to restore.

olemarkus commented 5 years ago

I have some snippets around. Starting etcd-manager on a fresh node, I see a log line like this:

I0515 15:07:37.456694    3819 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"http://etcd-a.internal.foo.bar:2380" client_urls:"http://0.0.0.0:4001" quarantined_client_urls:"http://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"http://etcd-b.internal.foo.bar2380" client_urls:"http://0.0.0.0:4001" quarantined_client_urls:"http://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"http://etcd-c.internal.foo.bar:2380" client_urls:"http://0.0.0.0:4001" quarantined_client_urls:"http://0.0.0.0:3994" > > etcd_version:"3.1.11" 

So I got curious where it got this data from, and it comes from the state file in the etcd volume. I moved this file somewhere else on all masters and rolled all the masters. The masters then came up with etcd 3.2.18 running.

The data in the cluster is definitely fresh from just before we tried the upgrade. Before moving that state file I had a look at the S3 bucket and there were no backups there, only the control folder. I don't think it has been trying to restore anything from backup.
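For anyone else checking the same thing, these are roughly the two places to look (a sketch; the state file location and S3 layout follow the paths in the logs in this thread, and the bucket/cluster names are placeholders):

```
# On a master: the etcd-manager state file records the etcd version it will start.
# It is not plain text, so use strings to peek at what is recorded.
sudo strings /mnt/master-vol-*/state

# In S3: check whether etcd-manager has any backups, or only the control/ folder.
aws s3 ls s3://<your-backup-bucket>/<cluster-name>/backups/etcd/main/ --recursive
```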

wavemoran commented 5 years ago

I'm seeing a similar error doing a 1.11 > 1.12.1 upgrade.

I0520 19:23:15.459382    5708 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0"
I0520 19:23:15.459470    5708 etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-0bb0a3db96a56648c/data/etcd-cluster-token-etcd
I0520 19:23:15.459516    5708 etcdserver.go:553] overriding clientURLs with [https://etcd-c.foo.bar:4001] (state had [https://0.0.0.0:4001])
I0520 19:23:15.459530    5708 etcdserver.go:557] overriding quarantinedClientURLs with [https://etcd-c.foo.bar:3994] (state had [https://0.0.0.0:3994])
I0520 19:23:15.461461    5708 pki.go:39] generating peer keypair for etcd: {CommonName:etcd-c Organization:[] AltNames:{DNSNames:[etcd-c.internal.dev-kops01.mysmithrx.com] IPs:[127.0.0.1]} Usages:[2 1]}
I0520 19:23:15.464632    5708 pki.go:77] building client-serving certificate: {CommonName:etcd-c Organization:[] AltNames:{DNSNames:[etcd-c.internal.foo.bar etcd-c.foo.bar] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
W0520 19:23:15.784131    5708 etcdserver.go:92] error running etcd: unknown etcd version v3.2.18-0: not found in [/opt/etcd-v3.2.18-0-linux-amd64]

The difference in my case is that this cluster has always been on etcd 3.2.18-0; it has never run any other version. I tried moving my state file as @olemarkus suggested, but that resulted in a different error:

I0521 00:45:16.751546   22616 adopt.go:97] parsing legacy manifest /rootfs/mnt/master-vol-080c99d57c410f64c/k8s.io/manifests/etcd-events.manifest
I0521 00:45:16.751958   22616 adopt.go:209] copying etcd data from /rootfs/mnt/master-vol-080c99d57c410f64c/var/etcd/data-events -> /rootfs/mnt/master-vol-080c99d57c410f64c/data/etcd-cluster-token-etcd-events
error initializing etcd server: error reading source directory /rootfs/mnt/master-vol-080c99d57c410f64c/var/etcd/data-events: open /rootfs/mnt/master-vol-080c99d57c410f64c/var/etcd/data-events: no such file or directory

olemarkus commented 5 years ago

Same issue running with kops 1.12.1.

The cluster was initially started with 3.1.11. We first upgraded to 3.1.12, then upgraded the cluster, rolling all masters at the same time.

Still errors with

W0521 11:23:30.588313    3255 etcdserver.go:92] error running etcd: unknown etcd version v3.1.11: not found in [/opt/etcd-v3.1.11-linux-amd64]

olemarkus commented 5 years ago

If the cluster runs with 3.1.11, I get a similar message to the one above, but also

I0521 11:32:07.141850    3617 controller.go:369] spec member_count:3 etcd_version:"3.1.11" 
W0521 11:32:07.141921    3617 controller.go:1008] we don't support etcd version requested, won't assume forward compatability: member_count:3 etcd_version:"3.1.11" 
W0521 11:32:07.141932    3617 controller.go:149] unexpected error running etcd cluster reconciliation loop: can't act as leader for unknown etcd version "3.1.11"

wavemoran commented 5 years ago

Some additional info for you, @justinsb:

  1. I rolled back my cluster to 1.11.9 and etcd version 3.2.18-0 - everything was back up and working as expected.
  2. Upgraded my etcd version to 3.2.24 and rolled my masters - the cluster came up fine.
  3. Set my etcd clusters to use provider: Manager per the gradual updates doc.

I get the same error about etcd-manager trying to grab the older 3.2.18-0 version.

Here's a fuller set of logs:

I0523 22:57:32.231490    5442 controller.go:173] starting controller iteration
I0523 22:57:32.231522    5442 controller.go:269] I am leader with token "CwS52JU9zpqMgqs4qO0IqQ"
W0523 22:57:37.234421    5442 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-b" endpoints:"10.12.2.248:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://etcd-b.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-b.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }: error building etcd client for https://etcd-b.internal.foo.bar:4001: dial tcp 10.12.2.248:4001: connect: connection refused
I0523 22:57:39.518739    5442 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0"
I0523 22:57:39.518811    5442 etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-0b639eae2b4e5bad2/data/etcd-cluster-token-etcd
I0523 22:57:39.518978    5442 etcdserver.go:553] overriding clientURLs with [https://etcd-a.internal.foo.bar:4001] (state had [https://0.0.0.0:4001])
I0523 22:57:39.518994    5442 etcdserver.go:557] overriding quarantinedClientURLs with [https://etcd-a.internal.foo.bar:3994] (state had [https://0.0.0.0:3994])
I0523 22:57:39.519158    5442 pki.go:39] generating peer keypair for etcd: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar] IPs:[127.0.0.1]} Usages:[2 1]}
I0523 22:57:39.519486    5442 pki.go:77] building client-serving certificate: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar etcd-a.internal.foo.bar] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
W0523 22:57:39.628034    5442 etcdserver.go:92] error running etcd: unknown etcd version v3.2.18-0: not found in [/opt/etcd-v3.2.18-0-linux-amd64]
W0523 22:57:42.234716    5442 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-c" endpoints:"10.12.3.223:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://etcd-c.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-c.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }: error building etcd client for https://etcd-c.internal.foo.bar:4001: dial tcp 10.12.3.223:4001: connect: connection refused
W0523 22:57:47.234992    5442 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"10.12.1.27:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://etcd-a.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-a.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }: error building etcd client for https://etcd-a.internal.foo.bar:4001: dial tcp 10.12.1.27:4001: connect: connection refused
I0523 22:57:47.235084    5442 controller.go:276] etcd cluster state: etcdClusterState
  members:
  peers:
    etcdClusterPeerInfo{peer=peer{id:"etcd-b" endpoints:"10.12.2.248:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://etcd-b.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-b.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }
    etcdClusterPeerInfo{peer=peer{id:"etcd-c" endpoints:"10.12.3.223:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://etcd-c.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-c.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }
    etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"10.12.1.27:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://etcd-a.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-a.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }
I0523 22:57:47.235331    5442 controller.go:277] etcd cluster members: map[]
I0523 22:57:47.235352    5442 controller.go:615] sending member map to all peers: members:<name:"etcd-b" dns:"etcd-b.internal.foo.bar" addresses:"10.12.2.248" > members:<name:"etcd-c" dns:"etcd-c.internal.foo.bar" addresses:"10.12.3.223" > members:<name:"etcd-a" dns:"etcd-a.internal.foo.bar" addresses:"10.12.1.27" >
I0523 22:57:47.235703    5442 etcdserver.go:222] updating hosts: map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]]
I0523 22:57:47.235726    5442 hosts.go:84] hosts update: primary=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]], fallbacks=map[etcd-a.internal.foo.bar:[10.12.1.27 10.12.1.27] etcd-b.internal.foo.bar:[10.12.2.248 10.12.2.248] etcd-c.internal.foo.bar:[10.12.3.223 10.12.3.223]], final=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]]
I0523 22:57:47.276273    5442 commands.go:22] not refreshing commands - TTL not hit
I0523 22:57:47.276312    5442 s3fs.go:220] Reading file "s3://foo-kops/foo.bar/backups/etcd/main/control/etcd-cluster-created"
I0523 22:57:47.599148    5442 controller.go:369] spec member_count:3 etcd_version:"3.2.24"
I0523 22:57:47.599201    5442 controller.go:417] mismatched version for peer peer{id:"etcd-a" endpoints:"10.12.1.27:3996" }: want "3.2.24", have "3.2.18-0"
I0523 22:57:47.599295    5442 controller.go:417] mismatched version for peer peer{id:"etcd-b" endpoints:"10.12.2.248:3996" }: want "3.2.24", have "3.2.18-0"
I0523 22:57:47.599340    5442 controller.go:417] mismatched version for peer peer{id:"etcd-c" endpoints:"10.12.3.223:3996" }: want "3.2.24", have "3.2.18-0"
I0523 22:57:47.599391    5442 commands.go:25] refreshing commands
I0523 22:57:47.759340    5442 vfs.go:104] listed commands in s3://foo-kops/foo.bar/backups/etcd/main/control: 0 commands
I0523 22:57:47.759362    5442 s3fs.go:220] Reading file "s3://foo-kops/foo.bar/backups/etcd/main/control/etcd-cluster-spec"
W0523 22:57:47.840027    5442 controller.go:149] unexpected error running etcd cluster reconciliation loop: etcd has 0 members registered; must issue restore-backup command to proceed
I0523 22:57:49.628145    5442 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0"
I0523 22:57:49.628204    5442 etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-0b639eae2b4e5bad2/data/etcd-cluster-token-etcd
I0523 22:57:49.628241    5442 etcdserver.go:553] overriding clientURLs with [https://etcd-a.internal.foo.bar:4001] (state had [https://0.0.0.0:4001])
I0523 22:57:49.628250    5442 etcdserver.go:557] overriding quarantinedClientURLs with [https://etcd-a.internal.foo.bar:3994] (state had [https://0.0.0.0:3994])
I0523 22:57:49.628424    5442 pki.go:39] generating peer keypair for etcd: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar] IPs:[127.0.0.1]} Usages:[2 1]}
I0523 22:57:49.628795    5442 pki.go:77] building client-serving certificate: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar etcd-a.internal.foo.bar] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
W0523 22:57:49.684957    5442 etcdserver.go:92] error running etcd: unknown etcd version v3.2.18-0: not found in [/opt/etcd-v3.2.18-0-linux-amd64]
I0523 22:57:55.892405    5442 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0523 22:57:55.966884    5442 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0523 22:57:56.019508    5442 hosts.go:84] hosts update: primary=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]], fallbacks=map[etcd-a.internal.foo.bar:[10.12.1.27 10.12.1.27] etcd-b.internal.foo.bar:[10.12.2.248 10.12.2.248] etcd-c.internal.foo.bar:[10.12.3.223 10.12.3.223]], final=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]]
I0523 22:57:56.019595    5442 hosts.go:181] skipping update of unchanged /etc/hosts

elisiano commented 5 years ago

I got hit by this as well. I was using 3.1.11 and upgraded etcd to 3.1.12 before attempting a cluster upgrade, but etcd-manager only found 3.1.11.

What is the right way to upgrade/downgrade etcd inside the same cluster? (I could not find specific docs.)

olemarkus commented 5 years ago

With kops 1.12.1, we set provider: Legacy on both etcd clusters and version: 3.1.12, then roll the masters gracefully. Once the masters have been rolled, we remove these entries and roll all masters simultaneously.
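In concrete terms the sequence looks roughly like this (a sketch only; the etcdClusters fields follow the etcd3 migration docs and <cluster-name> is a placeholder):

```
# Step 1: pin the legacy provider and the intermediate etcd version, apply,
# and roll the masters gracefully.
kops edit cluster <cluster-name>
#   etcdClusters:
#   - name: main
#     provider: Legacy
#     version: 3.1.12
#   - name: events
#     provider: Legacy
#     version: 3.1.12
kops update cluster <cluster-name> --yes
kops rolling-update cluster <cluster-name> --yes

# Step 2: remove the provider/version overrides again so etcd-manager takes over,
# apply, and roll all masters at the same time (see the etcd3 migration docs).
kops edit cluster <cluster-name>
kops update cluster <cluster-name> --yes
kops rolling-update cluster <cluster-name> --cloudonly --yes
```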

elisiano commented 5 years ago

@olemarkus thanks for the pointer.

I tried several scenarios but all end up in the same state when I enable etcd-manager.

Initial versions:

From here I tried:

In both cases, after enabling etcd-manager I did a rolling upgrade of all masters (and nodes for what it's worth, given I'm using calico) at the same time.

Re-adding provider: Legacy brings the cluster back to a functional state. I'd rather fix this properly, though, because the workaround feels like kicking the can down the road.

I've tried this several times, so at this point I'm afraid the etcd volumes are "dirty". I would be fine with starting a fresh, updated cluster (given that I could restore the latest backup with velero), but I couldn't find a way to do this either.

@justinsb do you have any recommendations?

olemarkus commented 5 years ago

Sorry for not mentioning this earlier.

If you already tried to migrate to etcd-manager but ran into the failure above, you need to move back to legacy mode, then go into the etcd EBS volumes on every master and remove the etcd-manager files. If I remember correctly, you should remove data, data-trashcan, and state, and be left with var, k8s, and lost+found in both volumes.
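In shell form, the cleanup on each master looks roughly like this (a sketch; the volume mount pattern follows the paths seen earlier in the thread, and copying state off the volume first is just a precaution):

```
# For both mounted etcd volumes (main and events): remove the files etcd-manager
# created, leaving the legacy layout (var, k8s, lost+found) untouched.
for vol in /mnt/master-vol-*; do
  sudo ls "$vol"                                          # expect: data data-trashcan k8s lost+found state var
  sudo cp "$vol/state" "/root/state.$(basename "$vol")"   # optional safety copy
  sudo rm -rf "$vol/state" "$vol/data" "$vol/data-trashcan"
done
```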

elisiano commented 5 years ago

Thanks @olemarkus, I will give it another try (I did delete data-trashcan in one of the iterations but it didn't help; I will try to compare it to an old cluster and do some cleanup). How about the S3 bucket? Is there anything that etcd-manager adds there that would interfere with subsequent migrations?

elisiano commented 5 years ago

@olemarkus thank you so much, cleaning the EBS volumes worked!

dcodix commented 5 years ago

Thanks @olemarkus, I was having the exact same problem as @elisiano.

I tested a full upgrade from 1.9.4 to 1.12.7. The steps that worked for me are:

  1. k8s 1.9.4 -> k8s 1.10.9 -> k8s 1.11.9
  2. etcd 3.0.17 -> etcd 3.1.17 -> etcd 3.2.24
  3. k8s 1.11.9 -> k8s 1.12.7 with etcd-manager off (Legacy)
  4. k8s 1.12.7 with etcd-manager on

This path worked for me perfectly. I think my problem was trying to use etcd-manager with 3.0.17.

granular-ryanbonham commented 5 years ago

/assign @justinsb

johngmyers commented 5 years ago

Perhaps provider: Legacy should remove those etcd-manager files.

olemarkus commented 5 years ago

I don't think it is a good idea to remove those files. It would be better if etcd-manager could just update using a compatible version. It should know which versions it can use, and not just look for a binary that doesn't exist. E.g., if it detects a 3.1.11 cluster but the cluster spec has 3.1.12 set, just migrate using 3.1.12. If the spec has 3.2.18 set, migrate using 3.1.12 and then start a 3.2.18 cluster.

juris commented 5 years ago

Got the same issue with etcd 3.3.10. When I try to adopt etcd-manager and update one of the master nodes, the node comes up but etcd-manager on it throws the following messages to the log:

I0627 14:08:25.958274    2284 controller.go:417] mismatched version for peer peer{id:"etcd-c" endpoints:"10.113.101.223:3996" }: want "3.3.10", have "2.2.1"
I0627 14:08:27.372315    2284 controller.go:423] can't do in-place upgrade from "2.2.1" -> "3.3.10"
I0627 14:08:27.465854    2284 controller.go:489] etcd has unhealthy members, but no idle peers ready to join, so won't remove unhealthy members
I0627 14:08:27.517703    2284 reconcile_configuration.go:73] TLS configuration is disabled, won't enable TLS
2019-06-27 14:08:27.586407 I | raft: 22ffec229227f838 is starting a new election at term 8854
I0627 14:08:27.606087    2284 controller.go:517] detected that we need to upgrade/downgrade etcd
I0627 14:08:27.625820    2284 controller.go:526] upgrade/downgrade needed, but we don't have sufficient peers
2019-06-27 14:08:27.662626 I | raft: 22ffec229227f838 became candidate at term 8855
2019-06-27 14:08:27.662779 I | raft: 22ffec229227f838 received vote from 22ffec229227f838 at term 8855
2019-06-27 14:08:27.828814 I | raft: 22ffec229227f838 [logterm: 4849, index: 108963922] sent vote request to 79c0787284ca5bc1 at term 8855
2019-06-27 14:08:27.842881 I | raft: 22ffec229227f838 [logterm: 4849, index: 108963922] sent vote request to e67333250905a5e9 at term 8855
2019-06-27 14:08:28.678367 E | rafthttp: request sent was ignored by peer e67333250905a5e9 (server version incompatible)
2019-06-27 14:08:28.763939 E | rafthttp: request sent was ignored by peer e67333250905a5e9 (server version incompatible)
2019-06-27 14:08:28.763986 E | rafthttp: request sent was ignored by peer 79c0787284ca5bc1 (server version incompatible)
albertoal commented 5 years ago

We've also been hit by this. In our case, our etcd version was already 3.3.3, and we upgraded from kops/k8s 1.11 to 1.12. We are also Calico users, so we rolled the masters at the same time as recommended in the upgrade guide.

Initially, our control plane wouldn't come up showing the following errors:

# etcd events
etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-01e50f743e7dbe4b4/data/etcd-cluster-token-etcd
pki.go:39] generating peer keypair for etcd: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.redacted] IPs:[127.0.0.1]} Usages:[2 1]}
pki.go:77] building client-serving certificate: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.redacted] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
3342 etcdserver.go:92] error running etcd: unknown etcd version v3.3.3: not found in [/opt/etcd-v3.3.3-linux-amd64]
# kube-apiserver (container wouldn't start)
storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt true true 1000 0xc420adf3b0 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:4001: connect: connection refused)

After this, we discovered this GitHub issue and added provider: Legacy to our cluster definition. After rolling the masters again, the control plane came up, but the entire cluster state aside from the kops-managed objects was lost, so we had to restore all of our Kubernetes objects.

Thankfully this was on a non-prod cluster, but we'd like to understand the issue a bit more before we consider upgrading the rest of our clusters. Could the data loss have been caused by attempting to start the cluster with etcd-manager on and then disabling it? Any clues will be appreciated!

alanmoment commented 5 years ago

I have the same problem. My state is:

I got error message "/opt/etcd-v3.0.17-linux-amd64 not found" in etcd.log.

Then I downloaded the file from https://github.com/etcd-io/etcd/releases/download/v3.0.17/etcd-v3.0.17-linux-amd64.tar.gz and put it in the etcd-manager-main and etcd-manager-events containers.

After executing a cluster rolling-update, it works now.
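For anyone trying the same workaround, it looks roughly like this (a sketch; the container names, and whether etcd-manager picks the binaries up this way, are assumptions, and the copy will not survive the container being recreated):

```
# Download the etcd release etcd-manager complains about and copy it into the
# running etcd-manager containers at the path it expects (/opt/etcd-v<ver>-linux-amd64).
VER=3.0.17
curl -LO "https://github.com/etcd-io/etcd/releases/download/v${VER}/etcd-v${VER}-linux-amd64.tar.gz"
tar xzf "etcd-v${VER}-linux-amd64.tar.gz"
for c in $(docker ps --format '{{.Names}}' | grep etcd-manager); do
  docker cp "etcd-v${VER}-linux-amd64" "${c}:/opt/etcd-v${VER}-linux-amd64"
done
```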

elblivion commented 5 years ago

I have the same problem.

Interestingly, we upgraded around 10 small pre-production clusters without a single issue. When we got around to upgrading clusters in production, we hit this problem on the first one.

We had EBS snapshots and were able to roll back, and on a second attempt it worked.

fejta-bot commented 4 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle stale

fejta-bot commented 4 years ago

Stale issues rot after 30d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /lifecycle rotten

fejta-bot commented 4 years ago

Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta. /close

k8s-ci-robot commented 4 years ago

@fejta-bot: Closing this issue.

In response to [this](https://github.com/kubernetes/kops/issues/6990#issuecomment-599515518): >Rotten issues close after 30d of inactivity. >Reopen the issue with `/reopen`. >Mark the issue as fresh with `/remove-lifecycle rotten`. > >Send feedback to sig-testing, kubernetes/test-infra and/or [fejta](https://github.com/fejta). >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
exequielrafaela commented 4 years ago

I have the same problem.

  • Kops 1.13.1, cloud-provider AWS
  • Kubernetes 1.12.10
  • etcd upgrade from 2.2.1 to 3.2.4

> Interestingly, we upgraded around 10 small pre-production clusters without a single issue. When we got around to upgrading clusters in production, we hit this problem on the first one.
>
> We had EBS snapshots and were able to roll back, and on a second attempt it worked.

Issue

$ curl 127.0.0.1:4001
curl: (7) Failed to connect to 127.0.0.1 port 4001: Connection refused

Solution

Same as quoted at the beginning, with the following versions though:

:point_right: on a second attempt it worked :heavy_check_mark: :checkered_flag:

Considerations

Ref-Link: https://github.com/kubernetes/kops/blob/master/docs/etcd3-migration.md

CC: @diego-ojeda-binbash @gdmlnx

white-hat commented 3 years ago

@fejta-bot /reopen

k8s-ci-robot commented 3 years ago

@white-hat: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/kops/issues/6990#issuecomment-736216691): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
white-hat commented 3 years ago

Encountered the same issue when trying to enable the "manager" provider with kops 1.17.2, k8s 1.12.7, etcd 3.2.24.

etcd-manager is trying to launch version 3.1.11 of etcd and I could not find a way to override this behavior

I1201 04:11:41.528867    5491 etcdserver.go:534] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd-events" nodes:<name:"etcd-events-a" peer_urls:"http://etcd-events-a.internal.***:2381" client_urls:"http://0.0.0.0:4002" quarantined_client_urls:"http://0.0.0.0:3995" > nodes:<name:"etcd-events-b" peer_urls:"http://etcd-events-b.internal.***:2381" client_urls:"http://0.0.0.0:4002" quarantined_client_urls:"http://0.0.0.0:3995" > nodes:<name:"etcd-events-c" peer_urls:"http://etcd-events-c.internal.***:2381" client_urls:"http://0.0.0.0:4002" quarantined_client_urls:"http://0.0.0.0:3995" > > etcd_version:"3.1.11"

and

 etcdserver.go:96] error running etcd: unknown etcd version v3.1.11: not found in [/opt/etcd-v3.1.11-linux-amd64]

and

controller.go:417] mismatched version for peer peer{id:"etcd-events-b" endpoints:"***:3997" }: want "3.2.24", have "3.1.11"

Fortunately I did not recycle all the master nodes, so I was able to roll back the changes

Is there any chance this can be fixed? Or at least, are there steps that can reliably mitigate the issue?

white-hat commented 3 years ago

Answering my own question: once you upgrade all master nodes simultaneously, all you need to do is:

  1. SSH to each master node.
  2. cd into the two mounted etcd volumes (one for main, another for events): cd /mnt/master-vol-02e4f7fb71a78b634/
  3. Delete the state file, which has the wrong etcd version in it, and the "trash" directory: rm -rf state data-trashcan
  4. Reboot the node: reboot

azharkhan2000 commented 1 year ago

I am facing this issue as well. When checking the logs in the etcd Docker container, I see:

unexpected error running etcd cluster reconciliation loop: etcd has 0 members registered; must issue a restore-backup command to proceed

Can anyone help me?

olemarkus commented 1 year ago

@azharkhan2000 I suggest you file a new GitHub issue. This issue has been closed.