Closed. olemarkus closed this issue 4 years ago.
Hmmm ... sorry about the issue and thanks for reporting. So the etcd pods were running 3.2.18, but somehow we found etcd 3.1.11? Which instructions did you follow for this?
Were you running etcd-backup with 3.1.11? I'm wondering if we picked up an old backup (we shouldn't have)?
I'm going to look at whether we can use 3.1.12 to restore the backup (which is in the image), or maybe even just including 3.1.11 (because it looks like it was recommended for 1.9). But I'm worried that we're trying to restore an old backup, which is a bigger problem...
Another possibility - did you try using etcd-manager in the past, and then switch it off (with this cluster)? Maybe we're finding that...
We have not tried to use etcd-manager on this exact cluster before, but we did try etcd-manager on a cluster with the exact same name. That was some time ago and the cluster has since been reprovisioned. So it could be that the kops deletion did not delete the backup?
That being said, we have a lot of clusters running 3.1.11 because of the old recommendation so supporting a direct upgrade path from 3.1.11 would be much appreciated.
I can't rule out that we are looking at the backup, but we shouldn't be, and I'd be really surprised given that it requires a "restore backup command" to be in place (but maybe if that command was in place from before...). If you're able to get me some more of the logs I might be able to tell.
https://github.com/kopeio/etcd-manager/blob/master/pkg/legacy/adopt.go#L51
That is the code path we're supposed to go down when first introducing etcd-manager - we look at the manifest on-disk (from the EBS volume) and extract the etcd-version from that. If you could SSH to one of the masters and get that manifest (at least the image bit!) from the EBS volume mount, that would be super helpful also.
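Something like this should pull it out on a master; the volume mount name is an example, and the path mirrors what etcd-manager logs (minus the /rootfs prefix it sees from inside its container):

# print the image line(s) from the legacy etcd manifest(s) stored on the etcd volume(s)
sudo grep 'image:' /mnt/master-vol-*/k8s.io/manifests/etcd*.manifest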
I am looking at supporting using etcd 3.1.12 to read backups created with 3.1.11 ( https://github.com/kopeio/etcd-manager/pull/209 ). My real concern there is that this 3.1.11 backup is not what you want to restore.
I have some snippets around. Starting etcd-manager on a fresh node, I see a log line like this:
I0515 15:07:37.456694 3819 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"http://etcd-a.internal.foo.bar:2380" client_urls:"http://0.0.0.0:4001" quarantined_client_urls:"http://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"http://etcd-b.internal.foo.bar:2380" client_urls:"http://0.0.0.0:4001" quarantined_client_urls:"http://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"http://etcd-c.internal.foo.bar:2380" client_urls:"http://0.0.0.0:4001" quarantined_client_urls:"http://0.0.0.0:3994" > > etcd_version:"3.1.11"
So I got curious where it got this data from. And this comes from the state file in etcd volume. I moved this file somewhere else on all masters, and rolled all the masters. The masters then came up with etcd 3.2.18 running.
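For reference, the move was roughly this on each master (the volume ID is just an example):

# park the etcd-manager state file outside the volume, then roll the master
sudo mv /mnt/master-vol-0123456789abcdef0/state /root/etcd-manager-state.bak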
The data in the cluster is definitely fresh from just before we tried the upgrade. Before moving that state file I did have a look at the S3 bucket and there were no backups there, only the control folder. I don't think it has been trying to restore anything from a backup.
I'm seeing a similar error doing a 1.11 > 1.12.1 upgrade.
I0520 19:23:15.459382 5708 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0"
I0520 19:23:15.459470 5708 etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-0bb0a3db96a56648c/data/etcd-cluster-token-etcd
I0520 19:23:15.459516 5708 etcdserver.go:553] overriding clientURLs with [https://etcd-c.foo.bar:4001] (state had [https://0.0.0.0:4001])
I0520 19:23:15.459530 5708 etcdserver.go:557] overriding quarantinedClientURLs with [https://etcd-c.foo.bar:3994] (state had [https://0.0.0.0:3994])
I0520 19:23:15.461461 5708 pki.go:39] generating peer keypair for etcd: {CommonName:etcd-c Organization:[] AltNames:{DNSNames:[etcd-c.internal.dev-kops01.mysmithrx.com] IPs:[127.0.0.1]} Usages:[2 1]}
I0520 19:23:15.464632 5708 pki.go:77] building client-serving certificate: {CommonName:etcd-c Organization:[] AltNames:{DNSNames:[etcd-c.internal.foo.bar etcd-c.foo.bar] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
W0520 19:23:15.784131 5708 etcdserver.go:92] error running etcd: unknown etcd version v3.2.18-0: not found in [/opt/etcd-v3.2.18-0-linux-amd64]
My difference is I have always been on etcd 3.2.18-0 for this cluster. It has always been this version. I tried moving my state file as @olemarkus suggested, but that resulted in a different error:
I0521 00:45:16.751546 22616 adopt.go:97] parsing legacy manifest /rootfs/mnt/master-vol-080c99d57c410f64c/k8s.io/manifests/etcd-events.manifest
I0521 00:45:16.751958 22616 adopt.go:209] copying etcd data from /rootfs/mnt/master-vol-080c99d57c410f64c/var/etcd/data-events -> /rootfs/mnt/master-vol-080c99d57c410f64c/data/etcd-cluster-token-etcd-events
error initializing etcd server: error reading source directory /rootfs/mnt/master-vol-080c99d57c410f64c/var/etcd/data-events: open /rootfs/mnt/master-vol-080c99d57c410f64c/var/etcd/data-events: no such file or directory
Same issue running with kops 1.12.1.
Cluster was initially started with 3.1.11. We first upgraded to 3.1.12. Then upgraded the cluster rolling all masters at the same time.
Still errors with
W0521 11:23:30.588313 3255 etcdserver.go:92] error running etcd: unknown etcd version v3.1.11: not found in [/opt/etcd-v3.1.11-linux-amd64]
If the cluster runs with 3.1.11, I get a similar message as above, but also
I0521 11:32:07.141850 3617 controller.go:369] spec member_count:3 etcd_version:"3.1.11"
W0521 11:32:07.141921 3617 controller.go:1008] we don't support etcd version requested, won't assume forward compatability: member_count:3 etcd_version:"3.1.11"
W0521 11:32:07.141932 3617 controller.go:149] unexpected error running etcd cluster reconciliation loop: can't act as leader for unknown etcd version "3.1.11"
Some additional info for you @justinsb:
- Rolled back to 1.11.9 and etcd version 3.2.18-0 - everything was back up and working as expected.
- Upgraded etcd to 3.2.24 and rolled my masters - the cluster comes up fine.
- Set provider: Manager per the gradual updates doc - I get the same error about etcd-manager trying to grab the older 3.2.18-0 version.
Here's a fuller set of logs:
I0523 22:57:32.231490 5442 controller.go:173] starting controller iteration
I0523 22:57:32.231522 5442 controller.go:269] I am leader with token "CwS52JU9zpqMgqs4qO0IqQ"
W0523 22:57:37.234421 5442 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-b" endpoints:"10.12.2.248:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://etcd-b.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-b.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }: error building etcd client for https://etcd-b.internal.foo.bar:4001: dial tcp 10.12.2.248:4001: connect: connection refused
I0523 22:57:39.518739 5442 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0"
I0523 22:57:39.518811 5442 etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-0b639eae2b4e5bad2/data/etcd-cluster-token-etcd
I0523 22:57:39.518978 5442 etcdserver.go:553] overriding clientURLs with [https://etcd-a.internal.foo.bar:4001] (state had [https://0.0.0.0:4001])
I0523 22:57:39.518994 5442 etcdserver.go:557] overriding quarantinedClientURLs with [https://etcd-a.internal.foo.bar:3994] (state had [https://0.0.0.0:3994])
I0523 22:57:39.519158 5442 pki.go:39] generating peer keypair for etcd: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar] IPs:[127.0.0.1]} Usages:[2 1]}
I0523 22:57:39.519486 5442 pki.go:77] building client-serving certificate: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar etcd-a.internal.foo.bar] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
W0523 22:57:39.628034 5442 etcdserver.go:92] error running etcd: unknown etcd version v3.2.18-0: not found in [/opt/etcd-v3.2.18-0-linux-amd64]
W0523 22:57:42.234716 5442 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-c" endpoints:"10.12.3.223:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://etcd-c.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-c.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }: error building etcd client for https://etcd-c.internal.foo.bar:4001: dial tcp 10.12.3.223:4001: connect: connection refused
W0523 22:57:47.234992 5442 controller.go:675] unable to reach member etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"10.12.1.27:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://etcd-a.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-a.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }: error building etcd client for https://etcd-a.internal.foo.bar:4001: dial tcp 10.12.1.27:4001: connect: connection refused
I0523 22:57:47.235084 5442 controller.go:276] etcd cluster state: etcdClusterState
members:
peers:
etcdClusterPeerInfo{peer=peer{id:"etcd-b" endpoints:"10.12.2.248:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://etcd-b.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-b.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-c" endpoints:"10.12.3.223:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://etcd-c.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-c.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }
etcdClusterPeerInfo{peer=peer{id:"etcd-a" endpoints:"10.12.1.27:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://etcd-a.internal.foo.bar:4001" quarantined_client_urls:"https://etcd-a.internal.foo.bar:3994" > etcd_state:<cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0" > }
I0523 22:57:47.235331 5442 controller.go:277] etcd cluster members: map[]
I0523 22:57:47.235352 5442 controller.go:615] sending member map to all peers: members:<name:"etcd-b" dns:"etcd-b.internal.foo.bar" addresses:"10.12.2.248" > members:<name:"etcd-c" dns:"etcd-c.internal.foo.bar" addresses:"10.12.3.223" > members:<name:"etcd-a" dns:"etcd-a.internal.foo.bar" addresses:"10.12.1.27" >
I0523 22:57:47.235703 5442 etcdserver.go:222] updating hosts: map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]]
I0523 22:57:47.235726 5442 hosts.go:84] hosts update: primary=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]], fallbacks=map[etcd-a.internal.foo.bar:[10.12.1.27 10.12.1.27] etcd-b.internal.foo.bar:[10.12.2.248 10.12.2.248] etcd-c.internal.foo.bar:[10.12.3.223 10.12.3.223]], final=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]]
I0523 22:57:47.276273 5442 commands.go:22] not refreshing commands - TTL not hit
I0523 22:57:47.276312 5442 s3fs.go:220] Reading file "s3://foo-kops/foo.bar/backups/etcd/main/control/etcd-cluster-created"
I0523 22:57:47.599148 5442 controller.go:369] spec member_count:3 etcd_version:"3.2.24"
I0523 22:57:47.599201 5442 controller.go:417] mismatched version for peer peer{id:"etcd-a" endpoints:"10.12.1.27:3996" }: want "3.2.24", have "3.2.18-0"
I0523 22:57:47.599295 5442 controller.go:417] mismatched version for peer peer{id:"etcd-b" endpoints:"10.12.2.248:3996" }: want "3.2.24", have "3.2.18-0"
I0523 22:57:47.599340 5442 controller.go:417] mismatched version for peer peer{id:"etcd-c" endpoints:"10.12.3.223:3996" }: want "3.2.24", have "3.2.18-0"
I0523 22:57:47.599391 5442 commands.go:25] refreshing commands
I0523 22:57:47.759340 5442 vfs.go:104] listed commands in s3://foo-kops/foo.bar/backups/etcd/main/control: 0 commands
I0523 22:57:47.759362 5442 s3fs.go:220] Reading file "s3://foo-kops/foo.bar/backups/etcd/main/control/etcd-cluster-spec"
W0523 22:57:47.840027 5442 controller.go:149] unexpected error running etcd cluster reconciliation loop: etcd has 0 members registered; must issue restore-backup command to proceed
I0523 22:57:49.628145 5442 etcdserver.go:530] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd" nodes:<name:"etcd-a" peer_urls:"https://etcd-a.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-b" peer_urls:"https://etcd-b.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > nodes:<name:"etcd-c" peer_urls:"https://etcd-c.internal.foo.bar:2380" client_urls:"https://0.0.0.0:4001" quarantined_client_urls:"https://0.0.0.0:3994" > > etcd_version:"3.2.18-0"
I0523 22:57:49.628204 5442 etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-0b639eae2b4e5bad2/data/etcd-cluster-token-etcd
I0523 22:57:49.628241 5442 etcdserver.go:553] overriding clientURLs with [https://etcd-a.internal.foo.bar:4001] (state had [https://0.0.0.0:4001])
I0523 22:57:49.628250 5442 etcdserver.go:557] overriding quarantinedClientURLs with [https://etcd-a.internal.foo.bar:3994] (state had [https://0.0.0.0:3994])
I0523 22:57:49.628424 5442 pki.go:39] generating peer keypair for etcd: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar] IPs:[127.0.0.1]} Usages:[2 1]}
I0523 22:57:49.628795 5442 pki.go:77] building client-serving certificate: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.internal.foo.bar etcd-a.internal.foo.bar] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
W0523 22:57:49.684957 5442 etcdserver.go:92] error running etcd: unknown etcd version v3.2.18-0: not found in [/opt/etcd-v3.2.18-0-linux-amd64]
I0523 22:57:55.892405 5442 volumes.go:85] AWS API Request: ec2/DescribeVolumes
I0523 22:57:55.966884 5442 volumes.go:85] AWS API Request: ec2/DescribeInstances
I0523 22:57:56.019508 5442 hosts.go:84] hosts update: primary=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]], fallbacks=map[etcd-a.internal.foo.bar:[10.12.1.27 10.12.1.27] etcd-b.internal.foo.bar:[10.12.2.248 10.12.2.248] etcd-c.internal.foo.bar:[10.12.3.223 10.12.3.223]], final=map[10.12.1.27:[etcd-a.internal.foo.bar] 10.12.2.248:[etcd-b.internal.foo.bar] 10.12.3.223:[etcd-c.internal.foo.bar]]
I0523 22:57:56.019595 5442 hosts.go:181] skipping update of unchanged /etc/hosts
I got hit by this as well. I was using 3.1.11 and upgraded etcd to 3.1.12 before attempting a cluster upgrade, but etcd-manager found only 3.1.11.
What is the right way to upgrade/downgrade etcd inside the same cluster? (I could not find specific docs.)
With kops 1.12.1 we set provider: Legacy on both etcd clusters and version: 3.1.12, then roll the masters gracefully. Once the masters have been rolled, we remove these entries and roll all masters simultaneously.
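In kops terms that sequence is roughly the following; the cluster name is a placeholder and the exact rolling-update flags depend on your setup:

# Phase 1: pin the legacy provider and etcd 3.1.12 on both etcd clusters
# (set provider: Legacy and version: 3.1.12 under etcdClusters in the cluster spec)
kops edit cluster foo.bar
kops update cluster foo.bar --yes
kops rolling-update cluster foo.bar --yes               # graceful roll, still on Legacy

# Phase 2: remove provider and version again, apply, then roll all masters at once
kops edit cluster foo.bar
kops update cluster foo.bar --yes
kops rolling-update cluster foo.bar --cloudonly --yes   # fast, non-validating roll of the masters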
@olemarkus thanks for the pointer.
I tried several scenarios, but all end up in the same state when I enable etcd-manager.
Initial versions:
From here I tried:
1. Upgrade the cluster (keeping the Legacy provider). The cluster upgrades successfully, but as soon as I remove the provider entry (i.e. enable etcd-manager) the cluster is unusable: etcd-manager wants to upgrade from version 3.1.11, which is unsupported, even though it was running 3.1.12 before enabling it.
2. Upgrade the cluster (keeping the Legacy provider), then upgrade etcd to 3.2.24. At this point the cluster is running with the indicated etcd version. Removing the provider config from kops (which amounts to enabling etcd-manager) ends up in the same situation: it identifies the previous version as 3.1.11 and etcd-manager bails.
In both cases, after enabling etcd-manager I did a rolling upgrade of all masters (and nodes for what it's worth, given I'm using Calico) at the same time.
Re-adding provider: Legacy brings the cluster back to a functional state. I'd rather fix this properly though, because it sounds like kicking the can down the road.
I've tried this several times so at this time I'm afraid the etcd volumes are "dirty". I would be fine starting with a fresh, updated cluster (given I could restore the latest backup with velero) but I couldn't find a way to do this either.
@justinsb do you have any recommendations?
Sorry for not mentioning this earlier.
If you already tried to migrate to etcd-manager but ran into the failure above, you need to move back to legacy mode, then go into the etcd EBS volumes on every master and remove the etcd-manager files. If I remember correctly you should remove data, data-trashcan, and state, and be left with var, k8s, and lost+found in both volumes.
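On each master that looks roughly like this (the volume ID is an example; repeat for both the main and events volumes):

cd /mnt/master-vol-0123456789abcdef0    # example volume ID for one of the etcd volumes
sudo rm -rf data data-trashcan state    # leaves var, k8s and lost+found in place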
Thanks @olemarkus, I will give it another try (I did delete data-trashcan in one of the iterations but it didn't help; I will try to compare it to an old cluster and do some cleanup).
How about the S3 bucket? Is there anything that etcd-manager adds there that would interfere with subsequent migrations?
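(The S3 location in question follows the layout from the log lines above, e.g.:)

# list whatever etcd-manager has written for the main etcd cluster in the state store
aws s3 ls --recursive s3://foo-kops/foo.bar/backups/etcd/main/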
@olemarkus thank you so much, cleaning the EBS volumes worked!
Thanks @olemarkus I was having the same exact problem as @elisiano .
I tested a full upgrade from 1.9.4 to 1.12.7. The steps that worked for me are:
1. k8s 1.9.4 -> k8s 1.10.9 -> k8s 1.11.9
2. then etcd 3.0.17 -> etcd 3.1.17 -> etcd 3.2.24
3. then k8s 1.11.9 -> k8s 1.12.7 with etcd-manager off (Legacy)
4. then k8s 1.12.7 with etcd-manager on
This path worked for me perfectly. I think my problem was trying to use etcd-manager with 3.0.17.
/assign @justinsb
Perhaps Provider: legacy should remove those etcd-manager files.
I don't think it is a good idea to remove those files. It would be better if etcd-manager could just update using a compatible version. It should know which versions it can use rather than just looking for a binary that doesn't exist. E.g. if it detects a 3.1.11 cluster but the cluster spec has 3.1.12 set, just migrate using 3.1.12. If the spec has 3.2.18 set, migrate using 3.1.12 and then start a 3.2.18 cluster.
Got the same issue with etcd 3.3.10. When I try to adopt etcd-manager and update one of the master nodes, it comes up and etcd-manager on that node throws the following messages to the log:
I0627 14:08:25.958274 2284 controller.go:417] mismatched version for peer peer{id:"etcd-c" endpoints:"10.113.101.223:3996" }: want "3.3.10", have "2.2.1"
I0627 14:08:27.372315 2284 controller.go:423] can't do in-place upgrade from "2.2.1" -> "3.3.10"
I0627 14:08:27.465854 2284 controller.go:489] etcd has unhealthy members, but no idle peers ready to join, so won't remove unhealthy members
I0627 14:08:27.517703 2284 reconcile_configuration.go:73] TLS configuration is disabled, won't enable TLS
2019-06-27 14:08:27.586407 I | raft: 22ffec229227f838 is starting a new election at term 8854
I0627 14:08:27.606087 2284 controller.go:517] detected that we need to upgrade/downgrade etcd
I0627 14:08:27.625820 2284 controller.go:526] upgrade/downgrade needed, but we don't have sufficient peers
2019-06-27 14:08:27.662626 I | raft: 22ffec229227f838 became candidate at term 8855
2019-06-27 14:08:27.662779 I | raft: 22ffec229227f838 received vote from 22ffec229227f838 at term 8855
2019-06-27 14:08:27.828814 I | raft: 22ffec229227f838 [logterm: 4849, index: 108963922] sent vote request to 79c0787284ca5bc1 at term 8855
2019-06-27 14:08:27.842881 I | raft: 22ffec229227f838 [logterm: 4849, index: 108963922] sent vote request to e67333250905a5e9 at term 8855
2019-06-27 14:08:28.678367 E | rafthttp: request sent was ignored by peer e67333250905a5e9 (server version incompatible)
2019-06-27 14:08:28.763939 E | rafthttp: request sent was ignored by peer e67333250905a5e9 (server version incompatible)
2019-06-27 14:08:28.763986 E | rafthttp: request sent was ignored by peer 79c0787284ca5bc1 (server version incompatible)
We've also been hit by this. In our case, our etcd version was already 3.3.3, and we upgraded from kops/k8s 1.11 to 1.12. We are also Calico users, so we rolled the masters at the same time as recommended in the upgrade guide.
Initially, our control plane wouldn't come up, showing the following errors:
# etcd events
etcdserver.go:539] starting etcd with datadir /rootfs/mnt/master-vol-01e50f743e7dbe4b4/data/etcd-cluster-token-etcd
pki.go:39] generating peer keypair for etcd: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.redacted] IPs:[127.0.0.1]} Usages:[2 1]}
pki.go:77] building client-serving certificate: {CommonName:etcd-a Organization:[] AltNames:{DNSNames:[etcd-a.redacted] IPs:[127.0.0.1 127.0.0.1]} Usages:[1]}
3342 etcdserver.go:92] error running etcd: unknown etcd version v3.3.3: not found in [/opt/etcd-v3.3.3-linux-amd64]
# kube-apiserver (container wouldn't start)
storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt true true 1000 0xc420adf3b0 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:4001: connect: connection refused)
After this, we discovered this GitHub issue and added provider: Legacy to our cluster definition. After rolling the masters again, the control plane came up, but the entire cluster state aside from the kops-managed objects was lost, so we had to restore all of our Kubernetes objects.
Thankfully this was on a non-prod cluster, but we'd like to understand the issue a bit more before we consider upgrading the rest of our clusters. Could the data loss have been caused by us attempting to start the cluster with etcd-manager on and then disabling it? Any clues will be appreciated!
I have the same problem. My state is:
I got the error message "/opt/etcd-v3.0.17-linux-amd64 not found" in etcd.log.
Then I downloaded the file from https://github.com/etcd-io/etcd/releases/download/v3.0.17/etcd-v3.0.17-linux-amd64.tar.gz and put it in the etcd-manager-main and etcd-manager-events containers.
After executing a cluster rolling-update, it works now.
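A rough sketch of that workaround, run inside (or copied into) the etcd-manager-main and etcd-manager-events containers; the target directory matches the "not found in [/opt/etcd-v3.0.17-linux-amd64]" error:

curl -LO https://github.com/etcd-io/etcd/releases/download/v3.0.17/etcd-v3.0.17-linux-amd64.tar.gz
mkdir -p /opt/etcd-v3.0.17-linux-amd64
# unpack the etcd and etcdctl binaries directly into the directory etcd-manager expects
tar -xzf etcd-v3.0.17-linux-amd64.tar.gz -C /opt/etcd-v3.0.17-linux-amd64 --strip-components=1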
I have the same problem.
Interestingly, we upgraded around 10 small pre-production clusters without a single issue. When we got around to upgrading clusters in production, we hit this problem on the first one.
We had EBS snapshots and were able to roll back, and on a second attempt it worked.
@fejta-bot: Closing this issue.
I have the same problem.
- Kops 1.13.1, cloud-provider AWS
- Kubernetes 1.12.10
- etcd upgrade from 2.2.1 to 3.2.4
Interestingly, we upgraded around 10 small pre-production clusters without a single issue. When we got around to upgrading clusters in production, we hit this problem on the first one.
We had EBS snapshots and were able to roll back, and on a second attempt it worked.
K8s api logs
$ sudo tail -f kube-apiserver.log
I0323 22:43:47.796604 1 server.go:681] external host was not specified, using 172.18.64.111
I0323 22:43:47.797870 1 server.go:705] Initializing deserialization cache size based on 0MB limit
I0323 22:43:47.798003 1 server.go:724] Initializing cache sizes based on 0MB limit
I0323 22:43:47.798293 1 server.go:152] Version: v1.12.10
W0323 22:43:48.166697 1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0323 22:43:48.167373 1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook.
I0323 22:43:48.167392 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
W0323 22:43:48.167954 1 admission.go:76] PersistentVolumeLabel admission controller is deprecated. Please remove this controller from your configuration files and scripts.
I0323 22:43:48.168352 1 plugins.go:158] Loaded 10 mutating admission controller(s) successfully in the following order: NamespaceLifecycle,LimitRanger,ServiceAccount,NodeRestriction,TaintNodesByCondition,Priority,DefaultTolerationSeconds,PersistentVolumeLabel,DefaultStorageClass,MutatingAdmissionWebhook.
I0323 22:43:48.168366 1 plugins.go:161] Loaded 6 validating admission controller(s) successfully in the following order: LimitRanger,ServiceAccount,Priority,PersistentVolumeClaimResize,ValidatingAdmissionWebhook,ResourceQuota.
F0323 22:44:08.171850 1 storage_decorator.go:57] Unable to create storage backend: config (&{etcd3 /registry [https://127.0.0.1:4001] /etc/kubernetes/pki/kube-apiserver/etcd-client.key /etc/kubernetes/pki/kube-apiserver/etcd-client.crt /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt true true 1000 0xc420b06000 <nil> 5m0s 1m0s}), err (dial tcp 127.0.0.1:4001: connect: connection refused)
The kube-apiserver couldn't talk to etcd -> err (dial tcp 127.0.0.1:4001: connect: connection refused)
$ curl 127.0.0.1:4001
curl: (7) Failed to connect to 127.0.0.1 port 4001: Connection refused
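For completeness, a check that should succeed once etcd is actually listening; this cluster serves etcd over TLS, so the apiserver's etcd client certs (paths taken from the storage backend flags above) are needed:

# expect {"health": "true"} from a healthy member on the client port
curl --cacert /etc/kubernetes/pki/kube-apiserver/etcd-ca.crt \
     --cert /etc/kubernetes/pki/kube-apiserver/etcd-client.crt \
     --key /etc/kubernetes/pki/kube-apiserver/etcd-client.key \
     https://127.0.0.1:4001/health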
$ tail -f etcd.log
etcdClusterPeerInfo{peer=peer{id:"etcd-c" endpoints:"172.18.68.147:3996" }, info=cluster_name:"etcd" node_configuration:<name:"etcd-c" peer_urls:"https://etcd-c.internal.cluster-kops-1.k8s.devstg.demo.aws:2380" client_urls:"https://etcd-c.internal.cluster-kops-1.k8s.devstg.demo.aws:4001" quarantined_client_urls:"https://etcd-c.internal.cluster-kops-1.k8s.devstg.demo.aws:3994" > }
I0323 22:46:31.854692 3232 controller.go:277] etcd cluster members: map[]
I0323 22:46:31.854704 3232 controller.go:615] sending member map to all peers:
I0323 22:46:31.856202 3232 commands.go:22] not refreshing commands - TTL not hit
I0323 22:46:31.856227 3232 s3fs.go:220] Reading file "s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main/control/etcd-cluster-created"
I0323 22:46:31.888074 3232 controller.go:369] spec member_count:3 etcd_version:"3.3.13"
I0323 22:46:31.888137 3232 commands.go:25] refreshing commands
I0323 22:46:31.909745 3232 vfs.go:104] listed commands in s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main/control: 0 commands
I0323 22:46:31.909779 3232 s3fs.go:220] Reading file "s3://3pt-state-cluster-kops-1.k8s.devstg.demo.aws/cluster-kops-1.k8s.devstg.demo.aws/backups/etcd/main/control/etcd-cluster-spec"
W0323 22:46:31.919561 3232 controller.go:149] unexpected error running etcd cluster reconciliation loop: etcd has 0 members registered; must issue restore-backup command to proceed
Same as quoted at the beginning, with the following versions though:
:point_right: on a second attempt it worked :heavy_check_mark: :checkered_flag:
We were on kops >= 1.12.x; for newer versions (>= 1.13.x) this has most probably already been solved.
Ref-Link: https://github.com/kubernetes/kops/blob/master/docs/etcd3-migration.md
CC: @diego-ojeda-binbash @gdmlnx
@fejta-bot /reopen
@white-hat: You can't reopen an issue/PR unless you authored it or you are a collaborator.
Encountered the same issue when trying to enable the "manager" provider with kops 1.17.2, k8s 1.12.7, etcd 3.2.24.
etcd-manager is trying to launch version 3.1.11 of etcd, and I could not find a way to override this behavior.
I1201 04:11:41.528867 5491 etcdserver.go:534] starting etcd with state cluster:<desired_cluster_size:3 cluster_token:"etcd-cluster-token-etcd-events" nodes:<name:"etcd-events-a" peer_urls:"http://etcd-events-a.internal.***:2381" client_urls:"http://0.0.0.0:4002" quarantined_client_urls:"http://0.0.0.0:3995" > nodes:<name:"etcd-events-b" peer_urls:"http://etcd-events-b.internal.***:2381" client_urls:"http://0.0.0.0:4002" quarantined_client_urls:"http://0.0.0.0:3995" > nodes:<name:"etcd-events-c" peer_urls:"http://etcd-events-c.internal.***:2381" client_urls:"http://0.0.0.0:4002" quarantined_client_urls:"http://0.0.0.0:3995" > > etcd_version:"3.1.11"
and
etcdserver.go:96] error running etcd: unknown etcd version v3.1.11: not found in [/opt/etcd-v3.1.11-linux-amd64]
and
controller.go:417] mismatched version for peer peer{id:"etcd-events-b" endpoints:"***:3997" }: want "3.2.24", have "3.1.11"
Fortunately I did not recycle all the master nodes, so I was able to roll back the changes.
Is there any chance this can be fixed? Or at least, are there steps that can reliably mitigate this issue?
Answering my own question: once you upgrade all master nodes simultaneously, all you need to do is:
cd /mnt/master-vol-02e4f7fb71a78b634/
rm -rf state data-trashcan
reboot
I am facing this issue as well; when checking the logs in the etcd Docker container I see:
unexpected error running etcd cluster reconciliation loop: etcd has 0 members registered; must issue a restore-backup command to proceed
Can anyone help me?
@azharkhan2000 I suggest you file a new github issue. This issue has been closed
1. What kops version are you running? The command kops version will display this information.
kops 1.12
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
1.12.7
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
First created a cluster with etcd 3.1.11. Then per upgrade docs, migrated to 3.2.18. Confirmed the etcd pods were indeed running 3.2.18. Rolled masters fast following the guide.
5. What happened after the commands executed?
The masters did not come online. In /var/log/etcd.log I see entries like this: