Can you elaborate on what

> On an existing cluster created by version 0.1.17-rc4 using version 0.2.0 with change all internal_address.

exactly means? I can't determine what steps to use to reproduce this. Does this also reproduce when using 0.1.17 (non-RC) and 0.2.0?
Hello @superseb!
My steps were:

1. Changed all `internal_address` (`10.16.2.x` => `10.17.6.x`; see the `cluster.yml` sketch after the log below).
2. `rke up` (version 0.1.17-rc4):

```
2019-03-26 11:07:16.746979 W | rafthttp: health check for peer 1e68cb5571d2857f could not connect: dial tcp 10.16.2.25:2380: getsockopt: connection refused
2019-03-26 11:07:16.747123 W | rafthttp: health check for peer cab55970e9e091b4 could not connect: dial tcp 10.16.2.26:2380: getsockopt: connection refused
2019-03-26 11:07:19.986555 I | raft: ddf8001d91bec098 is starting a new election at term 405
2019-03-26 11:07:19.986886 I | raft: ddf8001d91bec098 became candidate at term 406
2019-03-26 11:07:19.986922 I | raft: ddf8001d91bec098 received MsgVoteResp from ddf8001d91bec098 at term 406
2019-03-26 11:07:19.986936 I | raft: ddf8001d91bec098 [logterm: 370, index: 10228338] sent MsgVote request to 1e68cb5571d2857f at term 406
2019-03-26 11:07:19.986948 I | raft: ddf8001d91bec098 [logterm: 370, index: 10228338] sent MsgVote request to cab55970e9e091b4 at term 406
2019-03-26 11:07:21.740463 E | etcdserver: publish error: etcdserver: request timed out
2019-03-26 11:07:21.747335 W | rafthttp: health check for peer 1e68cb5571d2857f could not connect: dial tcp 10.16.2.25:2380: getsockopt: connection refused
2019-03-26 11:07:21.747500 W | rafthttp: health check for peer cab55970e9e091b4 could not connect: dial tcp 10.16.2.26:2380: getsockopt: connection refused
2019-03-26 11:07:26.747734 W | rafthttp: health check for peer 1e68cb5571d2857f could not connect: dial tcp 10.16.2.25:2380: getsockopt: connection refused
2019-03-26 11:07:26.748024 W | rafthttp: health check for peer cab55970e9e091b4 could not connect: dial tcp 10.16.2.26:2380: getsockopt: connection refused
```
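For context, the `internal_address` change corresponds to editing each node entry in `cluster.yml`, roughly like this (a minimal sketch; the public address, user, and roles are assumptions):

```yaml
nodes:
  - address: 203.0.113.25          # public address (assumption)
    internal_address: 10.17.6.25   # was 10.16.2.25
    user: ubuntu                   # assumption
    role: [controlplane, etcd, worker]
```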
3. Reverted all `internal_address` (`10.17.6.x` => `10.16.2.x`) to make a backup.
4. Changed all `internal_address` again (`10.16.2.x` => `10.17.6.x`) and ran `rke up` (version 0.2.0):

```
2019-03-26 12:41:13.428377 I | etcdmain: rejected connection from "10.17.6.26:53324" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.486180 I | etcdmain: rejected connection from "10.17.6.25:54038" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.488732 I | etcdmain: rejected connection from "10.17.6.25:54040" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.535268 I | etcdmain: rejected connection from "10.17.6.26:53332" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.535613 I | etcdmain: rejected connection from "10.17.6.26:53330" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.594807 I | etcdmain: rejected connection from "10.17.6.25:54070" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.598035 I | etcdmain: rejected connection from "10.17.6.25:54062" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.650692 I | etcdmain: rejected connection from "10.17.6.26:53338" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.655358 I | etcdmain: rejected connection from "10.17.6.26:53340" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.706482 I | etcdmain: rejected connection from "10.17.6.25:54082" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.706742 I | etcdmain: rejected connection from "10.17.6.25:54080" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.757674 I | etcdmain: rejected connection from "10.17.6.26:53346" (error "remote error: tls: bad certificate", ServerName "")
```
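The `bad certificate` rejections are consistent with the etcd serving certificates not listing the new `10.17.6.x` addresses in their SANs. One way to check (the certificate filename follows RKE's usual naming and is an assumption here):

```sh
# Inspect the SANs of an etcd serving cert; the filename is assumed from RKE's naming scheme.
openssl x509 -in /etc/kubernetes/ssl/kube-etcd-10-16-2-25.pem -noout -text \
  | grep -A1 'Subject Alternative Name'
```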
5. Removed the etcd containers (`docker rm --force etcd` on each node), deleted all `/etc/kubernetes/ssl/kube-etcd-*` certificates, and cleared `/var/lib/etcd` (see the sketch below).
6. `rke up` (version 0.2.0). No change.
7. `rke up` (version 0.1.17). No change.
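A minimal sketch of that cleanup, run on each etcd node (paths are RKE's defaults; the exact `rm` flags are assumptions):

```sh
# On each etcd node: remove the container, the etcd serving certs, and the data dir.
docker rm --force etcd
rm -f /etc/kubernetes/ssl/kube-etcd-*
rm -rf /var/lib/etcd/*
```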
Now I will try to completely remove the certificates. Perhaps the kube-ca made by rke version 0.1.17-rc4 is not compatible with 0.2.0. `openssl verify` on the old certificates was successful (roughly the check sketched below).
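A sketch of that check against the cluster CA (file names are assumptions based on RKE's default layout under `/etc/kubernetes/ssl`):

```sh
# Verify that an old etcd cert chains to the RKE-generated cluster CA.
openssl verify -CAfile /etc/kubernetes/ssl/kube-ca.pem /etc/kubernetes/ssl/kube-etcd-10-16-2-25.pem
```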
Tomorrow I will try this on another cluster.
After deleting all the certificates, I was able to deploy a working etcd. But the backup is not restored:
```
panic: runtime error: index out of range
goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).node(0xc4201e60f8, 0x33313a36343a3630, 0x0, 0x0)
/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:660 +0x231
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).node(0xc4201c9528, 0x12420f4)
/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:369 +0x1e3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).CreateBucket(0xc4201e60f8, 0x12420f4, 0x5, 0x5, 0xc4201e8f68, 0xc4201c9608, 0xb03343)
/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:185 +0x33e
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).CreateBucket(0xc4201e60e0, 0x12420f4, 0x5, 0x5, 0xc4201c9650, 0x40e756, 0x7f18605c1528)
/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:108 +0x4f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeCreateBucket(0xc4201f6e10, 0x12420f4, 0x5, 0x5)
/tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:49 +0x6b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*lessor).initAndRecover(0xc420282960)
...
```
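For reference, a minimal sketch of the snapshot round trip this restore attempt corresponds to, assuming the backup was taken with `rke etcd snapshot-save` (the snapshot name is hypothetical):

```sh
# Save a snapshot before the address change, then try to restore it afterwards.
rke etcd snapshot-save --config cluster.yml --name before-ip-change
rke etcd snapshot-restore --config cluster.yml --name before-ip-change
```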
So tomorrow I will make a new cluster.
@MarkBerdnikov I know you've made a new cluster.
Typically, we don't test or recommend upgrading from an RC to a later version, as those are not tested paths. If you still face issues, please open a new issue.
+1
I ran into a similar issue on a cluster which had been shut down for over a year. The following got it back up and running for me:
Not sure if it would have solved this issue with altered node IPs, but hope it helps someone.
**RKE version:** v0.2.0
**Docker version:**
**Operating system and kernel:**
**Type/provider of hosts:** Hetzner Cloud
**cluster.yml file:**
**Steps to Reproduce:**
On an existing cluster created by version 0.1.17-rc4, ran version 0.2.0 after changing all `internal_address` values.
... a couple of hours of pain due to non-updated etcd cluster members ...
On each master node:
then
**Results:**