Closed: branttaylor closed this issue 1 year ago.
Please supply exact steps/commands to reproduce. If I run the scenario you described, I see a working 3-node etcd cluster, so maybe I'm not following the right steps. The only thing I can find on this error is a node with existing etcd data dir content (which should not be there if you use a new node, similar to what I did).
We have an existing set of RKE clusters that have been running for over 2 years, so it's not a brand new cluster that I am working with. Exact steps that I am observing:

1. Remove the master from cluster.yml
2. rke up, which completes successfully and removes this master from our k8s cluster
3. terraform apply, which re-creates our Ubuntu 20.04 VM
4. Add the master back to cluster.yml
5. rke up, which completes successfully and puts the master back into our k8s cluster in a Ready status
Note: While the rke up does complete successfully, it outputs this log message during the run:

time="2021-04-14T00:46:48Z" level=warning msg="[etcd] host [172.27.26.100] failed to check etcd health: failed to get /health for host [172.27.26.100]: Get \"https://172.27.26.100:2379/health\": Unable to access the service on 172.27.26.100:2379. The service might be still starting up. Error: ssh: rejected: connect failed (Connection refused)"
Run docker ps and see that the etcd container is restarting over and over, showing the log messages that I posted above.

I'm assuming that etcd has gotten into a very bad state, but I don't know what to do to resolve it. Other things I've tried:
- rke etcd restore to get all 3 masters working in the same etcd cluster, then try to replace master 1 again. Same result.
- docker exec etcd etcdctl member add dmz-kub-mas-s1 --peer-urls=https://172.27.26.100:2380 from master 2 to see if I can manually force master 1 into the cluster. Does not work, I assume because the etcd container on master 1 is continuously restarting.

Ok, I will run it again. Can you confirm that the recreated VM is clean and does not contain any data in the locations that RKE uses (in this case, particularly /var/lib/etcd)?
Yes, we just double-checked our VM and confirmed that this folder does not exist on it.
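Double-checking a rebuilt node can be scripted. A minimal sketch, assuming the two paths discussed in this thread are the ones to check; PREFIX is a hypothetical variable so the check can be pointed at a scratch tree instead of the real filesystem:

```shell
# Verify a rebuilt node carries no leftover RKE state before re-adding it.
# PREFIX defaults to empty (the real filesystem root); set it to a scratch
# directory to exercise the check safely.
PREFIX="${PREFIX:-}"
clean=yes
for d in /var/lib/etcd /etc/kubernetes/ssl; do
  if [ -e "${PREFIX}${d}" ]; then
    echo "leftover state found: ${PREFIX}${d}"
    clean=no
  fi
done
echo "node clean: ${clean}"
```

This is only a spot check of the paths mentioned here, not an exhaustive node cleanup verification.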
Can you share the following so I can compare them to my setup:
ls -la /etc/kubernetes/ssl
ls -la /var/lib/etcd
Sorry for the delay. We had to turn our attention elsewhere because we were bitten by the issue outlined in this blog post:
https://support.rancher.com/hc/en-us/articles/360058516672
Now that we're past that, here's the requested info!
70f983b4851be2a1, started, etcd-dmz-kub-mas-s3, https://172.27.26.102:2380, https://172.27.26.102:2379, false
ae8bf7c4437f1aec, started, etcd-dmz-kub-mas-s2, https://172.27.26.101:2380, https://172.27.26.101:2379, false
root@dmz-kub-mas-s2:/# docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out table
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://172.27.26.102:2379 | 70f983b4851be2a1 | 3.4.14 | 92 MB | false | false | 217 | 8591875 | 8591875 | |
| https://172.27.26.101:2379 | ae8bf7c4437f1aec | 3.4.14 | 92 MB | true | false | 217 | 8591875 | 8591875 | |
+----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
root@dmz-kub-mas-s2:/# docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint health
https://172.27.26.101:2379 is healthy: successfully committed proposal: took = 13.946967ms
https://172.27.26.102:2379 is healthy: successfully committed proposal: took = 14.725902ms
root@dmz-kub-mas-s2:/# ls -la /etc/kubernetes/ssl
total 124
drwxr-xr-x 2 root root 4096 Apr 9 14:51 .
drwxr-x--- 4 root root 67 Apr 29 17:48 ..
-rw------- 1 root root 1679 Apr 19 16:19 kube-apiserver-key.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-apiserver-proxy-client-key.pem
-rw------- 1 root root 1107 Apr 19 16:19 kube-apiserver-proxy-client.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-apiserver-requestheader-ca-key.pem
-rw------- 1 root root 1082 Apr 19 16:19 kube-apiserver-requestheader-ca.pem
-rw------- 1 root root 1407 Apr 19 16:19 kube-apiserver.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-ca-key.pem
-rw------- 1 root root 1017 Apr 19 16:19 kube-ca.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-controller-manager-key.pem
-rw------- 1 root root 1062 Apr 19 16:19 kube-controller-manager.pem
-rw------- 1 root root 1679 Apr 9 14:51 kube-etcd-172-27-26-100-key.pem
-rw------- 1 root root 1318 Apr 9 14:51 kube-etcd-172-27-26-100.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-etcd-172-27-26-101-key.pem
-rw------- 1 root root 1289 Apr 19 16:19 kube-etcd-172-27-26-101.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-etcd-172-27-26-102-key.pem
-rw------- 1 root root 1289 Apr 19 16:19 kube-etcd-172-27-26-102.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-node-key.pem
-rw------- 1 root root 1070 Apr 19 16:19 kube-node.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-proxy-key.pem
-rw------- 1 root root 1046 Apr 19 16:19 kube-proxy.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-scheduler-key.pem
-rw------- 1 root root 1050 Apr 19 16:19 kube-scheduler.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-service-account-token-key.pem
-rw------- 1 root root 1379 Apr 19 16:19 kube-service-account-token.pem
-rw------- 1 root root 517 Apr 9 14:51 kubecfg-kube-apiserver-proxy-client.yaml
-rw------- 1 root root 533 Apr 9 14:51 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw------- 1 root root 501 Apr 9 14:51 kubecfg-kube-controller-manager.yaml
-rw------- 1 root root 445 Apr 9 14:51 kubecfg-kube-node.yaml
-rw------- 1 root root 449 Apr 9 14:51 kubecfg-kube-proxy.yaml
-rw------- 1 root root 465 Apr 9 14:51 kubecfg-kube-scheduler.yaml
root@dmz-kub-mas-s2:/# ls -la /var/lib/etcd
total 4
drwx------ 3 root root 20 Apr 19 16:35 .
drwxr-xr-x 47 root root 4096 Apr 9 13:18 ..
drwx------ 4 root root 29 Apr 19 16:35 member
Note that docker exec etcd commands on master 1 all fail because the container will not stay running. The last time we attempted to add this master was Mon, 19 Apr 2021 16:38:09 -0500, according to a node describe.
The other requested info from that master:
root@dmz-kub-mas-s1:/# ls -la /etc/kubernetes/ssl
total 140
drwxr-xr-x 2 root root 4096 Apr 19 16:34 .
drwxr-x--- 3 root root 12288 Apr 29 17:48 ..
-rw------- 1 root root 1679 Apr 19 16:34 kube-apiserver-key.pem
-rw------- 1 root root 1679 Apr 19 16:34 kube-apiserver-proxy-client-key.pem
-rw------- 1 root root 1107 Apr 19 16:34 kube-apiserver-proxy-client.pem
-rw------- 1 root root 1675 Apr 19 16:34 kube-apiserver-requestheader-ca-key.pem
-rw------- 1 root root 1082 Apr 19 16:34 kube-apiserver-requestheader-ca.pem
-rw------- 1 root root 1440 Apr 19 16:34 kube-apiserver.pem
-rw------- 1 root root 1679 Apr 19 16:34 kube-ca-key.pem
-rw------- 1 root root 1017 Apr 19 16:34 kube-ca.pem
-rw------- 1 root root 1679 Apr 19 16:34 kube-controller-manager-key.pem
-rw------- 1 root root 1062 Apr 19 16:34 kube-controller-manager.pem
-rw------- 1 root root 1679 Apr 19 16:34 kube-etcd-172-27-26-100-key.pem
-rw------- 1 root root 1318 Apr 19 16:34 kube-etcd-172-27-26-100.pem
-rw------- 1 root root 1675 Apr 19 16:34 kube-etcd-172-27-26-101-key.pem
-rw------- 1 root root 1318 Apr 19 16:34 kube-etcd-172-27-26-101.pem
-rw------- 1 root root 1679 Apr 19 16:34 kube-etcd-172-27-26-102-key.pem
-rw------- 1 root root 1318 Apr 19 16:34 kube-etcd-172-27-26-102.pem
-rw------- 1 root root 1675 Apr 19 16:34 kube-node-key.pem
-rw------- 1 root root 1070 Apr 19 16:34 kube-node.pem
-rw------- 1 root root 1675 Apr 19 16:34 kube-proxy-key.pem
-rw------- 1 root root 1046 Apr 19 16:34 kube-proxy.pem
-rw------- 1 root root 1675 Apr 19 16:34 kube-scheduler-key.pem
-rw------- 1 root root 1050 Apr 19 16:34 kube-scheduler.pem
-rw------- 1 root root 1679 Apr 19 16:34 kube-service-account-token-key.pem
-rw------- 1 root root 1379 Apr 19 16:34 kube-service-account-token.pem
-rw------- 1 root root 517 Apr 19 16:34 kubecfg-kube-apiserver-proxy-client.yaml
-rw------- 1 root root 533 Apr 19 16:34 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw------- 1 root root 501 Apr 19 16:34 kubecfg-kube-controller-manager.yaml
-rw------- 1 root root 445 Apr 19 16:34 kubecfg-kube-node.yaml
-rw------- 1 root root 449 Apr 19 16:34 kubecfg-kube-proxy.yaml
-rw------- 1 root root 465 Apr 19 16:34 kubecfg-kube-scheduler.yaml
root@dmz-kub-mas-s1:/# ls -la /var/lib/etcd
total 4
drwx------ 3 root root 20 Apr 29 18:44 .
drwxr-xr-x 47 root root 4096 Apr 19 16:38 ..
drwx------ 4 root root 29 Apr 29 18:44 member
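A member directory under /var/lib/etcd, as shown in the listing above, is exactly what etcd's "the data-dir used by this member must be removed" error points at: the directory survives the failed join attempt and is reused on the next start. A quick check for that condition (ETCD_DIR is parameterized, and leftover state is simulated, so the sketch is runnable anywhere; on a real node it would be /var/lib/etcd):

```shell
# Check for a leftover etcd data directory on a node about to (re)join.
# ETCD_DIR defaults to a scratch path with simulated leftover state.
ETCD_DIR="${ETCD_DIR:-$(mktemp -d)/var/lib/etcd}"
mkdir -p "${ETCD_DIR}/member"   # simulate the state shown in the listing above
if [ -d "${ETCD_DIR}/member" ]; then
  echo "stale etcd member data present: remove ${ETCD_DIR} before re-adding the node"
fi
```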
We have not figured it out yet. We're just leaving the cluster alone right now since it's nonprod for us. We haven't tried to replace any prod masters due to this, though.
Exact same issue here, with a slightly different set-up:
RKE version: 1.2.8
Docker version:
Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr 9 22:46:45 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr 9 22:44:56 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0
Operating system and kernel:
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) OpenStack IaaS
cluster.yml file:
nodes:
- address: {redacted}worker-01
  port: "22"
  internal_address: "{redacted}2.31"
  role:
  - worker
  hostname_override: "{redacted}worker-01"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}worker-02
  port: "22"
  internal_address: "{redacted}2.32"
  role:
  - worker
  hostname_override: "{redacted}worker-02"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}worker-03
  port: "22"
  internal_address: "{redacted}2.33"
  role:
  - worker
  hostname_override: "{redacted}worker-03"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}master-01
  port: "22"
  internal_address: "{redacted}2.21"
  role:
  - controlplane
  - etcd
  hostname_override: "{redacted}master-01"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}master-02
  port: "22"
  internal_address: "{redacted}2.22"
  role:
  - controlplane
  - etcd
  hostname_override: "{redacted}master-02"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}master-03
  port: "22"
  internal_address: "{redacted}2.23"
  role:
  - controlplane
  - etcd
  hostname_override: "{redacted}master-03"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 52034
    gid: 52034
    snapshot: null
    retention: ""
    creation: ""
    backup_config:
      interval_hours: 3
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config:
      enabled: true
    audit_log:
      enabled: true
      configuration:
        max_age: 6
        max_backup: 6
        max_size: 110
        path: /var/log/kube-audit/audit-log.json
        format: json
        policy:
          apiVersion: audit.k8s.io/v1 # This is required.
          kind: Policy
          omitStages:
          - "RequestReceived"
          rules:
          # Log pod changes at RequestResponse level
          - level: RequestResponse
            resources:
            - group: ""
              # Resource "pods" doesn't match requests to any subresource of pods,
              # which is consistent with the RBAC policy.
              resources: ["pods"]
          # Log "pods/log", "pods/status" at Metadata level
          - level: Metadata
            resources:
            - group: ""
              resources: ["pods/log", "pods/status"]
          # Don't log requests to a configmap called "controller-leader"
          - level: None
            resources:
            - group: ""
              resources: ["configmaps"]
              resourceNames: ["controller-leader"]
          # Don't log watch requests by the "system:kube-proxy" on endpoints or services
          - level: None
            users: ["system:kube-proxy"]
            verbs: ["watch"]
            resources:
            - group: "" # core API group
              resources: ["endpoints", "services"]
          # Don't log authenticated requests to certain non-resource URL paths.
          - level: None
            userGroups: ["system:authenticated"]
            nonResourceURLs:
            - "/api*" # Wildcard matching.
            - "/version"
          # Log the request body of configmap changes in kube-system.
          - level: Request
            resources:
            - group: "" # core API group
              resources: ["configmaps"]
            # This rule only applies to resources in the "kube-system" namespace.
            # The empty string "" can be used to select non-namespaced resources.
            namespaces: ["kube-system"]
          # Log configmap and secret changes in all other namespaces at the Metadata level.
          - level: Metadata
            resources:
            - group: "" # core API group
              resources: ["secrets", "configmaps"]
          # Log all other resources in core and extensions at the Request level.
          - level: Request
            resources:
            - group: "" # core API group
            - group: "extensions" # Version of group should NOT be included.
          # A catch-all rule to log all other requests at the Metadata level.
          - level: Metadata
            # Long-running requests like watches that fall under this rule will not
            # generate an audit event in RequestReceived.
            omitStages:
            - "RequestReceived"
    admission_configuration: null
    event_rate_limit:
      enabled: true
  kube-controller:
    image: ""
    extra_args:
      feature-gates: "RotateKubeletServerCertificate=true"
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args:
      address: 127.0.0.1
      profiling: 'false'
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args:
      anonymous-auth: 'false'
      event-qps: '0'
      feature-gates: "RotateKubeletServerCertificate=true"
      protect-kernel-defaults: "true"
      tls-cipher-suites: "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: true
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons:
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.16-rancher1
  alpine: rancher/rke-tools:v0.1.74
  nginx_proxy: rancher/rke-tools:v0.1.74
  cert_downloader: rancher/rke-tools:v0.1.74
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.74
  kubedns: rancher/k8s-dns-kube-dns:1.15.10
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.10
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.10
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.8.1
  coredns: rancher/coredns-coredns:1.8.3
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.8.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.18.18-rancher1
  flannel: rancher/coreos-flannel:v0.13.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.18.1
  calico_cni: rancher/calico-cni:v3.18.1
  calico_controllers: rancher/calico-kube-controllers:v3.18.1
  calico_ctl: rancher/calico-ctl:v3.18.1
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.18.1
  canal_node: rancher/calico-node:v3.18.1
  canal_cni: rancher/calico-cni:v3.18.1
  canal_flannel: rancher/coreos-flannel:v0.13.0-rancher1
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.18.1
  weave_node: weaveworks/weave-kube:2.6.5
  weave_cni: weaveworks/weave-npc:2.6.5
  pod_infra_container: rancher/pause:3.2
  ingress: rancher/nginx-ingress-controller:nginx-0.43.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.4.1
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
ssh_key_path: /home/rke/.ssh/id_ecdsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
enable_network_policy: true
#default_pod_security_policy_template_id: "restricted"
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
cluster_name: "rancher-accept"
cloud_provider:
  name: openstack
  openstackCloudProvider:
    global:
      username: "{redacted}"
      password: "a{redacted}"
      auth-url: "{redacted}"
      tenant-id: "{redacted}"
      domain-name: "Default"
    load_balancer:
      subnet-id: "{redacted}"
    block_storage:
      ignore-volume-az: true
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns: null
Steps to Reproduce:
master-01 ETCD-leader log
2021-05-20 09:30:08.155398 W | rafthttp: rejected the stream from peer 6d321c2ad5b664dd since it was removed
2021-05-20 09:31:08.428355 W | rafthttp: rejected the stream from peer 6d321c2ad5b664dd since it was removed
2021-05-20 09:31:08.434061 W | rafthttp: rejected the stream from peer 6d321c2ad5b664dd since it was removed
2021-05-20 09:31:08.466015 I | embed: rejected connection from "{{redacted}}.2.23:55912" (error "read tcp {{redacted}}.2.21:2380->{{redacted}}.2.23:55912: read: connection reset by peer", ServerName "")
2021-05-20 09:31:08.466066 I | embed: rejected connection from "{{redacted}}.2.23:55908" (error "read tcp {{redacted}}.2.21:2380->{{redacted}}.2.23:55908: read: connection reset by peer", ServerName "")
master-03 ETCD log
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd became follower at term 1
raft2021/05/20 09:41:11 INFO: newRaft 6d321c2ad5b664dd [peers: [], term: 1, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
2021-05-20 09:41:11.119000 W | auth: simple token is not cryptographically signed
2021-05-20 09:41:11.122336 I | etcdserver: starting server... [version: 3.4.16, cluster version: to_be_decided]
2021-05-20 09:41:11.124625 I | embed: ClientTLS: cert = /etc/kubernetes/ssl/kube-etcd-{{redacted}}-2-23.pem, key = /etc/kubernetes/ssl/kube-etcd-{{redacted}}-2-23-key.pem, trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true, crl-file =
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd switched to configuration voters=(710006653955149030)
2021-05-20 09:41:11.125958 I | etcdserver/membership: added member 9da734a3d290ce6 [https://{{redacted}}.2.22:2380] to cluster cab14b13e4dd4b1d
2021-05-20 09:41:11.125986 I | rafthttp: starting peer 9da734a3d290ce6...
2021-05-20 09:41:11.126365 I | embed: listening for peers on {{redacted}}.2.23:2380
2021-05-20 09:41:11.126462 I | rafthttp: started HTTP pipelining with peer 9da734a3d290ce6
2021-05-20 09:41:11.129590 I | rafthttp: started peer 9da734a3d290ce6
2021-05-20 09:41:11.129850 I | rafthttp: added peer 9da734a3d290ce6
2021-05-20 09:41:11.129919 I | rafthttp: started streaming with peer 9da734a3d290ce6 (stream MsgApp v2 reader)
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd switched to configuration voters=(710006653955149030 1611268644819755515)
2021-05-20 09:41:11.130141 I | etcdserver/membership: added member 165c604bac34e1fb [https://{{redacted}}.2.21:2380] to cluster cab14b13e4dd4b1d
2021-05-20 09:41:11.130209 I | rafthttp: starting peer 165c604bac34e1fb...
2021-05-20 09:41:11.130258 I | rafthttp: started HTTP pipelining with peer 165c604bac34e1fb
2021-05-20 09:41:11.130783 I | rafthttp: started streaming with peer 9da734a3d290ce6 (stream Message reader)
2021-05-20 09:41:11.131450 I | rafthttp: started streaming with peer 9da734a3d290ce6 (writer)
2021-05-20 09:41:11.131877 I | rafthttp: started peer 165c604bac34e1fb
2021-05-20 09:41:11.131952 I | rafthttp: added peer 165c604bac34e1fb
2021-05-20 09:41:11.132305 I | rafthttp: started streaming with peer 9da734a3d290ce6 (writer)
2021-05-20 09:41:11.132452 I | rafthttp: started streaming with peer 165c604bac34e1fb (writer)
2021-05-20 09:41:11.132646 I | rafthttp: started streaming with peer 165c604bac34e1fb (writer)
2021-05-20 09:41:11.132725 I | rafthttp: started streaming with peer 165c604bac34e1fb (stream MsgApp v2 reader)
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd switched to configuration voters=(710006653955149030 1611268644819755515 7868382469269382365)
2021-05-20 09:41:11.132936 I | rafthttp: started streaming with peer 165c604bac34e1fb (stream Message reader)
2021-05-20 09:41:11.132988 I | etcdserver/membership: added member 6d321c2ad5b664dd [https://{{redacted}}.2.23:2380] to cluster cab14b13e4dd4b1d
2021-05-20 09:41:11.141562 E | etcdserver: the member has been permanently removed from the cluster
2021-05-20 09:41:11.141578 I | etcdserver: the data-dir used by this member must be removed.
2021-05-20 09:41:11.141700 E | etcdserver: publish error: etcdserver: request cancelled
2021-05-20 09:41:11.141758 I | etcdserver: aborting publish because server is stopped
Current member list:
165c604bac34e1fb, started, etcd-fa-fi-rancher-accept-master-01, https://{{redacted}}.2.21:2380, https://{{redacted}}.2.21:2379, false
2b386ca11e386da1, started, etcd-fa-fi-rancher-accept-master-02, https://{{redacted}}.2.22:2380, https://{{redacted}}.2.22:2379, false
Deleted all containers and images from Docker
This is not enough; you need to remove the data on the host as well, as it is mounted as a volume and will be reused. That's why the logging is also different. It is recommended to remove the host by removing it from cluster.yml and running rke up, versus using manual steps.
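The distinction superseb draws (containers vs host-mounted data) can be dry-run as a small sketch. The directory list below, including /opt/rke, is an assumption based on Rancher's general node-cleanup guidance rather than this thread, and TARGET defaults to a scratch tree with simulated leftovers so the sketch is safe to exercise:

```shell
# Deleting containers/images leaves bind-mounted state behind; remove the
# host-side directories too. TARGET stands for the node's filesystem root.
TARGET="${TARGET:-$(mktemp -d)}"
mkdir -p "${TARGET}/var/lib/etcd/member" "${TARGET}/etc/kubernetes/ssl"  # simulated leftovers
for d in /var/lib/etcd /etc/kubernetes /opt/rke; do
  if [ -e "${TARGET}${d}" ]; then
    echo "removing ${TARGET}${d}"
    rm -rf "${TARGET:?}${d}"
  fi
done
```

On a real node you would run the equivalent removals against / after taking the host out of cluster.yml, which is why letting rke up drive the removal is the safer path.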
Thanks @superseb! That fixed it for me. This is how I did it:

1. Removed the node from cluster.yml and re-ran rke up.
2. Added the node back to cluster.yml; reconciliation started and the ETCD cluster was healthy again.

Wonder if the same applies for @branttaylor. Apparently it's a no-go to remove a node from ETCD & Kubernetes by hand; just let RKE handle it instead. In hindsight this makes perfect sense.
This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.
RKE version: 1.2.5
Docker version: (docker version, docker info preferred)
Operating system and kernel: (cat /etc/os-release, uname -r preferred)
Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) vSphere 6.7
cluster.yml file:
Steps to Reproduce:
1. rke up with master missing from cluster.yml
2. rke up with master added back to cluster.yml
Results:
rke up completes successfully and the master shows Ready in the cluster, even though etcd is now unjoined from the cluster and is restarting over and over.
master 1 etcd logs:
master 3 logs: