rancher / rke

Rancher Kubernetes Engine (RKE) is an extremely simple, lightning-fast Kubernetes distribution that runs entirely within containers.
Apache License 2.0

etcdserver: the member has been permanently removed from the cluster #2512

Closed: branttaylor closed this issue 1 year ago

branttaylor commented 3 years ago

RKE version: 1.2.5

Docker version: (docker version, docker info preferred)

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)

Server:
 Containers: 33
  Running: 26
  Paused: 0
  Stopped: 7
 Images: 14
 Server Version: 19.03.14
 Storage Driver: overlay2
  Backing Filesystem: xfs
  Supports d_type: true
  Native Overlay Diff: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 269548fa27e0089a8b8278fc4fc781d7f65a939b
 runc version: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-70-generic
 Operating System: Ubuntu 20.04.2 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 3.816GiB
 Name: dmz-kub-mas-s3
 ID: LTAX:P27T:L7CU:HQMK:PDE3:KG6W:6QV2:7ZCY:GNUB:3LQV:KMF2:KNJR
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

Operating system and kernel: (cat /etc/os-release, uname -r preferred)

NAME="Ubuntu"
VERSION="20.04.2 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.2 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) vSphere 6.7

cluster.yml file:

nodes:
- address: 172.27.26.100
  port: "22"
  role:
  - controlplane
  - etcd
  hostname_override: dmz-kub-mas-s1
  user: sysadm
  docker_socket: /var/run/docker.sock
- address: 172.27.26.101
  port: "22"
  role:
  - controlplane
  - etcd
  hostname_override: dmz-kub-mas-s2
  user: sysadm
  docker_socket: /var/run/docker.sock
- address: 172.27.26.102
  port: "22"
  role:
  - controlplane
  - etcd
  hostname_override: dmz-kub-mas-s3
  user: sysadm
  docker_socket: /var/run/docker.sock
- address: 172.27.26.110
  port: "22"
  role:
  - worker
  hostname_override: dmz-kub-nod-s1
  user: sysadm
  docker_socket: /var/run/docker.sock
- address: 172.27.26.111
  port: "22"
  role:
  - worker
  hostname_override: dmz-kub-nod-s2
  user: sysadm
  docker_socket: /var/run/docker.sock
- address: 172.27.26.112
  port: "22"
  role:
  - worker
  hostname_override: dmz-kub-nod-s3
  user: sysadm
  docker_socket: /var/run/docker.sock
services:
  etcd:
    snapshot: true
    retention: 24h
    creation: 1h
    extra_args:
      enable-v2: "true"
  kube-api:
    service_cluster_ip_range: {redacted}
    pod_security_policy: true
    audit_log:
      enabled: true
      max_age: 10
      max_backup: 1
      max_size: 100
      path: /var/log/kube-audit/audit-log.json
      format: json
    extra_args:
      oidc-client-id: {redacted}
      oidc-issuer-url: {redacted}
      oidc-username-claim: email
      external-hostname: {redacted}
      enable-admission-plugins: NamespaceLifecycle,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,NodeRestriction,PodSecurityPolicy,Priority
      audit-policy-file: /etc/kubernetes/audit.yaml
  kube-controller:
    cluster_cidr: 100.114.0.0/16
    service_cluster_ip_range: 100.113.0.0/16
  kubelet:
    cluster_domain: cluster.local
    cluster_dns_server: 100.113.0.10
    fail_swap_on: false
network:
  plugin: weave
authentication:
  strategy: x509
  sans:
    - {redacted}
system_images:
  weave_node: weaveworks/weave-kube:2.7.0
  weave_cni: weaveworks/weave-npc:2.7.0
  kubernetes: rancher/hyperkube:v1.19.7-rancher1
ssh_key_path: id_rsa_rke
ssh_agent_auth: false
authorization:
  mode: rbac
ignore_docker_version: false
cluster_name: "${clustername}"
cloud_provider:
  name: vsphere
  vsphereCloudProvider:
    global:
      insecure-flag: true
    virtual_center:
      ${ip}:
        user: _serviceaccount@vsphere.local
        password: xxxv
        datacenters: xxx
    workspace:
      server: ${ip}
      datacenter: {redacted}
      default-datastore: ${datastore}
      folder: DMZ
addon_job_timeout: 120
ingress:
    provider: none

Steps to Reproduce:

Results:

master 1 etcd logs:

root@dmz-kub-mas-s1:/# docker exec etcd etcdctl member list
70f983b4851be2a1, started, etcd-dmz-kub-mas-s3, https://172.27.26.102:2380, https://172.27.26.102:2379, false
ae8bf7c4437f1aec, started, etcd-dmz-kub-mas-s2, https://172.27.26.101:2380, https://172.27.26.101:2379, false
root@dmz-kub-mas-s1:/# docker exec -e ETCDCTL_ENDPOINTS=$(docker exec etcd /bin/sh -c "etcdctl member list | cut -d, -f5 | sed -e 's/ //g' | paste -sd ','") etcd etcdctl endpoint status --write-out table
Error response from daemon: Container f2e4dd16dbeee0c52311e316d3ced733e3f908c1090460ef1f3a26bd08cde1bf is restarting, wait until the container is running
Error response from daemon: Container f2e4dd16dbeee0c52311e316d3ced733e3f908c1090460ef1f3a26bd08cde1bf is restarting, wait until the container is running
rafthttp: rejected the stream from peer e52199b5fe002b11 since it was removed^C
root@dmz-kub-mas-s1:/# docker logs etcd
2021-04-09 22:51:04.145875 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=x86_64
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-04-09 22:51:04.146020 I | etcdmain: etcd Version: 3.4.13
2021-04-09 22:51:04.146028 I | etcdmain: Git SHA: ae9734ed2
2021-04-09 22:51:04.146033 I | etcdmain: Go Version: go1.12.17
2021-04-09 22:51:04.146037 I | etcdmain: Go OS/Arch: linux/amd64
2021-04-09 22:51:04.146043 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-04-09 22:51:04.146211 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-172-27-26-100.pem, key = /etc/kubernetes/ssl/kube-etcd-172-27-26-100-key.pem, trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true, crl-file =
2021-04-09 22:51:04.148528 I | embed: name = etcd-dmz-kub-mas-s1
2021-04-09 22:51:04.148553 I | embed: data dir = /var/lib/rancher/etcd/
2021-04-09 22:51:04.148564 I | embed: member dir = /var/lib/rancher/etcd/member
2021-04-09 22:51:04.148572 I | embed: heartbeat = 500ms
2021-04-09 22:51:04.148580 I | embed: election = 5000ms
2021-04-09 22:51:04.148586 I | embed: snapshot count = 100000
2021-04-09 22:51:04.148600 I | embed: advertise client URLs = https://172.27.26.100:2379
2021-04-09 22:51:04.170815 I | etcdserver: starting member e52199b5fe002b11 in cluster 8e27ceb17688029b
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=()
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 became follower at term 0
raft2021/04/09 22:51:04 INFO: newRaft e52199b5fe002b11 [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 became follower at term 1
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=(8140682612799431329)
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=(8140682612799431329 12577418806680296172)
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=(8140682612799431329 12577418806680296172 16510646715846503185)
2021-04-09 22:51:04.176217 W | auth: simple token is not cryptographically signed
2021-04-09 22:51:04.181979 I | rafthttp: starting peer 70f983b4851be2a1...
2021-04-09 22:51:04.182037 I | rafthttp: started HTTP pipelining with peer 70f983b4851be2a1
2021-04-09 22:51:04.184687 I | rafthttp: started peer 70f983b4851be2a1
2021-04-09 22:51:04.184878 I | rafthttp: added peer 70f983b4851be2a1
2021-04-09 22:51:04.184986 I | rafthttp: starting peer ae8bf7c4437f1aec...
2021-04-09 22:51:04.185071 I | rafthttp: started HTTP pipelining with peer ae8bf7c4437f1aec
2021-04-09 22:51:04.187131 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (writer)
2021-04-09 22:51:04.187243 I | rafthttp: started streaming with peer 70f983b4851be2a1 (writer)
2021-04-09 22:51:04.187338 I | rafthttp: started streaming with peer 70f983b4851be2a1 (stream Message reader)
2021-04-09 22:51:04.187418 I | rafthttp: started streaming with peer 70f983b4851be2a1 (writer)
2021-04-09 22:51:04.187922 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (writer)
2021-04-09 22:51:04.188106 I | rafthttp: started streaming with peer 70f983b4851be2a1 (stream MsgApp v2 reader)
2021-04-09 22:51:04.190502 I | rafthttp: started peer ae8bf7c4437f1aec
2021-04-09 22:51:04.190531 I | rafthttp: added peer ae8bf7c4437f1aec
2021-04-09 22:51:04.190579 I | etcdserver: starting server... [version: 3.4.13, cluster version: to_be_decided]
2021-04-09 22:51:04.194268 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (stream MsgApp v2 reader)
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=(8140682612799431329 12577418806680296172 16510646715846503185)
2021-04-09 22:51:04.202364 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (stream Message reader)
2021-04-09 22:51:04.202604 I | etcdserver/membership: added member 70f983b4851be2a1 [https://172.27.26.102:2380] to cluster 8e27ceb17688029b
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=(8140682612799431329 12577418806680296172 16510646715846503185)
2021-04-09 22:51:04.202813 I | etcdserver/membership: added member ae8bf7c4437f1aec [https://172.27.26.101:2380] to cluster 8e27ceb17688029b
raft2021/04/09 22:51:04 INFO: e52199b5fe002b11 switched to configuration voters=(8140682612799431329 12577418806680296172 16510646715846503185)
2021-04-09 22:51:04.203017 I | etcdserver/membership: added member e52199b5fe002b11 [https://172.27.26.100:2380] to cluster 8e27ceb17688029b
2021-04-09 22:51:04.203629 I | embed: ClientTLS: cert = /etc/kubernetes/ssl/kube-etcd-172-27-26-100.pem, key = /etc/kubernetes/ssl/kube-etcd-172-27-26-100-key.pem, trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true, crl-file =
2021-04-09 22:51:04.203792 I | embed: listening for peers on [::]:2380
2021-04-09 22:51:04.204290 E | etcdserver: the member has been permanently removed from the cluster
2021-04-09 22:51:04.204305 I | etcdserver: the data-dir used by this member must be removed.
2021-04-09 22:51:04.204340 I | etcdserver: aborting publish because server is stopped

master 3 logs:

root@dmz-kub-mas-s3:/# docker logs etcd
2021-04-09 22:51:30.308369 W | pkg/flags: unrecognized environment variable ETCD_UNSUPPORTED_ARCH=x86_64
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-04-09 22:51:30.308491 I | etcdmain: etcd Version: 3.4.13
2021-04-09 22:51:30.308499 I | etcdmain: Git SHA: ae9734ed2
2021-04-09 22:51:30.308504 I | etcdmain: Go Version: go1.12.17
2021-04-09 22:51:30.308711 I | etcdmain: Go OS/Arch: linux/amd64
2021-04-09 22:51:30.308718 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2021-04-09 22:51:30.309091 N | etcdmain: the server is already initialized as member before, starting as etcd member...
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2021-04-09 22:51:30.309255 I | embed: peerTLS: cert = /etc/kubernetes/ssl/kube-etcd-172-27-26-102.pem, key = /etc/kubernetes/ssl/kube-etcd-172-27-26-102-key.pem, trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true, crl-file =
2021-04-09 22:51:30.311980 I | embed: name = etcd-dmz-kub-mas-s3
2021-04-09 22:51:30.312001 I | embed: data dir = /var/lib/rancher/etcd/
2021-04-09 22:51:30.312009 I | embed: member dir = /var/lib/rancher/etcd/member
2021-04-09 22:51:30.312014 I | embed: heartbeat = 500ms
2021-04-09 22:51:30.312019 I | embed: election = 5000ms
2021-04-09 22:51:30.312023 I | embed: snapshot count = 100000
2021-04-09 22:51:30.312034 I | embed: advertise client URLs = https://172.27.26.102:2379
2021-04-09 22:51:30.312040 I | embed: initial advertise peer URLs = https://172.27.26.102:2380
2021-04-09 22:51:30.312047 I | embed: initial cluster =
2021-04-09 22:51:30.672407 I | etcdserver: recovered store from snapshot at index 3
2021-04-09 22:51:30.674816 I | mvcc: restore compact to 117667511
2021-04-09 22:51:31.046581 I | etcdserver: restarting member 70f983b4851be2a1 in cluster 8e27ceb17688029b at commit index 56503
raft2021/04/09 22:51:31 INFO: 70f983b4851be2a1 switched to configuration voters=(8140682612799431329 12577418806680296172 16510646715846503185)
raft2021/04/09 22:51:31 INFO: 70f983b4851be2a1 became follower at term 11
raft2021/04/09 22:51:31 INFO: newRaft 70f983b4851be2a1 [peers: [70f983b4851be2a1,ae8bf7c4437f1aec,e52199b5fe002b11], term: 11, commit: 56503, applied: 3, lastindex: 56503, lastterm: 11]
2021-04-09 22:51:31.050847 I | etcdserver/membership: added member 70f983b4851be2a1 [https://172.27.26.102:2380] to cluster 8e27ceb17688029b from store
2021-04-09 22:51:31.050870 I | etcdserver/membership: added member ae8bf7c4437f1aec [https://172.27.26.101:2380] to cluster 8e27ceb17688029b from store
2021-04-09 22:51:31.050878 I | etcdserver/membership: added member e52199b5fe002b11 [https://172.27.26.100:2380] to cluster 8e27ceb17688029b from store
2021-04-09 22:51:31.052155 W | auth: simple token is not cryptographically signed
2021-04-09 22:51:31.053345 I | mvcc: restore compact to 117667511
2021-04-09 22:51:31.075185 I | rafthttp: starting peer ae8bf7c4437f1aec...
2021-04-09 22:51:31.075261 I | rafthttp: started HTTP pipelining with peer ae8bf7c4437f1aec
2021-04-09 22:51:31.075698 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (writer)
2021-04-09 22:51:31.075805 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (writer)
2021-04-09 22:51:31.076181 I | rafthttp: started peer ae8bf7c4437f1aec
2021-04-09 22:51:31.076216 I | rafthttp: added peer ae8bf7c4437f1aec
2021-04-09 22:51:31.076229 I | rafthttp: starting peer e52199b5fe002b11...
2021-04-09 22:51:31.076268 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (stream Message reader)
2021-04-09 22:51:31.076331 I | rafthttp: started streaming with peer ae8bf7c4437f1aec (stream MsgApp v2 reader)
2021-04-09 22:51:31.076417 I | rafthttp: started HTTP pipelining with peer e52199b5fe002b11
2021-04-09 22:51:31.076802 I | rafthttp: started streaming with peer e52199b5fe002b11 (writer)
2021-04-09 22:51:31.077348 I | rafthttp: started streaming with peer e52199b5fe002b11 (writer)
2021-04-09 22:51:31.078032 I | rafthttp: started peer e52199b5fe002b11
2021-04-09 22:51:31.078071 I | rafthttp: added peer e52199b5fe002b11
2021-04-09 22:51:31.078129 I | etcdserver: starting server... [version: 3.4.13, cluster version: to_be_decided]
2021-04-09 22:51:31.078302 I | rafthttp: started streaming with peer e52199b5fe002b11 (stream MsgApp v2 reader)
2021-04-09 22:51:31.078635 I | rafthttp: started streaming with peer e52199b5fe002b11 (stream Message reader)
2021-04-09 22:51:31.079056 N | etcdserver/membership: set the initial cluster version to 3.0
2021-04-09 22:51:31.079160 I | etcdserver/api: enabled capabilities for version 3.0
2021-04-09 22:51:31.079254 N | etcdserver/membership: updated the cluster version from 3.0 to 3.4
2021-04-09 22:51:31.079309 I | etcdserver/api: enabled capabilities for version 3.4
2021-04-09 22:51:31.083603 I | embed: ClientTLS: cert = /etc/kubernetes/ssl/kube-etcd-172-27-26-102.pem, key = /etc/kubernetes/ssl/kube-etcd-172-27-26-102-key.pem, trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true, crl-file =
2021-04-09 22:51:31.083825 I | embed: listening for peers on [::]:2380
raft2021/04/09 22:51:31 INFO: raft.node: 70f983b4851be2a1 elected leader ae8bf7c4437f1aec at term 11
2021-04-09 22:51:31.084695 I | rafthttp: peer ae8bf7c4437f1aec became active
2021-04-09 22:51:31.084726 I | rafthttp: established a TCP streaming connection with peer ae8bf7c4437f1aec (stream MsgApp v2 writer)
2021-04-09 22:51:31.084936 I | rafthttp: established a TCP streaming connection with peer ae8bf7c4437f1aec (stream Message writer)
2021-04-09 22:51:31.089180 I | rafthttp: established a TCP streaming connection with peer ae8bf7c4437f1aec (stream Message reader)
2021-04-09 22:51:31.093641 I | rafthttp: established a TCP streaming connection with peer ae8bf7c4437f1aec (stream MsgApp v2 reader)
raft2021/04/09 22:51:31 INFO: 70f983b4851be2a1 switched to configuration voters=(8140682612799431329 12577418806680296172)
2021-04-09 22:51:31.253882 I | etcdserver/membership: removed member e52199b5fe002b11 from cluster 8e27ceb17688029b
2021-04-09 22:51:31.254109 I | rafthttp: stopping peer e52199b5fe002b11...
2021-04-09 22:51:31.254234 I | rafthttp: stopped streaming with peer e52199b5fe002b11 (writer)
2021-04-09 22:51:31.254394 I | rafthttp: stopped streaming with peer e52199b5fe002b11 (writer)
2021-04-09 22:51:31.254532 I | rafthttp: stopped HTTP pipelining with peer e52199b5fe002b11
2021-04-09 22:51:31.254587 I | rafthttp: stopped streaming with peer e52199b5fe002b11 (stream MsgApp v2 reader)
2021-04-09 22:51:31.254610 I | rafthttp: stopped streaming with peer e52199b5fe002b11 (stream Message reader)
2021-04-09 22:51:31.254619 I | rafthttp: stopped peer e52199b5fe002b11
2021-04-09 22:51:31.254633 I | rafthttp: removed peer e52199b5fe002b11
2021-04-09 22:51:31.389242 I | embed: ready to serve client requests
2021-04-09 22:51:31.389306 I | etcdserver: published {Name:etcd-dmz-kub-mas-s3 ClientURLs:[https://172.27.26.102:2379]} to cluster 8e27ceb17688029b
2021-04-09 22:51:31.391202 I | embed: serving client requests on [::]:2379
2021-04-09 22:51:31.946987 W | rafthttp: rejected the stream from peer e52199b5fe002b11 since it was removed
superseb commented 3 years ago

Please supply exact steps/commands to reproduce. When I run the scenario you described, I see a working 3-node etcd cluster, so maybe I'm not following the right steps. The only thing I can find on this error is a node with existing etcd data dir content (which should not be there if you use a new node, similar to what I did).
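
For reference, a quick way to check a freshly provisioned node for leftovers (just a sketch, assuming the default RKE host paths and no prefix_path override):

# check for stale etcd/Kubernetes state on the node before (re)adding it
ls -la /var/lib/etcd /etc/kubernetes 2>/dev/null
docker ps -a --filter name=etcd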

branttaylor commented 3 years ago

We have an existing set of RKE clusters that have been running for over 2 years, so it's not a brand new cluster that I am working with. Exact steps that I am observing:

I'm assuming that etcd has gotten into a very bad state, but I don't know what to do to resolve it. Other things I've tried:

superseb commented 3 years ago

OK, I will run it again. Can you confirm that the recreated VM is clean and does not contain any data in the locations that RKE uses (in this case, particularly /var/lib/etcd)?

branttaylor commented 3 years ago

Yes, we just double-checked our VM and confirmed that this directory does not exist on it.

superseb commented 3 years ago

Can you share the following so I can compare them to my setup:

branttaylor commented 3 years ago

Sorry for the delay. We had to turn our attention elsewhere because we were bitten by the issue outlined in this blog post:

https://support.rancher.com/hc/en-us/articles/360058516672

Now that we're past that, here's the requested info!

root@dmz-kub-mas-s2:/# ls -la /etc/kubernetes/ssl
total 124
drwxr-xr-x 2 root root 4096 Apr  9 14:51 .
drwxr-x--- 4 root root   67 Apr 29 17:48 ..
-rw------- 1 root root 1679 Apr 19 16:19 kube-apiserver-key.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-apiserver-proxy-client-key.pem
-rw------- 1 root root 1107 Apr 19 16:19 kube-apiserver-proxy-client.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-apiserver-requestheader-ca-key.pem
-rw------- 1 root root 1082 Apr 19 16:19 kube-apiserver-requestheader-ca.pem
-rw------- 1 root root 1407 Apr 19 16:19 kube-apiserver.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-ca-key.pem
-rw------- 1 root root 1017 Apr 19 16:19 kube-ca.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-controller-manager-key.pem
-rw------- 1 root root 1062 Apr 19 16:19 kube-controller-manager.pem
-rw------- 1 root root 1679 Apr  9 14:51 kube-etcd-172-27-26-100-key.pem
-rw------- 1 root root 1318 Apr  9 14:51 kube-etcd-172-27-26-100.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-etcd-172-27-26-101-key.pem
-rw------- 1 root root 1289 Apr 19 16:19 kube-etcd-172-27-26-101.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-etcd-172-27-26-102-key.pem
-rw------- 1 root root 1289 Apr 19 16:19 kube-etcd-172-27-26-102.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-node-key.pem
-rw------- 1 root root 1070 Apr 19 16:19 kube-node.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-proxy-key.pem
-rw------- 1 root root 1046 Apr 19 16:19 kube-proxy.pem
-rw------- 1 root root 1675 Apr 19 16:19 kube-scheduler-key.pem
-rw------- 1 root root 1050 Apr 19 16:19 kube-scheduler.pem
-rw------- 1 root root 1679 Apr 19 16:19 kube-service-account-token-key.pem
-rw------- 1 root root 1379 Apr 19 16:19 kube-service-account-token.pem
-rw------- 1 root root  517 Apr  9 14:51 kubecfg-kube-apiserver-proxy-client.yaml
-rw------- 1 root root  533 Apr  9 14:51 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw------- 1 root root  501 Apr  9 14:51 kubecfg-kube-controller-manager.yaml
-rw------- 1 root root  445 Apr  9 14:51 kubecfg-kube-node.yaml
-rw------- 1 root root  449 Apr  9 14:51 kubecfg-kube-proxy.yaml
-rw------- 1 root root  465 Apr  9 14:51 kubecfg-kube-scheduler.yaml
root@dmz-kub-mas-s2:/# ls -la /var/lib/etcd
total 4
drwx------  3 root root   20 Apr 19 16:35 .
drwxr-xr-x 47 root root 4096 Apr  9 13:18 ..
drwx------  4 root root   29 Apr 19 16:35 member

Note that docker exec etcd commands on master 1 all fail because the container will not stay running. The last time we attempted to add this master was Mon, 19 Apr 2021 16:38:09 -0500, according to a kubectl describe of the node.
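
(A hedged aside: the restart loop itself can be confirmed from the Docker side without exec'ing into the container, for example:)

# sketch: confirm the restart loop and grab the last log lines before the crash
docker inspect --format '{{.State.Status}} restarts={{.RestartCount}} exitcode={{.State.ExitCode}}' etcd
docker logs --tail 20 etcd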

The other requested info from that master:

root@dmz-kub-mas-s1:/# ls -la /etc/kubernetes/ssl
total 140
drwxr-xr-x 2 root root  4096 Apr 19 16:34 .
drwxr-x--- 3 root root 12288 Apr 29 17:48 ..
-rw------- 1 root root  1679 Apr 19 16:34 kube-apiserver-key.pem
-rw------- 1 root root  1679 Apr 19 16:34 kube-apiserver-proxy-client-key.pem
-rw------- 1 root root  1107 Apr 19 16:34 kube-apiserver-proxy-client.pem
-rw------- 1 root root  1675 Apr 19 16:34 kube-apiserver-requestheader-ca-key.pem
-rw------- 1 root root  1082 Apr 19 16:34 kube-apiserver-requestheader-ca.pem
-rw------- 1 root root  1440 Apr 19 16:34 kube-apiserver.pem
-rw------- 1 root root  1679 Apr 19 16:34 kube-ca-key.pem
-rw------- 1 root root  1017 Apr 19 16:34 kube-ca.pem
-rw------- 1 root root  1679 Apr 19 16:34 kube-controller-manager-key.pem
-rw------- 1 root root  1062 Apr 19 16:34 kube-controller-manager.pem
-rw------- 1 root root  1679 Apr 19 16:34 kube-etcd-172-27-26-100-key.pem
-rw------- 1 root root  1318 Apr 19 16:34 kube-etcd-172-27-26-100.pem
-rw------- 1 root root  1675 Apr 19 16:34 kube-etcd-172-27-26-101-key.pem
-rw------- 1 root root  1318 Apr 19 16:34 kube-etcd-172-27-26-101.pem
-rw------- 1 root root  1679 Apr 19 16:34 kube-etcd-172-27-26-102-key.pem
-rw------- 1 root root  1318 Apr 19 16:34 kube-etcd-172-27-26-102.pem
-rw------- 1 root root  1675 Apr 19 16:34 kube-node-key.pem
-rw------- 1 root root  1070 Apr 19 16:34 kube-node.pem
-rw------- 1 root root  1675 Apr 19 16:34 kube-proxy-key.pem
-rw------- 1 root root  1046 Apr 19 16:34 kube-proxy.pem
-rw------- 1 root root  1675 Apr 19 16:34 kube-scheduler-key.pem
-rw------- 1 root root  1050 Apr 19 16:34 kube-scheduler.pem
-rw------- 1 root root  1679 Apr 19 16:34 kube-service-account-token-key.pem
-rw------- 1 root root  1379 Apr 19 16:34 kube-service-account-token.pem
-rw------- 1 root root   517 Apr 19 16:34 kubecfg-kube-apiserver-proxy-client.yaml
-rw------- 1 root root   533 Apr 19 16:34 kubecfg-kube-apiserver-requestheader-ca.yaml
-rw------- 1 root root   501 Apr 19 16:34 kubecfg-kube-controller-manager.yaml
-rw------- 1 root root   445 Apr 19 16:34 kubecfg-kube-node.yaml
-rw------- 1 root root   449 Apr 19 16:34 kubecfg-kube-proxy.yaml
-rw------- 1 root root   465 Apr 19 16:34 kubecfg-kube-scheduler.yaml
root@dmz-kub-mas-s1:/# ls -la /var/lib/etcd
total 4
drwx------  3 root root   20 Apr 29 18:44 .
drwxr-xr-x 47 root root 4096 Apr 19 16:38 ..
drwx------  4 root root   29 Apr 29 18:44 member

branttaylor commented 3 years ago

We have not figured it out yet. We're just leaving the cluster alone right now since it's nonprod for us. We haven't tried to replace any prod masters due to this, though.

vlieftink commented 3 years ago

Exact same issue here, with a slightly different set-up:

RKE version: 1.2.8

Docker version:

Client: Docker Engine - Community
 Version:           20.10.6
 API version:       1.41
 Go version:        go1.13.15
 Git commit:        370c289
 Built:             Fri Apr  9 22:46:45 2021
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server: Docker Engine - Community
 Engine:
  Version:          20.10.6
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       8728dd2
  Built:            Fri Apr  9 22:44:56 2021
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.4.4
  GitCommit:        05f951a3781f4f2c1911b05e61c160e9c30eaa8e
 runc:
  Version:          1.0.0-rc93
  GitCommit:        12644e614e25b05da6fd08a38ffa0cfe1903fdec
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

Operating system and kernel:

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Type/provider of hosts: (VirtualBox/Bare-metal/AWS/GCE/DO) OpenStack IaaS

cluster.yml file:

nodes:
- address: {redacted}worker-01
  port: "22"
  internal_address: "{redacted}2.31"
  role:
  - worker
  hostname_override: "{redacted}worker-01"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}worker-02
  port: "22"
  internal_address: "{redacted}2.32"
  role:
  - worker
  hostname_override: "{redacted}worker-02"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}worker-03
  port: "22"
  internal_address: "{redacted}2.33"
  role:
  - worker
  hostname_override: "{redacted}worker-03"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}master-01
  port: "22"
  internal_address: "{redacted}2.21"
  role:
  - controlplane
  - etcd
  hostname_override: "{redacted}master-01"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}master-02
  port: "22"
  internal_address: "{redacted}2.22"
  role:
  - controlplane
  - etcd
  hostname_override: "{redacted}master-02"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
- address: {redacted}master-03
  port: "22"
  internal_address: "{redacted}2.23"
  role:
  - controlplane
  - etcd
  hostname_override: "{redacted}master-03"
  user: rke
  docker_socket: /var/run/docker.sock
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
  labels: {}
  taints: []
services:
  etcd:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    external_urls: []
    ca_cert: ""
    cert: ""
    key: ""
    path: ""
    uid: 52034
    gid: 52034
    snapshot: null
    retention: ""
    creation: ""
    backup_config:
      interval_hours: 3
  kube-api:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
    service_cluster_ip_range: 10.43.0.0/16
    service_node_port_range: ""
    pod_security_policy: false
    always_pull_images: false
    secrets_encryption_config:
      enabled: true
    audit_log:
      enabled: true
      configuration:
        max_age: 6
        max_backup: 6
        max_size: 110
        path: /var/log/kube-audit/audit-log.json
        format: json
        policy:
          apiVersion: audit.k8s.io/v1 # This is required.
          kind: Policy
          omitStages:
            - "RequestReceived"
          rules:
            # Log pod changes at RequestResponse level
            - level: RequestResponse
              resources:
              - group: ""
                # Resource "pods" doesn't match requests to any subresource of pods,
                # which is consistent with the RBAC policy.
                resources: ["pods"]
            # Log "pods/log", "pods/status" at Metadata level
            - level: Metadata
              resources:
              - group: ""
                resources: ["pods/log", "pods/status"]

            # Don't log requests to a configmap called "controller-leader"
            - level: None
              resources:
              - group: ""
                resources: ["configmaps"]
                resourceNames: ["controller-leader"]

            # Don't log watch requests by the "system:kube-proxy" on endpoints or services
            - level: None
              users: ["system:kube-proxy"]
              verbs: ["watch"]
              resources:
              - group: "" # core API group
                resources: ["endpoints", "services"]

            # Don't log authenticated requests to certain non-resource URL paths.
            - level: None
              userGroups: ["system:authenticated"]
              nonResourceURLs:
              - "/api*" # Wildcard matching.
              - "/version"

            # Log the request body of configmap changes in kube-system.
            - level: Request
              resources:
              - group: "" # core API group
                resources: ["configmaps"]
              # This rule only applies to resources in the "kube-system" namespace.
              # The empty string "" can be used to select non-namespaced resources.
              namespaces: ["kube-system"]

            # Log configmap and secret changes in all other namespaces at the Metadata level.
            - level: Metadata
              resources:
              - group: "" # core API group
                resources: ["secrets", "configmaps"]

            # Log all other resources in core and extensions at the Request level.
            - level: Request
              resources:
              - group: "" # core API group
              - group: "extensions" # Version of group should NOT be included.

            # A catch-all rule to log all other requests at the Metadata level.
            - level: Metadata
              # Long-running requests like watches that fall under this rule will not
              # generate an audit event in RequestReceived.
              omitStages:
                - "RequestReceived"
    admission_configuration: null
    event_rate_limit:
      enabled: true
  kube-controller:
    image: ""
    extra_args:
      feature-gates: "RotateKubeletServerCertificate=true"
    extra_binds: []
    extra_env: []
    cluster_cidr: 10.42.0.0/16
    service_cluster_ip_range: 10.43.0.0/16
  scheduler:
    image: ""
    extra_args:
      address: 127.0.0.1
      profiling: 'false'
    extra_binds: []
    extra_env: []
  kubelet:
    image: ""
    extra_args:
      anonymous-auth: 'false'
      event-qps: '0'
      feature-gates: "RotateKubeletServerCertificate=true"
      protect-kernel-defaults: "true"
      tls-cipher-suites: "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
    extra_binds: []
    extra_env: []
    cluster_domain: cluster.local
    infra_container_image: ""
    cluster_dns_server: 10.43.0.10
    fail_swap_on: false
    generate_serving_certificate: true
  kubeproxy:
    image: ""
    extra_args: {}
    extra_binds: []
    extra_env: []
network:
  plugin: canal
  options: {}
  mtu: 0
  node_selector: {}
authentication:
  strategy: x509
  sans: []
  webhook: null
addons:
addons_include: []
system_images:
  etcd: rancher/coreos-etcd:v3.4.16-rancher1
  alpine: rancher/rke-tools:v0.1.74
  nginx_proxy: rancher/rke-tools:v0.1.74
  cert_downloader: rancher/rke-tools:v0.1.74
  kubernetes_services_sidecar: rancher/rke-tools:v0.1.74
  kubedns: rancher/k8s-dns-kube-dns:1.15.10
  dnsmasq: rancher/k8s-dns-dnsmasq-nanny:1.15.10
  kubedns_sidecar: rancher/k8s-dns-sidecar:1.15.10
  kubedns_autoscaler: rancher/cluster-proportional-autoscaler:1.8.1
  coredns: rancher/coredns-coredns:1.8.3
  coredns_autoscaler: rancher/cluster-proportional-autoscaler:1.8.1
  nodelocal: rancher/k8s-dns-node-cache:1.15.7
  kubernetes: rancher/hyperkube:v1.18.18-rancher1
  flannel: rancher/coreos-flannel:v0.13.0-rancher1
  flannel_cni: rancher/flannel-cni:v0.3.0-rancher6
  calico_node: rancher/calico-node:v3.18.1
  calico_cni: rancher/calico-cni:v3.18.1
  calico_controllers: rancher/calico-kube-controllers:v3.18.1
  calico_ctl: rancher/calico-ctl:v3.18.1
  calico_flexvol: rancher/calico-pod2daemon-flexvol:v3.18.1
  canal_node: rancher/calico-node:v3.18.1
  canal_cni: rancher/calico-cni:v3.18.1
  canal_flannel: rancher/coreos-flannel:v0.13.0-rancher1
  canal_flexvol: rancher/calico-pod2daemon-flexvol:v3.18.1
  weave_node: weaveworks/weave-kube:2.6.5
  weave_cni: weaveworks/weave-npc:2.6.5
  pod_infra_container: rancher/pause:3.2
  ingress: rancher/nginx-ingress-controller:nginx-0.43.0-rancher1
  ingress_backend: rancher/nginx-ingress-controller-defaultbackend:1.5-rancher1
  metrics_server: rancher/metrics-server:v0.4.1
  windows_pod_infra_container: rancher/kubelet-pause:v0.1.6
ssh_key_path: /home/rke/.ssh/id_ecdsa
ssh_cert_path: ""
ssh_agent_auth: false
authorization:
  mode: rbac
  options: {}
ignore_docker_version: false
kubernetes_version: ""
enable_network_policy: true
#default_pod_security_policy_template_id: "restricted"
private_registries: []
ingress:
  provider: ""
  options: {}
  node_selector: {}
  extra_args: {}
  dns_policy: ""
  extra_envs: []
  extra_volumes: []
  extra_volume_mounts: []
cluster_name: "rancher-accept"
cloud_provider:
    name: openstack
    openstackCloudProvider:
      global:
        username: "{redacted}"
        password: "a{redacted}"
        auth-url: "{redacted}"
        tenant-id: "{redacted}"
        domain-name: "Default"
      load_balancer:
        subnet-id: "{redacted}"
      block_storage:
        ignore-volume-az: true
prefix_path: ""
addon_job_timeout: 0
bastion_host:
  address: ""
  port: ""
  user: ""
  ssh_key: ""
  ssh_key_path: /home/rke/.ssh/id_ecdsa
  ssh_cert: ""
  ssh_cert_path: ""
monitoring:
  provider: ""
  options: {}
  node_selector: {}
restore:
  restore: false
  snapshot_name: ""
dns: null

Steps to Reproduce:

master-01 ETCD-leader log

2021-05-20 09:30:08.155398 W | rafthttp: rejected the stream from peer 6d321c2ad5b664dd since it was removed
2021-05-20 09:31:08.428355 W | rafthttp: rejected the stream from peer 6d321c2ad5b664dd since it was removed
2021-05-20 09:31:08.434061 W | rafthttp: rejected the stream from peer 6d321c2ad5b664dd since it was removed
2021-05-20 09:31:08.466015 I | embed: rejected connection from "{{redacted}}.2.23:55912" (error "read tcp {{redacted}}.2.21:2380->{{redacted}}.2.23:55912: read: connection reset by peer", ServerName "")
2021-05-20 09:31:08.466066 I | embed: rejected connection from "{{redacted}}.2.23:55908" (error "read tcp {{redacted}}.2.21:2380->{{redacted}}.2.23:55908: read: connection reset by peer", ServerName "")

master-03 ETCD log

raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd became follower at term 1
raft2021/05/20 09:41:11 INFO: newRaft 6d321c2ad5b664dd [peers: [], term: 1, commit: 3, applied: 0, lastindex: 3, lastterm: 1]
2021-05-20 09:41:11.119000 W | auth: simple token is not cryptographically signed
2021-05-20 09:41:11.122336 I | etcdserver: starting server... [version: 3.4.16, cluster version: to_be_decided]
2021-05-20 09:41:11.124625 I | embed: ClientTLS: cert = /etc/kubernetes/ssl/kube-etcd-{{redacted}}-2-23.pem, key = /etc/kubernetes/ssl/kube-etcd-{{redacted}}-2-23-key.pem, trusted-ca = /etc/kubernetes/ssl/kube-ca.pem, client-cert-auth = true, crl-file =
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd switched to configuration voters=(710006653955149030)
2021-05-20 09:41:11.125958 I | etcdserver/membership: added member 9da734a3d290ce6 [https://{{redacted}}.2.22:2380] to cluster cab14b13e4dd4b1d
2021-05-20 09:41:11.125986 I | rafthttp: starting peer 9da734a3d290ce6...
2021-05-20 09:41:11.126365 I | embed: listening for peers on {{redacted}}.2.23:2380
2021-05-20 09:41:11.126462 I | rafthttp: started HTTP pipelining with peer 9da734a3d290ce6
2021-05-20 09:41:11.129590 I | rafthttp: started peer 9da734a3d290ce6
2021-05-20 09:41:11.129850 I | rafthttp: added peer 9da734a3d290ce6
2021-05-20 09:41:11.129919 I | rafthttp: started streaming with peer 9da734a3d290ce6 (stream MsgApp v2 reader)
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd switched to configuration voters=(710006653955149030 1611268644819755515)
2021-05-20 09:41:11.130141 I | etcdserver/membership: added member 165c604bac34e1fb [https://{{redacted}}.2.21:2380] to cluster cab14b13e4dd4b1d
2021-05-20 09:41:11.130209 I | rafthttp: starting peer 165c604bac34e1fb...
2021-05-20 09:41:11.130258 I | rafthttp: started HTTP pipelining with peer 165c604bac34e1fb
2021-05-20 09:41:11.130783 I | rafthttp: started streaming with peer 9da734a3d290ce6 (stream Message reader)
2021-05-20 09:41:11.131450 I | rafthttp: started streaming with peer 9da734a3d290ce6 (writer)
2021-05-20 09:41:11.131877 I | rafthttp: started peer 165c604bac34e1fb
2021-05-20 09:41:11.131952 I | rafthttp: added peer 165c604bac34e1fb
2021-05-20 09:41:11.132305 I | rafthttp: started streaming with peer 9da734a3d290ce6 (writer)
2021-05-20 09:41:11.132452 I | rafthttp: started streaming with peer 165c604bac34e1fb (writer)
2021-05-20 09:41:11.132646 I | rafthttp: started streaming with peer 165c604bac34e1fb (writer)
2021-05-20 09:41:11.132725 I | rafthttp: started streaming with peer 165c604bac34e1fb (stream MsgApp v2 reader)
raft2021/05/20 09:41:11 INFO: 6d321c2ad5b664dd switched to configuration voters=(710006653955149030 1611268644819755515 7868382469269382365)
2021-05-20 09:41:11.132936 I | rafthttp: started streaming with peer 165c604bac34e1fb (stream Message reader)
2021-05-20 09:41:11.132988 I | etcdserver/membership: added member 6d321c2ad5b664dd [https://{{redacted}}.2.23:2380] to cluster cab14b13e4dd4b1d
2021-05-20 09:41:11.141562 E | etcdserver: the member has been permanently removed from the cluster
2021-05-20 09:41:11.141578 I | etcdserver: the data-dir used by this member must be removed.
2021-05-20 09:41:11.141700 E | etcdserver: publish error: etcdserver: request cancelled
2021-05-20 09:41:11.141758 I | etcdserver: aborting publish because server is stopped

Current member list:

165c604bac34e1fb, started, etcd-fa-fi-rancher-accept-master-01, https://{{redacted}}.2.21:2380, https://{{redacted}}.2.21:2379, false
2b386ca11e386da1, started, etcd-fa-fi-rancher-accept-master-02, https://{{redacted}}.2.22:2380, https://{{redacted}}.2.22:2379, false
superseb commented 3 years ago

"Deleted all containers and images from Docker" is not enough; you need to remove the data on the host as well, since it is mounted as a volume and will be reused. That's why the logging is also different. It is recommended to remove the host by removing it from cluster.yml and running rke up, rather than using manual steps.
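
As a rough illustration (not an official procedure): the data in question is whatever the etcd container bind-mounts from the host; in this thread that is /var/lib/etcd on the host, mounted as /var/lib/rancher/etcd inside the container. The host paths a recreated container would reuse can be listed with:

# sketch: list the host paths the etcd container mounts (and would reuse on recreation)
docker inspect --format '{{range .Mounts}}{{.Source}} -> {{.Destination}}{{println}}{{end}}' etcd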

vlieftink commented 3 years ago

Thanks @superseb! That fixed it for me.

This is how I did it (a rough command sketch follows the steps):

  1. Remove the node from cluster.yml and re-run rke up.
  2. Clean up the node by following the procedure: https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cleaning-cluster-nodes/
  3. Re-add the node to cluster.yml; reconciliation started and the etcd cluster became healthy again.
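
Roughly, in commands (a sketch of my flow; the exact cleanup paths depend on the cluster setup and are listed in the doc above):

# 1. with the broken node removed from cluster.yml, reconcile the remaining nodes
rke up --config cluster.yml
# 2. on the removed node, clean Docker state and the RKE directories per the linked procedure,
#    e.g. /etc/kubernetes, /var/lib/etcd, /var/lib/kubelet, /var/lib/cni, /opt/cni, /var/lib/rancher
docker rm -f $(docker ps -qa)
docker volume rm $(docker volume ls -q)
# 3. re-add the node to cluster.yml and let RKE reconcile the etcd cluster
rke up --config cluster.yml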

I wonder if the same applies for @branttaylor. Apparently it's a no-go to remove a node from etcd and Kubernetes by hand; you should just let RKE handle it. In hindsight this makes perfect sense.

github-actions[bot] commented 1 year ago

This repository uses an automated workflow to automatically label issues which have not had any activity (commit/comment/label) for 60 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the workflow can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the workflow will automatically close the issue in 14 days. Thank you for your contributions.