rancher / rke

Rancher Kubernetes Engine (RKE) is an extremely simple, lightning-fast Kubernetes distribution that runs entirely within containers.

etcd rejected connection with error "remote error: tls: bad certificate" #1229

Closed: mberdnikov closed this issue 5 years ago

mberdnikov commented 5 years ago

RKE version: v0.2.0

Docker version:

Client:
 Version:           18.06.2-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        6d37f41
 Built:             Sun Feb 10 03:47:56 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.2-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       6d37f41
  Built:            Sun Feb 10 03:46:20 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Operating system and kernel:

NAME="Ubuntu"
VERSION="18.04.2 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.2 LTS"
VERSION_ID="18.04"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

4.15.0-45-generic

Type/provider of hosts: Hetzner Cloud

cluster.yml file:

nodes:
  - address: x.x.x.1
    internal_address: 10.17.6.24
    hostname_override: k8s-stage-master-4
    user: rancher
    role:
      - controlplane
      - etcd
  - address: x.x.x.2
    internal_address: 10.17.6.25
    hostname_override: k8s-stage-master-5
    user: rancher
    role:
      - controlplane
      - etcd
  - address: x.x.x.3
    internal_address: 10.17.6.26
    hostname_override: k8s-stage-master-6
    user: rancher
    role:
      - controlplane
      - etcd
  - address: x.x.x.4
    internal_address: 10.17.6.41
    hostname_override: k8s-stage-worker-1
    user: rancher
    role:
      - worker
  - address: x.x.x.5
    internal_address: 10.17.6.42
    hostname_override: k8s-stage-worker-2
    user: rancher
    role:
      - worker
  - address: x.x.x.6
    internal_address: 10.17.6.43
    labels:
      host-role: worker
      host-index: "5"
      shared-volume: yes
    hostname_override: k8s-stage-worker-3
    user: rancher
    role:
      - worker
  - address: x.x.x.7
    internal_address: 10.17.6.44
    hostname_override: k8s-stage-worker-4
    user: rancher
    role:
      - worker
  - address: x.x.x.8
    internal_address: 10.17.6.45
    hostname_override: k8s-stage-worker-5
    user: rancher
    role:
      - worker

kubernetes_version: "v1.13.4-rancher1-1"

Steps to Reproduce:

On an existing cluster created with version 0.1.17-rc4, run version 0.2.0 after changing every internal_address.

$ rke up

... a couple of hours of pain because the etcd cluster members still advertised the old peer addresses ...
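(For the record, the "pain" here is that the existing etcd members keep advertising the old peer URLs. A hand-fix sketch follows; it assumes RKE's default cert paths and the kube-etcd-<internal-ip> file naming that shows up later in this issue, and <memberID> is a placeholder taken from the member list output:)

# List current members from inside the etcd container on a master node.
# ETCDCTL_API=3 selects the v3 subcommands on the etcd 3.x image.
docker exec -e ETCDCTL_API=3 etcd etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/ssl/kube-ca.pem \
  --cert=/etc/kubernetes/ssl/kube-etcd-10-17-6-24.pem \
  --key=/etc/kubernetes/ssl/kube-etcd-10-17-6-24-key.pem \
  member list

# Re-point a member that still advertises an old peer URL:
docker exec -e ETCDCTL_API=3 etcd etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/ssl/kube-ca.pem \
  --cert=/etc/kubernetes/ssl/kube-etcd-10-17-6-24.pem \
  --key=/etc/kubernetes/ssl/kube-etcd-10-17-6-24-key.pem \
  member update <memberID> --peer-urls=https://10.17.6.24:2380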

On each master node:

$ docker rm --force etcd
$ rm -rf /var/lib/etcd/*
$ rm -f /etc/kubernetes/ssl/kube-etcd-*

then

$ rke up

Results:

INFO[0000] Initiating Kubernetes cluster                
INFO[0000] [certificates] Generating admin certificates and kubeconfig 
INFO[0000] Successfully Deployed state file at [./cluster.rkestate] 
INFO[0000] Building Kubernetes cluster                  
INFO[0000] [dialer] Setup tunnel for host [x.x.x.8] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.4] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.5] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.1] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.6] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.7] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.2] 
INFO[0000] [dialer] Setup tunnel for host [x.x.x.3] 
INFO[0001] [network] Deploying port listener containers 
INFO[0001] [network] Successfully started [rke-etcd-port-listener] container on host [x.x.x.2] 
INFO[0001] [network] Successfully started [rke-etcd-port-listener] container on host [x.x.x.3] 
INFO[0004] [network] Successfully started [rke-etcd-port-listener] container on host [x.x.x.1] 
INFO[0004] [network] Successfully started [rke-cp-port-listener] container on host [x.x.x.2] 
INFO[0004] [network] Successfully started [rke-cp-port-listener] container on host [x.x.x.3] 
INFO[0004] [network] Successfully started [rke-cp-port-listener] container on host [x.x.x.1] 
INFO[0006] [network] Port listener containers deployed successfully 
INFO[0006] [network] Running etcd <-> etcd port checks  
INFO[0006] [network] Successfully started [rke-port-checker] container on host [x.x.x.2] 
INFO[0006] [network] Successfully started [rke-port-checker] container on host [x.x.x.3] 
INFO[0006] [network] Successfully started [rke-port-checker] container on host [x.x.x.1] 
INFO[0007] [network] Running control plane -> etcd port checks 
INFO[0007] [network] Successfully started [rke-port-checker] container on host [x.x.x.3] 
INFO[0007] [network] Successfully started [rke-port-checker] container on host [x.x.x.1] 
INFO[0007] [network] Successfully started [rke-port-checker] container on host [x.x.x.2] 
INFO[0007] [network] Running control plane -> worker port checks 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.2] 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.1] 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.3] 
INFO[0008] [network] Running workers -> control plane port checks 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.6] 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.7] 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.4] 
INFO[0008] [network] Successfully started [rke-port-checker] container on host [x.x.x.5] 
INFO[0009] [network] Successfully started [rke-port-checker] container on host [x.x.x.8] 
INFO[0009] [network] Checking KubeAPI port Control Plane hosts 
INFO[0009] [network] Removing port listener containers  
INFO[0010] [remove/rke-etcd-port-listener] Successfully removed container on host [x.x.x.3] 
INFO[0010] [remove/rke-etcd-port-listener] Successfully removed container on host [x.x.x.2] 
INFO[0010] [remove/rke-etcd-port-listener] Successfully removed container on host [x.x.x.1] 
INFO[0010] [remove/rke-cp-port-listener] Successfully removed container on host [x.x.x.3] 
INFO[0010] [remove/rke-cp-port-listener] Successfully removed container on host [x.x.x.1] 
INFO[0010] [remove/rke-cp-port-listener] Successfully removed container on host [x.x.x.2] 
INFO[0010] [remove/rke-worker-port-listener] Successfully removed container on host [x.x.x.6] 
INFO[0010] [remove/rke-worker-port-listener] Successfully removed container on host [x.x.x.7] 
INFO[0010] [remove/rke-worker-port-listener] Successfully removed container on host [x.x.x.4] 
INFO[0010] [remove/rke-worker-port-listener] Successfully removed container on host [x.x.x.5] 
INFO[0010] [remove/rke-worker-port-listener] Successfully removed container on host [x.x.x.8] 
INFO[0010] [network] Port listener containers removed successfully 
INFO[0010] [certificates] Deploying kubernetes certificates to Cluster nodes 
INFO[0017] [reconcile] Rebuilding and updating local kube config 
INFO[0017] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml] 
INFO[0017] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml] 
INFO[0017] Successfully Deployed local admin kubeconfig at [./kube_config_cluster.yml] 
INFO[0017] [certificates] Successfully deployed kubernetes certificates to Cluster nodes 
INFO[0017] [reconcile] Reconciling cluster state        
INFO[0017] [reconcile] This is newly generated cluster  
INFO[0017] Pre-pulling kubernetes images                
INFO[0017] Kubernetes images pulled successfully        
INFO[0017] [etcd] Building up etcd plane..              
INFO[0017] [etcd] Successfully started [etcd] container on host [x.x.x.1] 
INFO[0017] [etcd] Saving snapshot [etcd-rolling-snapshots] on host [x.x.x.1] 
INFO[0017] [remove/etcd-rolling-snapshots] Successfully removed container on host [x.x.x.1] 
INFO[0018] [etcd] Successfully started [etcd-rolling-snapshots] container on host [x.x.x.1] 
INFO[0027] [certificates] Successfully started [rke-bundle-cert] container on host [x.x.x.1] 
INFO[0027] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [x.x.x.1] 
INFO[0027] [etcd] Successfully started [rke-log-linker] container on host [x.x.x.1] 
INFO[0028] [remove/rke-log-linker] Successfully removed container on host [x.x.x.1] 
INFO[0028] [etcd] Successfully started [etcd] container on host [x.x.x.2] 
INFO[0028] [etcd] Saving snapshot [etcd-rolling-snapshots] on host [x.x.x.2] 
INFO[0028] [remove/etcd-rolling-snapshots] Successfully removed container on host [x.x.x.2] 
INFO[0029] [etcd] Successfully started [etcd-rolling-snapshots] container on host [x.x.x.2] 
INFO[0034] [certificates] Successfully started [rke-bundle-cert] container on host [x.x.x.2] 
INFO[0034] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [x.x.x.2] 
INFO[0035] [etcd] Successfully started [rke-log-linker] container on host [x.x.x.2] 
INFO[0035] [remove/rke-log-linker] Successfully removed container on host [x.x.x.2] 
INFO[0036] [etcd] Successfully started [etcd] container on host [x.x.x.3] 
INFO[0036] [etcd] Saving snapshot [etcd-rolling-snapshots] on host [x.x.x.3] 
INFO[0036] [remove/etcd-rolling-snapshots] Successfully removed container on host [x.x.x.3] 
INFO[0036] [etcd] Successfully started [etcd-rolling-snapshots] container on host [x.x.x.3] 
INFO[0042] [certificates] Successfully started [rke-bundle-cert] container on host [x.x.x.3] 
INFO[0042] [certificates] successfully saved certificate bundle [/opt/rke/etcd-snapshots//pki.bundle.tar.gz] on host [x.x.x.3] 
INFO[0043] [etcd] Successfully started [rke-log-linker] container on host [x.x.x.3] 
INFO[0043] [remove/rke-log-linker] Successfully removed container on host [x.x.x.3] 
INFO[0043] [etcd] Successfully started etcd plane.. Checking etcd cluster health 
FATA[0186] [etcd] Failed to bring up Etcd Plane: [etcd] Etcd Cluster is not healthy 
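
(For context, RKE's etcd health check is essentially a client-cert TLS request to the /health endpoint on the client port; it can be reproduced by hand on a master node with something like the sketch below, assuming the kube-etcd-<internal-ip> cert naming used by RKE:)

# A healthy member answers {"health": "true"}; here the TLS handshake itself
# fails, matching the "bad certificate" rejections shown below.
curl --cacert /etc/kubernetes/ssl/kube-ca.pem \
     --cert /etc/kubernetes/ssl/kube-etcd-10-17-6-24.pem \
     --key /etc/kubernetes/ssl/kube-etcd-10-17-6-24-key.pem \
     https://127.0.0.1:2379/health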

$ docker logs --tail=10 etcd
2019-03-26 14:30:36.752689 I | etcdmain: rejected connection from "10.17.6.25:44716" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.812684 I | etcdmain: rejected connection from "10.17.6.26:59238" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.825622 I | etcdmain: rejected connection from "10.17.6.25:44726" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.850627 I | etcdmain: rejected connection from "10.17.6.26:59242" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.860215 I | etcdmain: rejected connection from "10.17.6.25:44728" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.930649 I | etcdmain: rejected connection from "10.17.6.26:59248" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.933235 I | etcdmain: rejected connection from "10.17.6.25:44736" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.934178 I | etcdmain: rejected connection from "10.17.6.25:44740" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.959495 I | etcdmain: rejected connection from "10.17.6.26:59250" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 14:30:36.969000 I | etcdmain: rejected connection from "10.17.6.25:44744" (error "remote error: tls: bad certificate", ServerName "")

$ openssl verify -verbose -CAfile /etc/kubernetes/ssl/kube-ca.pem  /etc/kubernetes/ssl/kube-etcd-10-17-6-25.pem 
CN = kube-etcd
error 7 at 0 depth lookup: certificate signature failure
error /etc/kubernetes/ssl/kube-etcd-10-17-6-25.pem: verification failed
139809397821888:error:0407008A:rsa routines:RSA_padding_check_PKCS1_type_1:invalid padding:../crypto/rsa/rsa_pk1.c:67:
139809397821888:error:04067072:rsa routines:rsa_ossl_public_decrypt:padding check failed:../crypto/rsa/rsa_ossl.c:586:
139809397821888:error:0D0C5006:asn1 encoding routines:ASN1_item_verify:EVP lib:../crypto/asn1/a_verify.c:171:
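
(The failed verify already points at a certificate/CA mismatch. Two more standard openssl checks help in this situation, using the same RKE default paths: one shows which IP SANs are baked into the local etcd cert, the other shows the certificate a live peer actually presents on the wire:)

# SANs in the local etcd cert; after an internal_address change the old IPs
# may still be listed here:
openssl x509 -in /etc/kubernetes/ssl/kube-etcd-10-17-6-25.pem -noout -text \
  | grep -A1 'Subject Alternative Name'

# Certificate presented by a running etcd member on the peer port:
echo | openssl s_client -connect 10.17.6.25:2380 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates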
superseb commented 5 years ago

Can you elaborate on what "on an existing cluster created with version 0.1.17-rc4, run version 0.2.0 after changing every internal_address" means, exactly? I can't determine what steps to follow to reproduce this.

Does this also reproduce when using 0.1.17 (non-RC) and 0.2.0?

mberdnikov commented 5 years ago

Hello @superseb !

My steps were:

  1. Changed the internal addresses of all 8 nodes (10.16.2.x => 10.17.6.x).
  2. Ran rke up (version 0.1.17-rc4).
  3. Got a non-working etcd due to the stale addresses:
2019-03-26 11:07:16.746979 W | rafthttp: health check for peer 1e68cb5571d2857f could not connect: dial tcp 10.16.2.25:2380: getsockopt: connection refused
2019-03-26 11:07:16.747123 W | rafthttp: health check for peer cab55970e9e091b4 could not connect: dial tcp 10.16.2.26:2380: getsockopt: connection refused
2019-03-26 11:07:19.986555 I | raft: ddf8001d91bec098 is starting a new election at term 405
2019-03-26 11:07:19.986886 I | raft: ddf8001d91bec098 became candidate at term 406
2019-03-26 11:07:19.986922 I | raft: ddf8001d91bec098 received MsgVoteResp from ddf8001d91bec098 at term 406
2019-03-26 11:07:19.986936 I | raft: ddf8001d91bec098 [logterm: 370, index: 10228338] sent MsgVote request to 1e68cb5571d2857f at term 406
2019-03-26 11:07:19.986948 I | raft: ddf8001d91bec098 [logterm: 370, index: 10228338] sent MsgVote request to cab55970e9e091b4 at term 406
2019-03-26 11:07:21.740463 E | etcdserver: publish error: etcdserver: request timed out
2019-03-26 11:07:21.747335 W | rafthttp: health check for peer 1e68cb5571d2857f could not connect: dial tcp 10.16.2.25:2380: getsockopt: connection refused
2019-03-26 11:07:21.747500 W | rafthttp: health check for peer cab55970e9e091b4 could not connect: dial tcp 10.16.2.26:2380: getsockopt: connection refused
2019-03-26 11:07:26.747734 W | rafthttp: health check for peer 1e68cb5571d2857f could not connect: dial tcp 10.16.2.25:2380: getsockopt: connection refused
2019-03-26 11:07:26.748024 W | rafthttp: health check for peer cab55970e9e091b4 could not connect: dial tcp 10.16.2.26:2380: getsockopt: connection refused
  4. Returned the old addresses (10.17.6.x => 10.16.2.x) to make a backup.
  5. Set the new internal addresses (10.16.2.x => 10.17.6.x) again and ran rke up (version 0.2.0):
2019-03-26 12:41:13.428377 I | etcdmain: rejected connection from "10.17.6.26:53324" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.486180 I | etcdmain: rejected connection from "10.17.6.25:54038" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.488732 I | etcdmain: rejected connection from "10.17.6.25:54040" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.535268 I | etcdmain: rejected connection from "10.17.6.26:53332" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.535613 I | etcdmain: rejected connection from "10.17.6.26:53330" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.594807 I | etcdmain: rejected connection from "10.17.6.25:54070" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.598035 I | etcdmain: rejected connection from "10.17.6.25:54062" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.650692 I | etcdmain: rejected connection from "10.17.6.26:53338" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.655358 I | etcdmain: rejected connection from "10.17.6.26:53340" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.706482 I | etcdmain: rejected connection from "10.17.6.25:54082" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.706742 I | etcdmain: rejected connection from "10.17.6.25:54080" (error "remote error: tls: bad certificate", ServerName "")
2019-03-26 12:41:13.757674 I | etcdmain: rejected connection from "10.17.6.26:53346" (error "remote error: tls: bad certificate", ServerName "")
  6. Stopped etcd (docker rm --force etcd on each node), deleted all /etc/kubernetes/ssl/kube-etcd-* certificates, and cleared /var/lib/etcd.
  7. Ran rke up (version 0.2.0).

No change.

  8. Repeated step 6.
  9. Ran rke up (version 0.1.17).

No change.

Now I will try removing the certificates completely. Perhaps the kube-ca generated by rke version 0.1.17-rc4 is not compatible with 0.2.0; openssl verify on the old certificates was successful.
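
(A small shell loop runs that verification across every etcd cert at once; the case statement only skips the private keys matched by the same glob:)

# Verify each etcd cert against the current kube-ca:
for cert in /etc/kubernetes/ssl/kube-etcd-*.pem; do
  case "$cert" in *-key.pem) continue ;; esac
  echo "== $cert"
  openssl verify -CAfile /etc/kubernetes/ssl/kube-ca.pem "$cert"
done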

Tomorrow I will try this on another cluster.

mberdnikov commented 5 years ago

After deleting all the certificates, I managed to deploy a working etcd. But the backup cannot be restored:

panic: runtime error: index out of range

goroutine 1 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).node(0xc4201e60f8, 0x33313a36343a3630, 0x0, 0x0)
    /tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:660 +0x231
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Cursor).node(0xc4201c9528, 0x12420f4)
    /tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/cursor.go:369 +0x1e3
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Bucket).CreateBucket(0xc4201e60f8, 0x12420f4, 0x5, 0x5, 0xc4201e8f68, 0xc4201c9608, 0xb03343)
    /tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/bucket.go:185 +0x33e
github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt.(*Tx).CreateBucket(0xc4201e60e0, 0x12420f4, 0x5, 0x5, 0xc4201c9650, 0x40e756, 0x7f18605c1528)
    /tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/bbolt/tx.go:108 +0x4f
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend.(*batchTx).UnsafeCreateBucket(0xc4201f6e10, 0x12420f4, 0x5, 0x5)
    /tmp/etcd-release-3.2.24/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/mvcc/backend/batch_tx.go:49 +0x6b
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/lease.(*lessor).initAndRecover(0xc420282960)
...
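
(The panic originates in bbolt while reading back the keyspace, which usually means the snapshot file itself is corrupt or truncated. A snapshot can be sanity-checked before restoring; the file name is a placeholder for whatever sits in /opt/rke/etcd-snapshots/:)

# Print hash, revision, and key count for a v3 snapshot; an error here means
# the file is not a valid snapshot:
ETCDCTL_API=3 etcdctl snapshot status /opt/rke/etcd-snapshots/<snapshot-name> --write-out=table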

So tomorrow I will make a new cluster.

deniseschannon commented 5 years ago

@MarkBerdnikov I know you've made a new cluster.

Typically, we don't test or recommend upgrading from an RC to a later version, as those are not tested paths. If you still face issues, please open a new issue.

oxr463 commented 3 years ago

+1

mmatthys commented 3 years ago

I ran into a similar issue on a cluster which had been shut down for over a year. The following got it back up and running for me:

  1. Regenerated all the RKE certs: https://rancher.com/blog/2019/kubernetes-certificate-expiry-and-rotation-in-rancher-kubernetes-clusters/
  2. Updated my RKE version.
  3. Ran rke up (twice, since the worker plane took a bit of time to start).
  4. Restarted each node (some nodes had stayed stuck on the old k8s version).

Not sure if it would have solved this issue with altered node IPs, but I hope it helps someone.
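
For reference, on RKE v0.2.x and later the certificate regeneration in step 1 can also be done with the built-in subcommand (flags have changed across releases, so check rke cert rotate --help on your version):

# Rotate all service certificates, signed by the existing CA:
rke cert rotate

# Rotate the CA itself and then all service certificates:
rke cert rotate --rotate-ca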