openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

Kubelet can't start API server if it needs a fresh certificate from API server #167

Closed jlebon closed 5 years ago

jlebon commented 6 years ago

In the local dev case, one may have provisioned only a single master. If that master is restarted after its certificate has expired, the kubelet fails on startup like so:

Aug 24 15:41:58 test1-master-0 systemd[1]: Started Kubernetes Kubelet.
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --rotate-certificates has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluste
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --pod-manifest-path has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --allow-privileged has been deprecated, will be removed in a future version
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --minimum-container-ttl-duration has been deprecated, Use --eviction-hard or --eviction-soft instead. Will be removed in a future version.
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --cluster-dns has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kubele
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --cluster-domain has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --client-ca-file has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --anonymous-auth has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kub
Aug 24 15:41:59 test1-master-0 docker[19162]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluster/kube
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069056   19185 server.go:418] Version: v1.11.0+d4cacc0
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069191   19185 server.go:496] acquiring file lock on "/var/run/lock/kubelet.lock"
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069220   19185 server.go:501] watching for inotify events for: /var/run/lock/kubelet.lock
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.069373   19185 plugins.go:97] No cloud provider specified.
Aug 24 15:41:59 test1-master-0 docker[19162]: E0824 15:41:59.071501   19185 bootstrap.go:195] Part of the existing bootstrap client certificate is expired: 2018-08-23 17:08:07 +0000 UTC
Aug 24 15:41:59 test1-master-0 docker[19162]: I0824 15:41:59.072551   19185 certificate_store.go:131] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Aug 24 15:41:59 test1-master-0 docker[19162]: F0824 15:41:59.093262   19185 server.go:262] failed to run Kubelet: cannot create certificate signing request: Post https://test1-api.mco.testing:6443/apis/certificates.k8s.io/v1beta1/certifica
Aug 24 15:41:59 test1-master-0 systemd[1]: kubelet.service: main process exited, code=exited, status=255/n/a
Aug 24 15:41:59 test1-master-0 systemd[1]: Unit kubelet.service entered failed state.
Aug 24 15:41:59 test1-master-0 systemd[1]: kubelet.service failed.

@aaronlevy says:

So what I’m thinking happened: we give master nodes a short-lived certificate (30min iirc) during the initial bootstrap. The intention is that this gets rotated out after that 30 minutes. However, if there was a single master and a reboot timed such that the rotation didn’t happen (and now it’s expired)… puts us in a bit of a pickle.
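
To spot the situation described above on a lone master, one can compare a certificate's expiry against a given point in time. The following is a minimal shell sketch, not something shipped by the installer: `cert_state` is a hypothetical helper name, it assumes GNU `date`, and it expects the expiry string in the form that `openssl x509 -enddate` prints after `notAfter=`.

```shell
#!/bin/sh
# cert_state NOT_AFTER [NOW]
# Hypothetical helper: prints "expired" if NOT_AFTER is at or before NOW,
# otherwise "valid". Both arguments are date strings understood by GNU date;
# NOW defaults to the current time.
cert_state() {
    not_after=$(date -u -d "$1" +%s) || return 2
    now=$(date -u -d "${2:-now}" +%s) || return 2
    if [ "$not_after" -le "$now" ]; then
        echo expired
    else
        echo valid
    fi
}

# Bootstrap client certificate expiry and kubelet restart time from the log above.
cert_state "Aug 23 17:08:07 2018 GMT" "Aug 24 15:41:59 2018 GMT"
```

On a live node the first argument could come from `openssl x509 -enddate -noout -in /var/lib/kubelet/pki/kubelet-client-current.pem | cut -d= -f2`, using the path shown in the kubelet log.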

praveenkumar commented 5 years ago

@aaronlevy Can you tell me where that short-lived certificate is located on the master or bootstrap node? I created a cluster using libvirt (which was up and running), then shut it down, and 2 days later it is not coming up. etcd shows the logs below.

I still have that setup handy if you need any more info. I just want to understand when it is safe to shut down my libvirt VM so that I can start it again when required without any issue.

[root@test1-master-0 core]# crictl ps
CONTAINER ID        IMAGE                                                              CREATED             STATE               NAME                ATTEMPT
ab1c998534064       94bc3af972c98ce73f99d70bd72144caa8b63e541ccc9d844960b7f0ca77d7c4   4 minutes ago       Running             etcd-member         1
[root@test1-master-0 core]# crictl logs ab1c998534064
2018-12-05 09:41:31.214799 I | pkg/flags: recognized and used environment variable ETCD_DATA_DIR=/var/lib/etcd
2018-12-05 09:41:31.215415 I | pkg/flags: recognized and used environment variable ETCD_NAME=etcd-member-test1-master-0
2018-12-05 09:41:31.215476 I | etcdmain: etcd Version: 3.2.14
2018-12-05 09:41:31.215489 I | etcdmain: Git SHA: fb5cd6f1c
2018-12-05 09:41:31.215494 I | etcdmain: Go Version: go1.8.5
2018-12-05 09:41:31.215499 I | etcdmain: Go OS/Arch: linux/amd64
2018-12-05 09:41:31.215505 I | etcdmain: setting maximum number of CPUs to 2, total number of available CPUs is 2
2018-12-05 09:41:31.215686 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2018-12-05 09:41:31.215720 I | embed: peerTLS: cert = /etc/ssl/etcd/system:etcd-peer:test1-etcd-0.tt.testing.crt, key = /etc/ssl/etcd/system:etcd-peer:test1-etcd-0.tt.testing.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
2018-12-05 09:41:31.219274 I | embed: listening for peers on https://0.0.0.0:2380
2018-12-05 09:41:31.219572 I | embed: listening for client requests on 0.0.0.0:2379
2018-12-05 09:41:31.310536 I | etcdserver: name = etcd-member-test1-master-0
2018-12-05 09:41:31.311205 I | etcdserver: data dir = /var/lib/etcd
2018-12-05 09:41:31.311265 I | etcdserver: member dir = /var/lib/etcd/member
2018-12-05 09:41:31.311302 I | etcdserver: heartbeat = 100ms
2018-12-05 09:41:31.311393 I | etcdserver: election = 1000ms
2018-12-05 09:41:31.311473 I | etcdserver: snapshot count = 100000
2018-12-05 09:41:31.311657 I | etcdserver: advertise client URLs = https://192.168.126.11:2379
2018-12-05 09:41:31.554976 I | etcdserver: restarting member 7d3fdaaceb134d3d in cluster d98ef57fc5131193 at commit index 15764
2018-12-05 09:41:31.556475 I | raft: 7d3fdaaceb134d3d became follower at term 2
2018-12-05 09:41:31.556576 I | raft: newRaft 7d3fdaaceb134d3d [peers: [], term: 2, commit: 15764, applied: 0, lastindex: 15764, lastterm: 2]
2018-12-05 09:41:31.710712 W | auth: simple token is not cryptographically signed
2018-12-05 09:41:31.739007 I | etcdserver: starting server... [version: 3.2.14, cluster version: to_be_decided]
2018-12-05 09:41:31.744323 I | embed: ClientTLS: cert = /etc/ssl/etcd/system:etcd-server:test1-etcd-0.tt.testing.crt, key = /etc/ssl/etcd/system:etcd-server:test1-etcd-0.tt.testing.key, ca = , trusted-ca = /etc/ssl/etcd/ca.crt, client-cert-auth = true
2018-12-05 09:41:31.749681 I | etcdserver/membership: added member 7d3fdaaceb134d3d [https://test1-etcd-0.tt.testing:2380] to cluster d98ef57fc5131193
2018-12-05 09:41:31.750073 N | etcdserver/membership: set the initial cluster version to 3.2
2018-12-05 09:41:31.750222 I | etcdserver/api: enabled capabilities for version 3.2
2018-12-05 09:41:32.458097 I | raft: 7d3fdaaceb134d3d is starting a new election at term 2
2018-12-05 09:41:32.458417 I | raft: 7d3fdaaceb134d3d became candidate at term 3
2018-12-05 09:41:32.458500 I | raft: 7d3fdaaceb134d3d received MsgVoteResp from 7d3fdaaceb134d3d at term 3
2018-12-05 09:41:32.458606 I | raft: 7d3fdaaceb134d3d became leader at term 3
2018-12-05 09:41:32.458666 I | raft: raft.node: 7d3fdaaceb134d3d elected leader 7d3fdaaceb134d3d at term 3
2018-12-05 09:41:32.466818 I | embed: ready to serve client requests
2018-12-05 09:41:32.467766 I | etcdserver: published {Name:etcd-member-test1-master-0 ClientURLs:[https://192.168.126.11:2379]} to cluster d98ef57fc5131193
2018-12-05 09:41:32.468564 I | embed: serving client requests on [::]:2379
WARNING: 2018/12/05 09:41:32 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

aaronlevy commented 5 years ago

I believe the locations are:

The kubelet will pick a random(?) time before expiration for it to request a new cert. So anywhere in the 30min window after starting, the cert might be rotated.

Ideally we would rotate immediately after it had posted CSR / got a full client cert. This is something that @abhinavdahiya was going to look into this sprint (see https://jira.coreos.com/browse/CORS-810). But there may be some kubelet behaviors that block this.
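
As a rough illustration of the rotation timing discussed above, the sketch below computes the window in which a rotation request might be made for a given certificate lifetime. The 70% and 90% bounds are an assumption chosen for illustration and are not confirmed anywhere in this thread; `rotation_window` is a hypothetical helper name.

```shell
#!/bin/sh
# rotation_window LIFETIME_SECONDS [LOW_PCT] [HIGH_PCT]
# Hypothetical helper: prints the earliest and latest rotation deadline,
# in seconds after issuance. The default 70/90 bounds are an assumption.
rotation_window() {
    lifetime=$1
    low=${2:-70}
    high=${3:-90}
    echo "$(( lifetime * low / 100 )) $(( lifetime * high / 100 ))"
}

# 30-minute bootstrap certificate, as recalled in the thread ("30min iirc").
rotation_window 1800
```

Under those assumed bounds this prints `1260 1620`, i.e. a 30-minute certificate would be rotated somewhere between minute 21 and minute 27. A single master that goes down before its deadline and comes back after minute 30 would be left holding an expired certificate.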

praveenkumar commented 5 years ago

@aaronlevy Below are the cert details from the master node where I am getting that error. I can't see that any of them has expired.

[root@test1-master-0 kubernetes]# pwd
/etc/kubernetes
[root@test1-master-0 kubernetes]# openssl x509 -in ca.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 0 (0x0)
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU=openshift, CN=root-ca
        Validity
            Not Before: Dec  5 07:20:17 2018 GMT
            Not After : Dec  2 07:20:17 2028 GMT
        Subject: OU=openshift, CN=root-ca
[root@test1-master-0 kubernetes]# ls -al /etc/ssl/certs/
total 12
drwxr-xr-x. 2 root root  117 Dec  5 05:55 .
drwxr-xr-x. 5 root root   81 Dec  5 05:55 ..
lrwxrwxrwx. 1 root root   49 Dec  5 05:55 ca-bundle.crt -> /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
lrwxrwxrwx. 1 root root   55 Dec  5 05:55 ca-bundle.trust.crt -> /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt
-rwxr-xr-x. 1 root root  610 Dec  5 05:55 make-dummy-cert
-rw-r--r--. 1 root root 2516 Dec  5 05:55 Makefile
-rwxr-xr-x. 1 root root  829 Dec  5 05:55 renew-dummy-cert
[root@test1-master-0 kubernetes]# openssl x509 -in /etc/pki/ca-trust/extracted/openssl/ca-bundle.trust.crt -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 6828503384748696800 (0x5ec3b7a6437fa4e0)
    Signature Algorithm: sha1WithRSAEncryption
        Issuer: CN=ACCVRAIZ1, OU=PKIACCV, O=ACCV, C=ES
        Validity
            Not Before: May  5 09:37:37 2011 GMT
            Not After : Dec 31 09:37:37 2030 GMT
        Subject: CN=ACCVRAIZ1, OU=PKIACCV, O=ACCV, C=ES
[root@test1-master-0 kubernetes]# openssl x509 -in /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 6828503384748696800 (0x5ec3b7a6437fa4e0)
    Signature Algorithm: sha1WithRSAEncryption
        Issuer: CN=ACCVRAIZ1, OU=PKIACCV, O=ACCV, C=ES
        Validity
            Not Before: May  5 09:37:37 2011 GMT
            Not After : Dec 31 09:37:37 2030 GMT

[root@test1-master-0 kubernetes]# ls -al /var/lib/kubelet/pki/
total 8
drwxr-xr-x. 2 root root  166 Dec  5 07:35 .
drwxr-xr-x. 7 root root  153 Dec  5 07:28 ..
-rw-------. 1 root root 1187 Dec  5 07:28 kubelet-client-2018-12-05-07-28-01.pem
lrwxrwxrwx. 1 root root   59 Dec  5 07:28 kubelet-client-current.pem -> /var/lib/kubelet/pki/kubelet-client-2018-12-05-07-28-01.pem
-rw-------. 1 root root 1240 Dec  5 07:35 kubelet-server-2018-12-05-07-35-14.pem
lrwxrwxrwx. 1 root root   59 Dec  5 07:35 kubelet-server-current.pem -> /var/lib/kubelet/pki/kubelet-server-2018-12-05-07-35-14.pem
[root@test1-master-0 kubernetes]# openssl x509 -in /var/lib/kubelet/pki/kubelet-client-2018-12-05-07-28-01.pem -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            09:3f:c5:f3:f8:6d:24:e6:7d:18:3e:de:a8:66:5c:bc:90:4e:a8:04
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU=bootkube, CN=kube-ca
        Validity
            Not Before: Dec  5 07:23:00 2018 GMT
            Not After : Jan  4 07:23:00 2019 GMT
        Subject: O=system:nodes, CN=system:node:test1-master-0
        Subject Public Key Info:
[root@test1-master-0 kubernetes]# openssl x509 -in /var/lib/kubelet/pki/kubelet-server-2018-12-05-07-35-14.pem -text -noout
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            3e:3d:e3:cc:c8:02:ca:22:d6:1f:1f:e3:70:b0:35:45:8d:04:3c:3c
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: OU=bootkube, CN=kube-ca
        Validity
            Not Before: Dec  5 07:30:00 2018 GMT
            Not After : Jan  4 07:30:00 2019 GMT
        Subject: O=system:nodes, CN=system:node:test1-master-0

aaronlevy commented 5 years ago

From what you posted in https://github.com/openshift/installer/issues/167#issuecomment-444426953

WARNING: 2018/12/05 09:41:32 Failed to dial 0.0.0.0:2379:

etcd is what is listening on :2379, so I don't believe this is the same issue as the original. Might be better to open a new issue to discuss the separate problem you're having. On a side note - I'm unsure why it would be dialing 0.0.0.0 -- fine to listen on all interfaces, but that seems wrong / maybe etcd DNS is configured improperly?

wking commented 5 years ago

@aaronlevy says:

etcd is what is listening on :2379, so I don't believe this is the same issue as the original. Might be better to open a new issue to discuss the separate problem you're having.

Already moved to coreos/kubecsr#22 ;).

eparis commented 5 years ago

Cert rotation and lifetimes are not something the installer will be addressing. If you are having problems, please work with the master team (preferably in BZ) for further discussion.