kopeio / etcd-manager

operator for etcd: moved to https://github.com/kubernetes-sigs/etcdadm
Apache License 2.0

Auth Errors on Cluster Upgrade #399

Open wskulley opened 3 years ago

wskulley commented 3 years ago

During an upgrade from 'default' kops 1.18 to 'default' kops 1.19, I encountered the following error on the first etcd-manager node to roll:

unable to grpc-ping discovered peer 10.28.114.172:3996: rpc error: code = Unavailable desc = connection error: desc = "transport: authentication handshake failed: x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0"

The replacement etcd would not join the existing cluster. I was able to bypass the issue by adding


    manager:
      env:
      - name: GODEBUG
        value: x509ignoreCN=0

to the kops etcd-manager cluster specs.

ssoriche commented 3 years ago

We recently did the same upgrade from kops 1.18 to 1.19 and encountered the same error message. In our case, however, it happened on our second etcd-manager node, and setting GODEBUG=x509ignoreCN=0 on the replacement node did not allow etcd to start, which blocked kube-apiserver from starting, and so on.

To get etcd to start, we had to perform a rolling update on the third node (which had the IP address from the error message) with the --cloudonly option specified. Once the third node was replaced (and hadn't necessarily rejoined the cluster), the second node started etcd and joined both the etcd and Kubernetes clusters. The third node then joined both clusters without issue.

sp-francisco-manas commented 3 years ago

⚠️ +1

Since etcd-manager was upgraded to Go 1.15 (CommonName deprecation), all upgrades to kOps 1.19 are breaking: the first master never joins the etcd clusters. The problem is that the generated certs contain this field, which has been deprecated for 20 years already. Go has enforced this since 1.15, and it refuses to connect even if you have an AltNames field (#362 added the field, but it should have removed the CN too).

Until a proper fix is implemented, you need to use the workaround to roll back to the old behaviour in Go. I think the proper solution is to stop generating certificates with a CN in etcd-manager and rotate certs on all masters later on (1.20, 1.21?). But I'm not sure whether removing it would have second-order effects.

Update: In our case, the issue was not related to this. We had a config mistake: both etcd and etcd-events were bound to the same metrics port. The Go deprecation log still appears during startup, but it was just noise, not the underlying issue.
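A quick sanity check for this kind of mistake could look like the sketch below, which groups listeners by port and reports any port claimed more than once. The listener names and port numbers are hypothetical and are not read from a real kops or etcd-manager config.

```go
package main

import (
	"fmt"
	"sort"
)

// conflictingPorts (hypothetical helper) returns every port that is assigned
// to more than one named listener, with the listener names sorted.
func conflictingPorts(listeners map[string]int) map[int][]string {
	byPort := make(map[int][]string)
	for name, port := range listeners {
		byPort[port] = append(byPort[port], name)
	}
	conflicts := make(map[int][]string)
	for port, names := range byPort {
		if len(names) > 1 {
			sort.Strings(names)
			conflicts[port] = names
		}
	}
	return conflicts
}

func main() {
	// Hypothetical listener names and ports, for illustration only.
	listeners := map[string]int{
		"etcd-main-metrics":   8081,
		"etcd-events-metrics": 8081,
		"etcd-main-client":    4001,
	}
	for port, names := range conflictingPorts(listeners) {
		fmt.Printf("port %d is claimed by %v\n", port, names)
	}
}
```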