Logs from master-2:
# sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=openshift-apiserver --quiet) --quiet)
Failed to execute operation: Unit file tuned.service does not exist.
I0514 13:25:47.529133 115916 openshift-tuned.go:176] Extracting tuned profiles
I0514 13:25:47.532185 115916 openshift-tuned.go:596] Resync period to pull node/pod labels: 57 [s]
E0514 13:25:47.532998 115916 openshift-tuned.go:686] Get https://172.30.0.1:443/api/v1/nodes/master-2: dial tcp 172.30.0.1:443: connect: connection refused
I0514 13:25:52.539859 115916 openshift-tuned.go:176] Extracting tuned profiles
I0514 13:25:52.545891 115916 openshift-tuned.go:596] Resync period to pull node/pod labels: 58 [s]
E0514 13:25:52.546284 115916 openshift-tuned.go:686] Get https://172.30.0.1:443/api/v1/nodes/master-2: dial tcp 172.30.0.1:443: connect: connection refused
I0514 13:25:57.546506 115916 openshift-tuned.go:176] Extracting tuned profiles
I0514 13:25:57.557974 115916 openshift-tuned.go:596] Resync period to pull node/pod labels: 56 [s]
# sudo crictl logs $(sudo crictl ps --pod=$(sudo crictl pods --name=keepalived --quiet) --quiet)
9: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:38 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:40 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:42 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:44 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:46 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:48 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:50 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:52 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:54 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:56 2019: Track script chk_ocp is already running, expect idle - skipping run
Tue May 14 13:22:58 2019: Track script chk_ocp is already running, expect idle - skipping run
Just to clarify, you left a delay after the power-off before testing the API access via the VIP, right?
I've tested this several times before and it worked, so if the VIP doesn't fail over then something must have changed. Perhaps @yboaron or @celebdor may have seen something?
I've kept master-0 down, and tried to access the VIP a few minutes after the host shutdown.
I think the title is a bit misleading, as it implies that there is a problem with VIP failover. In the logs included with your original post, it appears that the VIP did fail over properly to a different master (master-2?).
Instead, the API itself indeed seems down. It looks like you tried to show the API server logs, but the output above is from openshift-tuned instead. I'd be looking at the API server on the master where the VIP is, though...
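For reference, a minimal sketch of how one might check both halves of this by hand; the VIP value and the 6443 API port are assumptions here, substitute the values from your environment:
# Sketch only; API_VIP is a placeholder for this cluster's API VIP.
API_VIP=<api-vip>
# On each surviving master, check whether keepalived has bound the VIP to an interface here.
ip -4 addr show | grep "$API_VIP"
# From a machine outside the masters, probe the API server through the VIP.
curl -k "https://${API_VIP}:6443/healthz"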
I logged in to this cluster to take a look. This is the second time I've seen a cluster in this state.
To recap:
Now, what I see:
- The kube-apiserver logs show that it fails connecting to etcd. See these errors which repeat over and over in the log:
W0514 20:23:17.258188 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-2.ostest.test.metalkube.org:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-2.ostest.test.metalkube.org, not etcd-0.ostest.test.metalkube.org". Reconnecting...
W0514 20:23:18.009310 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-1.ostest.test.metalkube.org:2379 0 <nil>}. Err :connection error: desc = "transport: authentication handshake failed: x509: certificate is valid for localhost, etcd.kube-system.svc, etcd.kube-system.svc.cluster.local, etcd-1.ostest.test.metalkube.org, not etcd-0.ostest.test.metalkube.org". Reconnecting...
W0514 20:23:18.344866 1 clientconn.go:1304] grpc: addrConn.createTransport failed to connect to {etcd-0.ostest.test.metalkube.org:2379 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup etcd-0.ostest.test.metalkube.org on 192.168.111.2:53: no such host". Reconnecting...
Note the hostname mismatch. It's failing because it's expecting the certs from etcd-1 and etcd-2 to be valid for etcd-0 for some reason.
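One quick way to confirm what a given etcd member's serving certificate actually covers is to dump its SANs directly. This is just a sketch run from a master, assuming the etcd-N names resolve there:
# Print the Subject Alternative Names presented by etcd-1's serving certificate.
echo | openssl s_client -connect etcd-1.ostest.test.metalkube.org:2379 2>/dev/null \
  | openssl x509 -noout -text \
  | grep -A1 'Subject Alternative Name'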
This issue has been fixed in a newer version of OpenShift. We should not hit this problem after our next rebase.
It seems to be resolved with the latest rebase:
$ openstack baremetal node list
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
| UUID                                 | Name               | Instance UUID                        | Power State | Provisioning State | Maintenance |
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
| 97ef4c94-231c-4b35-81c6-022047cc4306 | openshift-master-1 | None                                 | power off   | active             | False       |
| cac8b6a1-8483-4fe7-950f-57aa2a2d6a7e | openshift-master-2 | None                                 | power on    | active             | False       |
| 8be1e40c-93fa-40be-96ca-8d855e105792 | openshift-master-0 | None                                 | power on    | active             | False       |
| b6e4dc37-cfa5-4de3-89d8-1c03a4070c65 | openshift-worker-1 | b6e4dc37-cfa5-4de3-89d8-1c03a4070c65 | power on    | active             | False       |
| a79b4964-be11-497e-af33-4fe14bff5de8 | openshift-worker-2 | a79b4964-be11-497e-af33-4fe14bff5de8 | power on    | active             | False       |
| e6896e94-70cc-4d85-83e6-61db01f37da2 | openshift-worker-0 | e6896e94-70cc-4d85-83e6-61db01f37da2 | power on    | active             | False       |
+--------------------------------------+--------------------+--------------------------------------+-------------+--------------------+-------------+
$ oc get nodes
NAME       STATUS     ROLES    AGE   VERSION
master-0   Ready      master   27h   v1.13.4+c3617b99f
master-1   NotReady   master   27h   v1.13.4+c3617b99f
master-2   Ready      master   27h   v1.13.4+c3617b99f
worker-0   Ready      worker   27h   v1.13.4+c3617b99f
worker-1   Ready      worker   27h   v1.13.4+c3617b99f
worker-2   Ready      worker   27h   v1.13.4+c3617b99f
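For completeness, bringing master-1 back and watching it rejoin could look something like this (a sketch, assuming the Ironic node name from the listing above):
# Power the node back on via Ironic, then watch the Kubernetes node return to Ready.
openstack baremetal node power on openshift-master-1
oc get nodes -w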
Closing based on the latest test: https://github.com/openshift-metal3/dev-scripts/issues/534#issuecomment-493919967
Great! Thanks for the follow up!
Describe the bug
When a single master goes down, the API is no longer available (virt).

To Reproduce

Expected/observed behavior
The API VIP should still be available with 2 masters.
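A rough sketch of how one might reproduce and verify this on the virt setup; the libvirt domain name is a placeholder and the delay is arbitrary:
# Find the libvirt domain backing one master and power it off hard.
sudo virsh list --all
sudo virsh destroy <one-master-domain>   # placeholder: pick a single master domain
# Give keepalived time to move the VIP, then test the API again.
sleep 120
oc get nodes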