Closed Oats87 closed 4 months ago
The pattern you're showing here doesn't make any sense to me. If we look at the release dates:
date | version | average |
---|---|---|
2023-09-18 | v1.27.6+rke2r1 | |
v1.28.2+rke2r1 | 529 | |
2023-10-31 | v1.27.7+rke2r1 | |
v1.28.3+rke2r1 | ||
2023-11-08 | v1.27.7+rke2r2 | |
v1.28.3+rke2r2 | ||
2023-12-05 | v1.27.8+rke2r1 | |
v1.28.4+rke2r1 | 536 | |
2023-12-26 | v1.27.9+rke2r1 | |
v1.28.5+rke2r1 | ||
2024-02-06 | v1.27.10+rke2r1 | 377 |
v1.28.6+rke2r1 | 531 | |
2024-02-29 | v1.27.11+rke2r1 | 375 |
v1.28.7+rke2r1 | ||
2024-03-20 | v1.27.12+rke2r1 | 399 |
v1.28.8+rke2r1 | 540 | |
2024-04-29 | v1.27.13+rke2r1 | 431 |
v1.28.9+rke2r1 | ||
2024-05-22 | v1.27.14+rke2r1 | 430 |
v1.28.10+rke2r1 |
First of all, this is hard to strictly compare because you appear to be somewhat arbitrarily picking the versions to test. There is some overlap in the middle, but by and large you've picked older 1.28 releases, and newer 1.27 releases. Some of the 1.28 releases which you show as being slower, are actually much older than the 1.27 releases that are faster. There are only two releases where we have data for both minors, and there appears to be no correlation at all in speed - if anything it suggests that there is something in Kubernetes 1.28 itself that is slower?
It looks like you've got this set up to collect logs from the runs, can you share those logs so that we can perhaps see where it's waiting?
Just to ensure I'm getting the right context here, is this causing tests to timeout and fail? Or are we trying to make sure we maintain velocity? (Both are equally important I'm just trying to understand the effects better)
@brandond it might be good if we (or QA) can do some runs against 1.29 releases as well to see if there's any difference there. My gut feel is that something is off with 1.28, but I've certainly been wrong before
The pattern you're showing here doesn't make any sense to me. If we look at the release dates: ...
Looking at the table the main takeaway for me is that it's 1.28 itself where there is a noticeable slowdown of cert rotation operation (~20-25% on average). Combined with (much smaller) slowdown in newer patches of 1.27 the overall performance loss is >40%. So focusing on 1.28 "itself" (vs. monthly patches) changes could probably be a good start.
Just to ensure I'm getting the right context here, is this causing tests to timeout and fail? Or are we trying to make sure we maintain velocity? (Both are equally important I'm just trying to understand the effects better)
Yes, our timeout for executing single test is 600s and we've started hitting it quite a lot (see https://github.com/rancher/rancher/issues/45577) and temporarily pinned version to v1.27.10+rke2r1
to improve CI stability.
If it's in Kubernetes 1.28 itself there is not necessarily much we can do about it, but I'll try to investigate.
I'm trying to use the script from @Oats87 to test with the current code from the release/v2.9 branch and it appears something weird is going on... using both rke2 and k3s, the cluster fails to even come up, apparently because the initial etcd node is being configured to join itself (server points at its own IP), instead of being started with cluster-init: true
?
cc968520a120:/go/src/github.com/rancher/rancher # kubectl get pod -A -o wide
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
cattle-fleet-local-system fleet-agent-0 2/2 Running 0 12m 10.42.0.11 local-node <none> <none>
cattle-fleet-system fleet-controller-6dd465cd9-x2gzz 3/3 Running 0 12m 10.42.0.6 local-node <none> <none>
cattle-fleet-system gitjob-f696768db-5gtg5 1/1 Running 0 12m 10.42.0.5 local-node <none> <none>
cattle-provisioning-capi-system capi-controller-manager-9b9c8ccf8-9x568 1/1 Running 0 11m 10.42.0.15 local-node <none> <none>
cattle-system helm-operation-5rfht 0/2 Completed 0 12m 10.42.0.4 local-node <none> <none>
cattle-system helm-operation-9lzgj 0/2 Completed 0 12m 10.42.0.3 local-node <none> <none>
cattle-system helm-operation-cc2np 0/2 Completed 0 11m 10.42.0.21 local-node <none> <none>
cattle-system helm-operation-gmcrp 0/2 Completed 0 12m 10.42.0.9 local-node <none> <none>
cattle-system helm-operation-h4df9 0/2 Completed 0 11m 10.42.0.14 local-node <none> <none>
cattle-system helm-operation-lm57n 0/2 Completed 0 12m 10.42.0.12 local-node <none> <none>
cattle-system rancher-webhook-5f69889cd8-v2mht 1/1 Running 0 12m 10.42.0.13 local-node <none> <none>
default registry-cache 1/1 Running 0 11m 10.42.0.17 local-node <none> <none>
kube-system coredns-576bfc4dc7-kxmjf 1/1 Running 0 13m 10.42.0.2 local-node <none> <none>
test-ns-56jsf test-node-bspzv 1/1 Running 2 (10m ago) 11m 10.42.0.18 local-node <none> <none>
test-ns-56jsf test-node-kjlwc 1/1 Running 1 (10m ago) 11m 10.42.0.19 local-node <none> <none>
test-ns-56jsf test-node-p9hgz 1/1 Running 0 11m 10.42.0.20 local-node <none> <none>
cc968520a120:/go/src/github.com/rancher/rancher # kubectl logs --tail=10 -n test-ns-56jsf test-node-kjlwc
Jul 03 16:53:13 test-node-kjlwc sh[6924]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jul 03 16:53:13 test-node-kjlwc sh[6925]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jul 03 16:53:13 test-node-kjlwc (modprobe)[6926]: k3s.service: Executable /sbin/modprobe missing, skipping: No such file or directory
Jul 03 16:53:13 test-node-kjlwc (modprobe)[6927]: k3s.service: Executable /sbin/modprobe missing, skipping: No such file or directory
Jul 03 16:53:13 test-node-kjlwc k3s[6928]: time="2024-07-03T16:53:13Z" level=info msg="ETCD snapshots are disabled"
Jul 03 16:53:13 test-node-kjlwc k3s[6928]: time="2024-07-03T16:53:13Z" level=info msg="Starting k3s v1.27.6+k3s1 (bd04941a)"
Jul 03 16:53:13 test-node-kjlwc k3s[6928]: time="2024-07-03T16:53:13Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://10.42.0.19:6443/cacerts\": dial tcp 10.42.0.19:6443: connect: connection refused"
Jul 03 16:53:13 test-node-kjlwc systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Jul 03 16:53:13 test-node-kjlwc systemd[1]: k3s.service: Failed with result 'exit-code'.
Jul 03 16:53:13 test-node-kjlwc systemd[1]: Failed to start Lightweight Kubernetes.
cc968520a120:/go/src/github.com/rancher/rancher # kubectl exec -it -n test-ns-56jsf test-node-kjlwc /bin/bash
test-node-kjlwc:/ # k3s --version
k3s version v1.27.6+k3s1 (bd04941a)
go version go1.20.8
test-node-kjlwc:/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 22:26:e8:c5:ad:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.42.0.19/24 brd 10.42.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::2026:e8ff:fec5:ad1e/64 scope link
valid_lft forever preferred_lft forever
test-node-kjlwc:/ # cat /etc/rancher/k3s/config.yaml.d/50-rancher.yaml
{
"agent-token": "p7sqslqkkd9bmgnjd9bvqfwhjlzddhxz6k6zk8tqxxmq6w4xk5wznf",
"cluster-cidr": "10.44.0.0/16",
"disable-apiserver": true,
"disable-controller-manager": true,
"disable-scheduler": true,
"etcd-disable-snapshots": true,
"node-label": [
"cattle.io/os=linux",
"rke.cattle.io/machine=cb4ed48d-3245-4b20-8ca8-0193e69ac924"
],
"node-taint": [
"node-role.kubernetes.io/etcd:NoExecute"
],
"private-registry": "/etc/rancher/k3s/registries.yaml",
"server": "https://10.42.0.19:6443",
"service-cidr": "10.45.0.0/16",
"token": "pk4q4nc8xsq7hpkttsj7456rndfjl9tz46zxwdsfsxm2ztdfn9d2bq"
}
The test data bundle from the logfile is available here: https://gist.github.com/brandond/2dd31e8be6a52c863ebcc10681aaf5f4
I don't know if my host doesn't have enough memory or what, but my systemd-node pods keep getting killed, and when this happens they come up without the state from their previous run so everything gets very confused. This host has 32GB of RAM so it should be fine...
d69d03b2bff6:/go/src/github.com/rancher/rancher # kubectl get pod -A -l custom-cluster-name=test-custom-certificate-rotation-operations
NAMESPACE NAME READY STATUS RESTARTS AGE
test-ns-s25qm test-node-87tn8 1/1 Running 0 10m
test-ns-s25qm test-node-8zr29 1/1 Running 2 (9m36s ago) 10m
test-ns-s25qm test-node-9stph 1/1 Running 1 (9m37s ago) 10m
d69d03b2bff6:/go/src/github.com/rancher/rancher # kubectl describe pod -n test-ns-s25qm test-node-8zr29
Name: test-node-8zr29
Namespace: test-ns-s25qm
Priority: 0
Service Account: default
Node: local-node/172.17.0.9
Start Time: Wed, 03 Jul 2024 17:16:06 +0000
Labels: custom-cluster-name=test-custom-certificate-rotation-operations
Annotations: <none>
Status: Running
IP: 10.42.0.18
IPs:
IP: 10.42.0.18
Containers:
container:
Container ID: containerd://f994b3c813b3475af2fb05ca38d18a81ebfbeb7a1fd10602ab4b1845d897056f
Image: rancher/systemd-node:v0.0.5-rc2
Image ID: docker.io/rancher/systemd-node@sha256:370897b78f5d3e8841b446017474f24e2a2530c26686971aafd4de1cbf792c61
Port: <none>
Host Port: <none>
State: Running
Started: Wed, 03 Jul 2024 17:17:10 +0000
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Wed, 03 Jul 2024 17:16:37 +0000
Finished: Wed, 03 Jul 2024 17:16:54 +0000
Ready: True
Restart Count: 2
Environment: <none>
Mounts:
/etc/default/k3s from systemd (rw,path="disable")
/etc/default/k3s-agent from systemd (rw,path="disable")
/etc/default/rancher-system-agent from systemd (rw,path="disable")
/etc/default/rke2-agent from systemd (rw,path="disable")
/etc/default/rke2-server from systemd (rw,path="disable")
/var/lib/cloud/seed/nocloud/user-data from seed (rw,path="user-data")
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
seed:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: test-node-data-6mj5s
Optional: false
systemd:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: test-node-configure-systemd-cz28k
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 10m default-scheduler Successfully assigned test-ns-s25qm/test-node-8zr29 to local-node
Warning BackOff 10m kubelet Back-off restarting failed container container in pod test-node-8zr29_test-ns-s25qm(37d00b0f-94c0-4014-9154-02ffe6279f4e)
Normal Pulled 9m55s (x3 over 10m) kubelet Container image "rancher/systemd-node:v0.0.5-rc2" already present on machine
Normal Created 9m55s (x3 over 10m) kubelet Created container container
Normal Started 9m55s (x3 over 10m) kubelet Started container container
@Oats87 I think I'm going to need your help to collect more data on this, since I don't seem to be able to get your test script running locally for whatever reason. Is there any way we can modify the test harness to collect logs from from the systemd-node pods once the test has been completed, so that we can extract timing data from the different stages of the certificate rotation process and figure out where the delays are coming from?
@brandond what is your underlying OS? unfortunately provisioning tests are not compatible with unified cgroups, which is why I'm using ubuntu 18.04
ah yeah I'm on 24.04 with cgroups v2; I'll have to spin up an older VM if it requires v1.
Alright, seems to be working on a 18.04 VM. I am running tests with this commit added to dump the pod logs: https://github.com/brandond/rancher/commit/90f330aa56634206fa512e08aa8b09cc66e99dfe
Currently testing with
VERSIONS="v1.27.12+k3s1 v1.27.13+k3s1 v1.27.14+k3s1 v1.28.8+k3s1 v1.28.9+k3s1 v1.28.10+k3s1"
It'll take a while to run but unsurprisingly K3s is a good bit faster than RKE2 so far:
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-1.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (154.85s)
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-2.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (150.27s)
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-3.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (167.95s)
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-4.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (156.41s)
@Oats87 is your recorded time the time reported by the test itself as shown above, or the full delta between the Test start time
and Test end time
entries in the log output?
I left some tests running yesterday during the holiday, and after a number of runs for each version on each distro, came up with the following averages for the time reported by the PASS: Test_Operation_SetA_Custom_CertificateRotation
log output.
Version | AVERAGE | STDEV | MAX |
---|---|---|---|
v1.27.12+k3s1 | 157 | 6 | 169 |
v1.27.13+k3s1 | 160 | 10 | 184 |
v1.27.14+k3s1 | 159 | 8 | 177 |
v1.28.8+k3s1 | 155 | 6 | 167 |
v1.28.9+k3s1 | 157 | 5 | 165 |
v1.28.10+k3s1 | 156 | 5 | 165 |
v1.27.12+rke2r1 | 443 | 27 | 499 |
v1.27.13+rke2r1 | 487 | 59 | 612 |
v1.27.14+rke2r1 | 464 | 47 | 598 |
v1.28.8+rke2r1 | 639 | 173 | 999 |
v1.28.9+rke2r1 | 626 | 97 | 856 |
v1.28.10+rke2r1 | 633 | 102 | 839 |
If we pick on the run with the longest time (999 seconds) we can see that even in this case the actual rotation itself takes less than a second:
run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-5.log
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
--- PASS: Test_Operation_SetA_Custom_CertificateRotation (999.37s)
We see similar timing for the rotation on the shortest run:
run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-2.log
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
--- PASS: Test_Operation_SetA_Custom_CertificateRotation (513.15s)
Lets compare the full logs for the slow cluster to the fast cluster
run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-5.log
Jul 05 14:04:40 test-node-brpwr systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 14:04:59 test-node-brpwr rke2[381]: time="2024-07-05T14:04:59Z" level=info msg="ETCD server is now running"
Jul 05 14:04:59 test-node-brpwr rke2[381]: time="2024-07-05T14:04:59Z" level=info msg="Waiting for API server to become available"
Jul 05 14:04:59 test-node-brpwr systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:05:01 test-node-brpwr rke2[381]: time="2024-07-05T14:05:01Z" level=info msg="Pod for etcd is synced"
Jul 05 14:05:11 test-node-brpwr rke2[381]: time="2024-07-05T14:05:11Z" level=info msg="Waiting for API server to become available"
Jul 05 14:05:15 test-node-5vpdb rke2[377]: time="2024-07-05T14:05:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:05:15 test-node-5vpdb systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 14:05:41 test-node-5vpdb rke2[377]: time="2024-07-05T14:05:41Z" level=info msg="Kube API server is now running"
Jul 05 14:05:41 test-node-brpwr rke2[381]: time="2024-07-05T14:05:41Z" level=info msg="Waiting for API server to become available"
Jul 05 14:05:44 test-node-5vpdb systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:07:57 test-node-c9thj systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 14:08:07 test-node-c9thj systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:09:11 test-node-brpwr systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 05 14:09:11 test-node-brpwr rke2[381]: time="2024-07-05T14:09:11Z" level=fatal msg="context canceled"
Jul 05 14:09:11 test-node-brpwr systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 14:09:11 test-node-brpwr systemd[1]: Starting Rancher Kubernetes Engine v2 (server)..
Jul 05 14:09:59 test-node-brpwr rke2[2781]: time="2024-07-05T14:09:59Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:10:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:10:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:19Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:10:26 test-node-5vpdb rke2[377]: time="2024-07-05T14:10:26Z" level=fatal msg="leaderelection lost for rke2-etcd"
Jul 05 14:10:31 test-node-5vpdb systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 14:10:31 test-node-5vpdb systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 14:10:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:39Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:10:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:10:59 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:59Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:11:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:11:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:19Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:11:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:11:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:11:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:39Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:11:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:11:59 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:59Z" level=info msg="Pod for etcd is synced"
Jul 05 14:12:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:08 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:08Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:38 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:38Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:43 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:43Z" level=info msg="ETCD server is now running"
Jul 05 14:12:43 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:43Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:43 test-node-brpwr systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:12:43 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:12:43Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 14:12:44 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:12:44Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 14:12:44 test-node-c9thj systemd[1]: Stopping Rancher Kubernetes Engine v2 (agent)...
Jul 05 14:12:44 test-node-c9thj rke2[376]: time="2024-07-05T14:12:44Z" level=fatal msg="context canceled"
Jul 05 14:12:44 test-node-c9thj systemd[1]: Stopped Rancher Kubernetes Engine v2 (agent).
Jul 05 14:12:44 test-node-c9thj systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 14:13:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:09 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:15 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:39 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:45 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:45Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:09 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:15 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:39 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:45 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:45Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:09 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:17 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:17Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:41 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:41Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:47 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:47Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:11 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:11Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:19Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:43 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:43Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:49 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:49Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:13 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:13Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:19Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:43 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:43Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:51 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:51Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:15 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:22 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:22Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:45 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:45Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:52 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:52Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:53 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:53Z" level=info msg="Kube API server is now running"
Jul 05 14:18:55 test-node-5vpdb systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:18:59 test-node-c9thj systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 14:18:59 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:18:59 test-node-5vpdb systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 05 14:18:59 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:59Z" level=fatal msg="context canceled"
Jul 05 14:19:00 test-node-5vpdb systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 14:19:00 test-node-5vpdb systemd[1]: Starting Rancher Kubernetes Engine v2 (server)..
Jul 05 14:20:20 test-node-5vpdb rke2[7753]: time="2024-07-05T14:20:20Z" level=info msg="Waiting for API server to become available"
Jul 05 14:20:38 test-node-5vpdb rke2[7753]: time="2024-07-05T14:20:38Z" level=info msg="Kube API server is now running"
Jul 05 14:20:40 test-node-5vpdb systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:20:40 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:20:40Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
We see the same behavior (failure to renew lease on the control-plane server) on the faster run:
run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-2.log
Jul 05 13:17:09 test-node-mjl4s systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:17:29 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:29Z" level=info msg="Waiting for API server to become available"
Jul 05 13:17:29 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:29Z" level=info msg="ETCD server is now running"
Jul 05 13:17:29 test-node-mjl4s systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:17:29 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:29Z" level=info msg="Pod for etcd is synced"
Jul 05 13:17:39 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:39Z" level=info msg="Waiting for API server to become available"
Jul 05 13:17:44 test-node-tcgjt rke2[378]: time="2024-07-05T13:17:44Z" level=info msg="Waiting for API server to become available"
Jul 05 13:17:44 test-node-tcgjt systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:18:09 test-node-mjl4s rke2[381]: time="2024-07-05T13:18:09Z" level=info msg="Waiting for API server to become available"
Jul 05 13:18:09 test-node-tcgjt rke2[378]: time="2024-07-05T13:18:09Z" level=info msg="Kube API server is now running"
Jul 05 13:18:12 test-node-tcgjt systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:20:11 test-node-bljrc systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 13:20:22 test-node-bljrc systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 13:20:56 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:20:56Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 13:20:56 test-node-bljrc systemd[1]: Stopping Rancher Kubernetes Engine v2 (agent)...
Jul 05 13:20:56 test-node-bljrc rke2[374]: time="2024-07-05T13:20:56Z" level=fatal msg="context canceled"
Jul 05 13:20:56 test-node-bljrc systemd[1]: Stopped Rancher Kubernetes Engine v2 (agent).
Jul 05 13:20:56 test-node-bljrc systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 13:21:01 test-node-bljrc systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 13:21:01 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:21:01 test-node-mjl4s systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 05 13:21:01 test-node-mjl4s rke2[381]: time="2024-07-05T13:21:01Z" level=fatal msg="context canceled"
Jul 05 13:21:01 test-node-mjl4s systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 13:21:01 test-node-mjl4s systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:21:47 test-node-mjl4s rke2[1798]: time="2024-07-05T13:21:47Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:21:57 test-node-mjl4s rke2[1798]: time="2024-07-05T13:21:57Z" level=info msg="Waiting for API server to become available"
Jul 05 13:22:07 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:07Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:22:16 test-node-tcgjt rke2[378]: time="2024-07-05T13:22:16Z" level=fatal msg="leaderelection lost for rke2-etcd"
Jul 05 13:22:21 test-node-tcgjt systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 13:22:21 test-node-tcgjt systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:22:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:27Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:22:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:27Z" level=info msg="Waiting for API server to become available"
Jul 05 13:22:47 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:47Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:22:57 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:57Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:07 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:07Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:23:24 test-node-tcgjt rke2[6448]: time="2024-07-05T13:23:24Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:27Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:23:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:27Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:47 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:47Z" level=info msg="Pod for etcd is synced"
Jul 05 13:23:54 test-node-tcgjt rke2[6448]: time="2024-07-05T13:23:54Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:54 test-node-tcgjt rke2[6448]: time="2024-07-05T13:23:54Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:57 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:57Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:22 test-node-mjl4s rke2[1798]: time="2024-07-05T13:24:22Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:22 test-node-mjl4s systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:24:22 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:24:22 test-node-tcgjt rke2[6448]: time="2024-07-05T13:24:22Z" level=fatal msg="Failed to start kube-proxy: open /var/lib/rancher/rke2/agent/client-kube-proxy.crt: no such file or directory"
Jul 05 13:24:22 test-node-tcgjt rke2[6448]: time="2024-07-05T13:24:22Z" level=fatal msg="open /var/lib/rancher/rke2/server/tls/client-scheduler.crt: no such file or directory"
Jul 05 13:24:23 test-node-tcgjt systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 13:24:23 test-node-tcgjt systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:24:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:24:27Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:43 test-node-tcgjt rke2[6915]: time="2024-07-05T13:24:43Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:49 test-node-tcgjt rke2[6915]: time="2024-07-05T13:24:49Z" level=info msg="Kube API server is now running"
Jul 05 13:25:03 test-node-tcgjt systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:25:03 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:25:03Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Moving this to a separate comment for emphasis:
It appears that the root cause of the delays is that v2prov is not following the documented order for certificate rotation. On the slow cluster it is rotating etcd, then agent, then control-plane, and the agent restart takes forever because both the control-plane and agent nodes are experiencing an outage due to the datastore going unavailable. On the fast cluster it is rotating agent, then etcd, then control-plane, and the agent restarts quickly although it is ineffective because the server hasn't been rotated yet. Looking at the logs from many different runs, it appears the order is random?
# slow cluster
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:12:43 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:12:43Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 14:12:44 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:12:44Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 14:18:59 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:20:40 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:20:40Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
# fast cluster
Jul 05 13:20:56 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:20:56Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 13:21:01 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:24:22 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:25:03 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:25:03Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Server nodes must ALWAYS be rotated before agents, and the etcd nodes should be rotated before control-plane nodes. The docs at https://docs.rke2.io/advanced#certificate-rotation cover the server->agent node order requirement, although they do not express a preference for etcd/control-plane in the case of split-role clusters. It is also rotating the certificates then restarting the service, which is not what the docs say to do. The service should be stopped, the certificates rotated, then the service started again. I don't know that we have extensively validated simply rotating then restarting.
I will also note that the v2prov tests are using an unsupported configuration with a single etcd node and a single control-plane node. In this configuration the restart of the single etcd node frequently takes down both the control-plane and agent nodes, which is why the test takes so long to complete.
K3s is faster because k3s does everything faster, and the control-plane node doesn't have time to crash and take down the agent, as etcd comes back quickly enough to avoid an outage that causes the controller lease to expire.
I don't think it's worth investigating why the 1.27 kubelet is faster to reconcile the static pods than 1.28; I suspect it has to do with upstream changes to the sync loop that were intended to fix issues with volume attachment. If we can get v2prov changed to observe the correct order of operations, and the tests changed to use a supported cluster configuration, and we still see issues... then we can try to optimize the timing.
Created rancher/rancher issue for tracking: https://github.com/rancher/rancher/issues/46034
Environmental Info: RKE2 Version:
v1.27.10+rke2r1
to currentDescribe the bug: Over time, we have observed the certificate rotation operations taking longer and longer to complete.
Steps To Reproduce: I used a test script on an
ubuntu 18.04
VM running Docker20.10
, and used the Rancher provisioning tests for this onrelease/v2.9
The script I used can be seen here:
and requires a clone of the Rancher repo. It also assumes you are
root
on the machine, and that rancher is cloned to/root/rancher
but this can easily be modified.Expected behavior: RKE2 certificate rotation times don't take longer to complete.
Actual behavior: From
v1.27.10+rke2r1
tov1.28.8+rke2r1
the average test iteration time increased by 2.7 minutesAdditional context / logs:
Test_Operation_SetA_Custom_CertificateRotation-v1.27-v1.28-Results.xlsx