rancher / rke2

https://docs.rke2.io/
Apache License 2.0
1.57k stars 268 forks source link

Certificate rotation has progressively become slower from v1.27 to now #6278

Closed Oats87 closed 4 months ago

Oats87 commented 4 months ago

Environmental Info: RKE2 Version: v1.27.10+rke2r1 to current

Describe the bug: Over time, we have observed the certificate rotation operations taking longer and longer to complete.

Steps To Reproduce: I used a test script on an ubuntu 18.04 VM running Docker 20.10, and used the Rancher provisioning tests for this on release/v2.9

The script I used can be seen here:

#!/bin/bash

# Space delimited versions of RKE2 or K3s to run

V2PROV_TEST_RUN_REGEX=Test_Operation_SetA_Custom_CertificateRotation
V2PROV_TEST_DIST=rke2
VERSIONS="v1.27.10+rke2r1 v1.27.11+rke2r1 v1.27.12+rke2r1 v1.27.13+rke2r1 v1.27.14+rke2r1 v1.28.2+rke2r1 v1.28.4+rke2r1 v1.28.6+rke2r1 v1.28.8+rke2r1"
ITERATIONS=5
GENERATION_MAX=20

# Absolute path...
LOG_DIR="/root/output"
RANCHER_REPO_DIR="/root/rancher"

mkdir -p ${LOG_DIR}

pushd ${RANCHER_REPO_DIR}

if [ -z "$(cat ${RANCHER_REPO_DIR}/Dockerfile.dapper | grep "SOME_K8S_VERSION")" ]; then
    echo "Mutating Dockerfile.dapper with SOME_K8S_VERSION..."
    sed -i 's/^ENV DAPPER_ENV.*/& SOME_K8S_VERSION/' ${RANCHER_REPO_DIR}/Dockerfile.dapper
    git commit Dockerfile.dapper -m "add SOME_K8S_VERSION"
fi

#GENERATION=-5

echo "Running V2PROV Tests"
echo "V2PROV_TEST_RUN_REGEX=${V2PROV_TEST_RUN_REGEX}"
echo "V2PROV_TEST_DIST=${V2PROV_TEST_DIST}"
echo "VERSIONS=${VERSIONS}"
echo "ITERATIONS=${ITERATIONS}"
echo "LOG_DIR=${LOG_DIR}"

if [ -z ${GENERATION} ] && [ -f "${LOG_DIR}/generation" ]; then
    GENERATION=$((cat ${LOG_DIR}/generation));
    if [ -z "${GENERATION}" ]; then
        GENERATION=0
    else
        GENERATION=$((GENERATION+1))
    fi
fi

if [ -z "${GENERATION}" ]; then
        GENERATION=0
fi

echo "GENERATION=${GENERATION}"
echo ${GENERATION} > ${LOG_DIR}/generation

for GENERATION in $(seq $GENERATION $GENERATION_MAX); do
    for SOME_K8S_VERSION in ${VERSIONS}; do
        for i in $(seq 1 ${ITERATIONS}); do
            docker system prune -f --all
            echo "Running tests with ${SOME_K8S_VERSION}: generation: ${GENERATION}; iteration ${i} at $(date)"
            TEST_LOG=${LOG_DIR}/run-${V2PROV_TEST_RUN_REGEX}-${SOME_K8S_VERSION}-${GENERATION}-${i}.log
            echo "Test start time: $(date)" >> ${TEST_LOG}
            V2PROV_TEST_DIST=${V2PROV_TEST_DIST} V2PROV_TEST_RUN_REGEX=${V2PROV_TEST_RUN_REGEX} SOME_K8S_VERSION=${SOME_K8S_VERSION} make provisioning-tests &>> ${LOG_DIR}/run-${V2PROV_TEST_RUN_REGEX}-${SOME_K8S_VERSION}-${GENERATION}-${i}.log
            echo "Test end time: $(date)" >> ${TEST_LOG}
        done
    done
done
popd

and requires a clone of the Rancher repo. It also assumes you are root on the machine, and that rancher is cloned to /root/rancher but this can easily be modified.

Expected behavior: RKE2 certificate rotation times don't take longer to complete.

Actual behavior: From v1.27.10+rke2r1 to v1.28.8+rke2r1 the average test iteration time increased by 2.7 minutes

Additional context / logs:

Screenshot 2024-07-02 at 1 10 04 PM

Test_Operation_SetA_Custom_CertificateRotation-v1.27-v1.28-Results.xlsx

brandond commented 4 months ago

The pattern you're showing here doesn't make any sense to me. If we look at the release dates:

date version average
2023-09-18 v1.27.6+rke2r1
v1.28.2+rke2r1 529
2023-10-31 v1.27.7+rke2r1
v1.28.3+rke2r1
2023-11-08 v1.27.7+rke2r2
v1.28.3+rke2r2
2023-12-05 v1.27.8+rke2r1
v1.28.4+rke2r1 536
2023-12-26 v1.27.9+rke2r1
v1.28.5+rke2r1
2024-02-06 v1.27.10+rke2r1 377
v1.28.6+rke2r1 531
2024-02-29 v1.27.11+rke2r1 375
v1.28.7+rke2r1
2024-03-20 v1.27.12+rke2r1 399
v1.28.8+rke2r1 540
2024-04-29 v1.27.13+rke2r1 431
v1.28.9+rke2r1
2024-05-22 v1.27.14+rke2r1 430
v1.28.10+rke2r1

First of all, this is hard to strictly compare because you appear to be somewhat arbitrarily picking the versions to test. There is some overlap in the middle, but by and large you've picked older 1.28 releases, and newer 1.27 releases. Some of the 1.28 releases which you show as being slower, are actually much older than the 1.27 releases that are faster. There are only two releases where we have data for both minors, and there appears to be no correlation at all in speed - if anything it suggests that there is something in Kubernetes 1.28 itself that is slower?

It looks like you've got this set up to collect logs from the runs, can you share those logs so that we can perhaps see where it's waiting?

cwayne18 commented 4 months ago

Just to ensure I'm getting the right context here, is this causing tests to timeout and fail? Or are we trying to make sure we maintain velocity? (Both are equally important I'm just trying to understand the effects better)

cwayne18 commented 4 months ago

@brandond it might be good if we (or QA) can do some runs against 1.29 releases as well to see if there's any difference there. My gut feel is that something is off with 1.28, but I've certainly been wrong before

snasovich commented 4 months ago

The pattern you're showing here doesn't make any sense to me. If we look at the release dates: ...

Looking at the table the main takeaway for me is that it's 1.28 itself where there is a noticeable slowdown of cert rotation operation (~20-25% on average). Combined with (much smaller) slowdown in newer patches of 1.27 the overall performance loss is >40%. So focusing on 1.28 "itself" (vs. monthly patches) changes could probably be a good start.

Just to ensure I'm getting the right context here, is this causing tests to timeout and fail? Or are we trying to make sure we maintain velocity? (Both are equally important I'm just trying to understand the effects better)

Yes, our timeout for executing single test is 600s and we've started hitting it quite a lot (see https://github.com/rancher/rancher/issues/45577) and temporarily pinned version to v1.27.10+rke2r1 to improve CI stability.

brandond commented 4 months ago

If it's in Kubernetes 1.28 itself there is not necessarily much we can do about it, but I'll try to investigate.

I'm trying to use the script from @Oats87 to test with the current code from the release/v2.9 branch and it appears something weird is going on... using both rke2 and k3s, the cluster fails to even come up, apparently because the initial etcd node is being configured to join itself (server points at its own IP), instead of being started with cluster-init: true?

cc968520a120:/go/src/github.com/rancher/rancher # kubectl get pod -A -o wide
NAMESPACE                         NAME                                      READY   STATUS      RESTARTS      AGE   IP           NODE         NOMINATED NODE   READINESS GATES
cattle-fleet-local-system         fleet-agent-0                             2/2     Running     0             12m   10.42.0.11   local-node   <none>           <none>
cattle-fleet-system               fleet-controller-6dd465cd9-x2gzz          3/3     Running     0             12m   10.42.0.6    local-node   <none>           <none>
cattle-fleet-system               gitjob-f696768db-5gtg5                    1/1     Running     0             12m   10.42.0.5    local-node   <none>           <none>
cattle-provisioning-capi-system   capi-controller-manager-9b9c8ccf8-9x568   1/1     Running     0             11m   10.42.0.15   local-node   <none>           <none>
cattle-system                     helm-operation-5rfht                      0/2     Completed   0             12m   10.42.0.4    local-node   <none>           <none>
cattle-system                     helm-operation-9lzgj                      0/2     Completed   0             12m   10.42.0.3    local-node   <none>           <none>
cattle-system                     helm-operation-cc2np                      0/2     Completed   0             11m   10.42.0.21   local-node   <none>           <none>
cattle-system                     helm-operation-gmcrp                      0/2     Completed   0             12m   10.42.0.9    local-node   <none>           <none>
cattle-system                     helm-operation-h4df9                      0/2     Completed   0             11m   10.42.0.14   local-node   <none>           <none>
cattle-system                     helm-operation-lm57n                      0/2     Completed   0             12m   10.42.0.12   local-node   <none>           <none>
cattle-system                     rancher-webhook-5f69889cd8-v2mht          1/1     Running     0             12m   10.42.0.13   local-node   <none>           <none>
default                           registry-cache                            1/1     Running     0             11m   10.42.0.17   local-node   <none>           <none>
kube-system                       coredns-576bfc4dc7-kxmjf                  1/1     Running     0             13m   10.42.0.2    local-node   <none>           <none>
test-ns-56jsf                     test-node-bspzv                           1/1     Running     2 (10m ago)   11m   10.42.0.18   local-node   <none>           <none>
test-ns-56jsf                     test-node-kjlwc                           1/1     Running     1 (10m ago)   11m   10.42.0.19   local-node   <none>           <none>
test-ns-56jsf                     test-node-p9hgz                           1/1     Running     0             11m   10.42.0.20   local-node   <none>           <none>

cc968520a120:/go/src/github.com/rancher/rancher # kubectl logs --tail=10 -n test-ns-56jsf test-node-kjlwc
Jul 03 16:53:13 test-node-kjlwc sh[6924]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Jul 03 16:53:13 test-node-kjlwc sh[6925]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jul 03 16:53:13 test-node-kjlwc (modprobe)[6926]: k3s.service: Executable /sbin/modprobe missing, skipping: No such file or directory
Jul 03 16:53:13 test-node-kjlwc (modprobe)[6927]: k3s.service: Executable /sbin/modprobe missing, skipping: No such file or directory
Jul 03 16:53:13 test-node-kjlwc k3s[6928]: time="2024-07-03T16:53:13Z" level=info msg="ETCD snapshots are disabled"
Jul 03 16:53:13 test-node-kjlwc k3s[6928]: time="2024-07-03T16:53:13Z" level=info msg="Starting k3s v1.27.6+k3s1 (bd04941a)"
Jul 03 16:53:13 test-node-kjlwc k3s[6928]: time="2024-07-03T16:53:13Z" level=fatal msg="starting kubernetes: preparing server: failed to get CA certs: Get \"https://10.42.0.19:6443/cacerts\": dial tcp 10.42.0.19:6443: connect: connection refused"
Jul 03 16:53:13 test-node-kjlwc systemd[1]: k3s.service: Main process exited, code=exited, status=1/FAILURE
Jul 03 16:53:13 test-node-kjlwc systemd[1]: k3s.service: Failed with result 'exit-code'.
Jul 03 16:53:13 test-node-kjlwc systemd[1]: Failed to start Lightweight Kubernetes.

cc968520a120:/go/src/github.com/rancher/rancher # kubectl exec -it -n test-ns-56jsf test-node-kjlwc /bin/bash
test-node-kjlwc:/ # k3s --version
k3s version v1.27.6+k3s1 (bd04941a)
go version go1.20.8

test-node-kjlwc:/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0@if21: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 22:26:e8:c5:ad:1e brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.0.19/24 brd 10.42.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::2026:e8ff:fec5:ad1e/64 scope link
       valid_lft forever preferred_lft forever

test-node-kjlwc:/ # cat /etc/rancher/k3s/config.yaml.d/50-rancher.yaml
{
  "agent-token": "p7sqslqkkd9bmgnjd9bvqfwhjlzddhxz6k6zk8tqxxmq6w4xk5wznf",
  "cluster-cidr": "10.44.0.0/16",
  "disable-apiserver": true,
  "disable-controller-manager": true,
  "disable-scheduler": true,
  "etcd-disable-snapshots": true,
  "node-label": [
    "cattle.io/os=linux",
    "rke.cattle.io/machine=cb4ed48d-3245-4b20-8ca8-0193e69ac924"
  ],
  "node-taint": [
    "node-role.kubernetes.io/etcd:NoExecute"
  ],
  "private-registry": "/etc/rancher/k3s/registries.yaml",
  "server": "https://10.42.0.19:6443",
  "service-cidr": "10.45.0.0/16",
  "token": "pk4q4nc8xsq7hpkttsj7456rndfjl9tz46zxwdsfsxm2ztdfn9d2bq"
}

The test data bundle from the logfile is available here: https://gist.github.com/brandond/2dd31e8be6a52c863ebcc10681aaf5f4

brandond commented 4 months ago

I don't know if my host doesn't have enough memory or what, but my systemd-node pods keep getting killed, and when this happens they come up without the state from their previous run so everything gets very confused. This host has 32GB of RAM so it should be fine...

d69d03b2bff6:/go/src/github.com/rancher/rancher # kubectl get pod -A -l custom-cluster-name=test-custom-certificate-rotation-operations
NAMESPACE       NAME              READY   STATUS    RESTARTS        AGE
test-ns-s25qm   test-node-87tn8   1/1     Running   0               10m
test-ns-s25qm   test-node-8zr29   1/1     Running   2 (9m36s ago)   10m
test-ns-s25qm   test-node-9stph   1/1     Running   1 (9m37s ago)   10m

d69d03b2bff6:/go/src/github.com/rancher/rancher # kubectl describe pod -n test-ns-s25qm   test-node-8zr29
Name:             test-node-8zr29
Namespace:        test-ns-s25qm
Priority:         0
Service Account:  default
Node:             local-node/172.17.0.9
Start Time:       Wed, 03 Jul 2024 17:16:06 +0000
Labels:           custom-cluster-name=test-custom-certificate-rotation-operations
Annotations:      <none>
Status:           Running
IP:               10.42.0.18
IPs:
  IP:  10.42.0.18
Containers:
  container:
    Container ID:   containerd://f994b3c813b3475af2fb05ca38d18a81ebfbeb7a1fd10602ab4b1845d897056f
    Image:          rancher/systemd-node:v0.0.5-rc2
    Image ID:       docker.io/rancher/systemd-node@sha256:370897b78f5d3e8841b446017474f24e2a2530c26686971aafd4de1cbf792c61
    Port:           <none>
    Host Port:      <none>
    State:          Running
      Started:      Wed, 03 Jul 2024 17:17:10 +0000
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Wed, 03 Jul 2024 17:16:37 +0000
      Finished:     Wed, 03 Jul 2024 17:16:54 +0000
    Ready:          True
    Restart Count:  2
    Environment:    <none>
    Mounts:
      /etc/default/k3s from systemd (rw,path="disable")
      /etc/default/k3s-agent from systemd (rw,path="disable")
      /etc/default/rancher-system-agent from systemd (rw,path="disable")
      /etc/default/rke2-agent from systemd (rw,path="disable")
      /etc/default/rke2-server from systemd (rw,path="disable")
      /var/lib/cloud/seed/nocloud/user-data from seed (rw,path="user-data")
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       True
  ContainersReady             True
  PodScheduled                True
Volumes:
  seed:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      test-node-data-6mj5s
    Optional:  false
  systemd:
    Type:        ConfigMap (a volume populated by a ConfigMap)
    Name:        test-node-configure-systemd-cz28k
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  10m                  default-scheduler  Successfully assigned test-ns-s25qm/test-node-8zr29 to local-node
  Warning  BackOff    10m                  kubelet            Back-off restarting failed container container in pod test-node-8zr29_test-ns-s25qm(37d00b0f-94c0-4014-9154-02ffe6279f4e)
  Normal   Pulled     9m55s (x3 over 10m)  kubelet            Container image "rancher/systemd-node:v0.0.5-rc2" already present on machine
  Normal   Created    9m55s (x3 over 10m)  kubelet            Created container container
  Normal   Started    9m55s (x3 over 10m)  kubelet            Started container container
brandond commented 4 months ago

@Oats87 I think I'm going to need your help to collect more data on this, since I don't seem to be able to get your test script running locally for whatever reason. Is there any way we can modify the test harness to collect logs from from the systemd-node pods once the test has been completed, so that we can extract timing data from the different stages of the certificate rotation process and figure out where the delays are coming from?

Oats87 commented 4 months ago

@brandond what is your underlying OS? unfortunately provisioning tests are not compatible with unified cgroups, which is why I'm using ubuntu 18.04

brandond commented 4 months ago

ah yeah I'm on 24.04 with cgroups v2; I'll have to spin up an older VM if it requires v1.

brandond commented 4 months ago

Alright, seems to be working on a 18.04 VM. I am running tests with this commit added to dump the pod logs: https://github.com/brandond/rancher/commit/90f330aa56634206fa512e08aa8b09cc66e99dfe

brandond commented 4 months ago

Currently testing with

VERSIONS="v1.27.12+k3s1 v1.27.13+k3s1 v1.27.14+k3s1 v1.28.8+k3s1 v1.28.9+k3s1 v1.28.10+k3s1"

It'll take a while to run but unsurprisingly K3s is a good bit faster than RKE2 so far:

run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-1.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (154.85s)
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-2.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (150.27s)
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-3.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (167.95s)
run-Test_Operation_SetA_Custom_CertificateRotation-v1.27.12+k3s1-0-4.log:--- PASS: Test_Operation_SetA_Custom_CertificateRotation (156.41s)

@Oats87 is your recorded time the time reported by the test itself as shown above, or the full delta between the Test start time and Test end time entries in the log output?

brandond commented 4 months ago

I left some tests running yesterday during the holiday, and after a number of runs for each version on each distro, came up with the following averages for the time reported by the PASS: Test_Operation_SetA_Custom_CertificateRotation log output.

Version AVERAGE STDEV MAX
v1.27.12+k3s1 157 6 169
v1.27.13+k3s1 160 10 184
v1.27.14+k3s1 159 8 177
v1.28.8+k3s1 155 6 167
v1.28.9+k3s1 157 5 165
v1.28.10+k3s1 156 5 165
v1.27.12+rke2r1 443 27 499
v1.27.13+rke2r1 487 59 612
v1.27.14+rke2r1 464 47 598
v1.28.8+rke2r1 639 173 999
v1.28.9+rke2r1 626 97 856
v1.28.10+rke2r1 633 102 839

image

brandond commented 4 months ago

If we pick on the run with the longest time (999 seconds) we can see that even in this case the actual rotation itself takes less than a second: run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-5.log

        Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
        Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
        Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
        Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
--- PASS: Test_Operation_SetA_Custom_CertificateRotation (999.37s)

We see similar timing for the rotation on the shortest run: run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-2.log

        Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
        Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
        Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
        Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
--- PASS: Test_Operation_SetA_Custom_CertificateRotation (513.15s)

Lets compare the full logs for the slow cluster to the fast cluster run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-5.log

Jul 05 14:04:40 test-node-brpwr systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 14:04:59 test-node-brpwr rke2[381]: time="2024-07-05T14:04:59Z" level=info msg="ETCD server is now running"
Jul 05 14:04:59 test-node-brpwr rke2[381]: time="2024-07-05T14:04:59Z" level=info msg="Waiting for API server to become available"
Jul 05 14:04:59 test-node-brpwr systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:05:01 test-node-brpwr rke2[381]: time="2024-07-05T14:05:01Z" level=info msg="Pod for etcd is synced"
Jul 05 14:05:11 test-node-brpwr rke2[381]: time="2024-07-05T14:05:11Z" level=info msg="Waiting for API server to become available"
Jul 05 14:05:15 test-node-5vpdb rke2[377]: time="2024-07-05T14:05:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:05:15 test-node-5vpdb systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 14:05:41 test-node-5vpdb rke2[377]: time="2024-07-05T14:05:41Z" level=info msg="Kube API server is now running"
Jul 05 14:05:41 test-node-brpwr rke2[381]: time="2024-07-05T14:05:41Z" level=info msg="Waiting for API server to become available"
Jul 05 14:05:44 test-node-5vpdb systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:07:57 test-node-c9thj systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 14:08:07 test-node-c9thj systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:09:11 test-node-brpwr systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 05 14:09:11 test-node-brpwr rke2[381]: time="2024-07-05T14:09:11Z" level=fatal msg="context canceled"
Jul 05 14:09:11 test-node-brpwr systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 14:09:11 test-node-brpwr systemd[1]: Starting Rancher Kubernetes Engine v2 (server)..
Jul 05 14:09:59 test-node-brpwr rke2[2781]: time="2024-07-05T14:09:59Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:10:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:10:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:19Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:10:26 test-node-5vpdb rke2[377]: time="2024-07-05T14:10:26Z" level=fatal msg="leaderelection lost for rke2-etcd"
Jul 05 14:10:31 test-node-5vpdb systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 14:10:31 test-node-5vpdb systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 14:10:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:39Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:10:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:10:59 test-node-brpwr rke2[2781]: time="2024-07-05T14:10:59Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:11:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:11:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:19Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:11:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:11:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:11:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:39Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 14:11:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:11:59 test-node-brpwr rke2[2781]: time="2024-07-05T14:11:59Z" level=info msg="Pod for etcd is synced"
Jul 05 14:12:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:08 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:08Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:38 test-node-5vpdb rke2[5840]: time="2024-07-05T14:12:38Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:43 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:43Z" level=info msg="ETCD server is now running"
Jul 05 14:12:43 test-node-brpwr rke2[2781]: time="2024-07-05T14:12:43Z" level=info msg="Waiting for API server to become available"
Jul 05 14:12:43 test-node-brpwr systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:12:43 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:12:43Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 14:12:44 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:12:44Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 14:12:44 test-node-c9thj systemd[1]: Stopping Rancher Kubernetes Engine v2 (agent)...
Jul 05 14:12:44 test-node-c9thj rke2[376]: time="2024-07-05T14:12:44Z" level=fatal msg="context canceled"
Jul 05 14:12:44 test-node-c9thj systemd[1]: Stopped Rancher Kubernetes Engine v2 (agent).
Jul 05 14:12:44 test-node-c9thj systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 14:13:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:09 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:15 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:39 test-node-5vpdb rke2[5840]: time="2024-07-05T14:13:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:13:45 test-node-brpwr rke2[2781]: time="2024-07-05T14:13:45Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:09 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:15 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:39 test-node-5vpdb rke2[5840]: time="2024-07-05T14:14:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:14:45 test-node-brpwr rke2[2781]: time="2024-07-05T14:14:45Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:09 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:17 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:17Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:41 test-node-5vpdb rke2[5840]: time="2024-07-05T14:15:41Z" level=info msg="Waiting for API server to become available"
Jul 05 14:15:47 test-node-brpwr rke2[2781]: time="2024-07-05T14:15:47Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:11 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:11Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:19Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:43 test-node-5vpdb rke2[5840]: time="2024-07-05T14:16:43Z" level=info msg="Waiting for API server to become available"
Jul 05 14:16:49 test-node-brpwr rke2[2781]: time="2024-07-05T14:16:49Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:13 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:13Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:19 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:19Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:43 test-node-5vpdb rke2[5840]: time="2024-07-05T14:17:43Z" level=info msg="Waiting for API server to become available"
Jul 05 14:17:51 test-node-brpwr rke2[2781]: time="2024-07-05T14:17:51Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:04 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:04Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:09 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:09Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:15 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:15Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:22 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:22Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:34 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:34Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:39 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:39Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:45 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:45Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:52 test-node-brpwr rke2[2781]: time="2024-07-05T14:18:52Z" level=info msg="Waiting for API server to become available"
Jul 05 14:18:53 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:53Z" level=info msg="Kube API server is now running"
Jul 05 14:18:55 test-node-5vpdb systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:18:59 test-node-c9thj systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 14:18:59 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:18:59 test-node-5vpdb systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 05 14:18:59 test-node-5vpdb rke2[5840]: time="2024-07-05T14:18:59Z" level=fatal msg="context canceled"
Jul 05 14:19:00 test-node-5vpdb systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 14:19:00 test-node-5vpdb systemd[1]: Starting Rancher Kubernetes Engine v2 (server)..
Jul 05 14:20:20 test-node-5vpdb rke2[7753]: time="2024-07-05T14:20:20Z" level=info msg="Waiting for API server to become available"
Jul 05 14:20:38 test-node-5vpdb rke2[7753]: time="2024-07-05T14:20:38Z" level=info msg="Kube API server is now running"
Jul 05 14:20:40 test-node-5vpdb systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 14:20:40 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:20:40Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"

We see the same behavior (failure to renew lease on the control-plane server) on the faster run: run-Test_Operation_SetA_Custom_CertificateRotation-v1.28.8+rke2r1-2-2.log

Jul 05 13:17:09 test-node-mjl4s systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:17:29 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:29Z" level=info msg="Waiting for API server to become available"
Jul 05 13:17:29 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:29Z" level=info msg="ETCD server is now running"
Jul 05 13:17:29 test-node-mjl4s systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:17:29 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:29Z" level=info msg="Pod for etcd is synced"
Jul 05 13:17:39 test-node-mjl4s rke2[381]: time="2024-07-05T13:17:39Z" level=info msg="Waiting for API server to become available"
Jul 05 13:17:44 test-node-tcgjt rke2[378]: time="2024-07-05T13:17:44Z" level=info msg="Waiting for API server to become available"
Jul 05 13:17:44 test-node-tcgjt systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:18:09 test-node-mjl4s rke2[381]: time="2024-07-05T13:18:09Z" level=info msg="Waiting for API server to become available"
Jul 05 13:18:09 test-node-tcgjt rke2[378]: time="2024-07-05T13:18:09Z" level=info msg="Kube API server is now running"
Jul 05 13:18:12 test-node-tcgjt systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:20:11 test-node-bljrc systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 13:20:22 test-node-bljrc systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 13:20:56 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:20:56Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 13:20:56 test-node-bljrc systemd[1]: Stopping Rancher Kubernetes Engine v2 (agent)...
Jul 05 13:20:56 test-node-bljrc rke2[374]: time="2024-07-05T13:20:56Z" level=fatal msg="context canceled"
Jul 05 13:20:56 test-node-bljrc systemd[1]: Stopped Rancher Kubernetes Engine v2 (agent).
Jul 05 13:20:56 test-node-bljrc systemd[1]: Starting Rancher Kubernetes Engine v2 (agent)...
Jul 05 13:21:01 test-node-bljrc systemd[1]: Started Rancher Kubernetes Engine v2 (agent).
Jul 05 13:21:01 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:21:01 test-node-mjl4s systemd[1]: Stopping Rancher Kubernetes Engine v2 (server)...
Jul 05 13:21:01 test-node-mjl4s rke2[381]: time="2024-07-05T13:21:01Z" level=fatal msg="context canceled"
Jul 05 13:21:01 test-node-mjl4s systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 13:21:01 test-node-mjl4s systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:21:47 test-node-mjl4s rke2[1798]: time="2024-07-05T13:21:47Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:21:57 test-node-mjl4s rke2[1798]: time="2024-07-05T13:21:57Z" level=info msg="Waiting for API server to become available"
Jul 05 13:22:07 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:07Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:22:16 test-node-tcgjt rke2[378]: time="2024-07-05T13:22:16Z" level=fatal msg="leaderelection lost for rke2-etcd"
Jul 05 13:22:21 test-node-tcgjt systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 13:22:21 test-node-tcgjt systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:22:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:27Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:22:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:27Z" level=info msg="Waiting for API server to become available"
Jul 05 13:22:47 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:47Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:22:57 test-node-mjl4s rke2[1798]: time="2024-07-05T13:22:57Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:07 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:07Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:23:24 test-node-tcgjt rke2[6448]: time="2024-07-05T13:23:24Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:27Z" level=info msg="Pod for etcd not synced (pod sandbox not found), retrying"
Jul 05 13:23:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:27Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:47 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:47Z" level=info msg="Pod for etcd is synced"
Jul 05 13:23:54 test-node-tcgjt rke2[6448]: time="2024-07-05T13:23:54Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:54 test-node-tcgjt rke2[6448]: time="2024-07-05T13:23:54Z" level=info msg="Waiting for API server to become available"
Jul 05 13:23:57 test-node-mjl4s rke2[1798]: time="2024-07-05T13:23:57Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:22 test-node-mjl4s rke2[1798]: time="2024-07-05T13:24:22Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:22 test-node-mjl4s systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:24:22 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate]"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/rotate 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b b050f2ef23a1468ca6fa1689d5742f62253e863f05942834002032bc088378e3 rke2 /var/lib/rancher/capr certificate rotate] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:24:22 test-node-tcgjt rke2[6448]: time="2024-07-05T13:24:22Z" level=fatal msg="Failed to start kube-proxy: open /var/lib/rancher/rke2/agent/client-kube-proxy.crt: no such file or directory"
Jul 05 13:24:22 test-node-tcgjt rke2[6448]: time="2024-07-05T13:24:22Z" level=fatal msg="open /var/lib/rancher/rke2/server/tls/client-scheduler.crt: no such file or directory"
Jul 05 13:24:23 test-node-tcgjt systemd[1]: Stopped Rancher Kubernetes Engine v2 (server).
Jul 05 13:24:23 test-node-tcgjt systemd[1]: Starting Rancher Kubernetes Engine v2 (server)...
Jul 05 13:24:27 test-node-mjl4s rke2[1798]: time="2024-07-05T13:24:27Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:43 test-node-tcgjt rke2[6915]: time="2024-07-05T13:24:43Z" level=info msg="Waiting for API server to become available"
Jul 05 13:24:49 test-node-tcgjt rke2[6915]: time="2024-07-05T13:24:49Z" level=info msg="Kube API server is now running"
Jul 05 13:25:03 test-node-tcgjt systemd[1]: Started Rancher Kubernetes Engine v2 (server).
Jul 05 13:25:03 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:25:03Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
brandond commented 4 months ago

Moving this to a separate comment for emphasis:

It appears that the root cause of the delays is that v2prov is not following the documented order for certificate rotation. On the slow cluster it is rotating etcd, then agent, then control-plane, and the agent restart takes forever because both the control-plane and agent nodes are experiencing an outage due to the datastore going unavailable. On the fast cluster it is rotating agent, then etcd, then control-plane, and the agent restarts quickly although it is ineffective because the server hasn't been rotated yet. Looking at the logs from many different runs, it appears the order is random?

# slow cluster
Jul 05 14:09:11 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:09:11Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:12:43 test-node-brpwr rancher-system-agent[233]: time="2024-07-05T14:12:43Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 14:12:44 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:12:44Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 14:18:59 test-node-c9thj rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 14:18:59 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:18:59Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 14:20:40 test-node-5vpdb rancher-system-agent[233]: time="2024-07-05T14:20:40Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"

# fast cluster
Jul 05 13:20:56 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:20:56Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent]"
Jul 05 13:21:01 test-node-bljrc rancher-system-agent[232]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-agent] finished with err: <nil> and exit code: 0"
Jul 05 13:21:01 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:21:01Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:24:22 test-node-mjl4s rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"
Jul 05 13:24:22 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:24:22Z" level=info msg="[Applyinator] Running command: /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server]"
Jul 05 13:25:03 test-node-tcgjt rancher-system-agent[233]: time="2024-07-05T13:25:03Z" level=info msg="[Applyinator] Command /bin/sh [-x /var/lib/rancher/capr/idempotence/idempotent.sh certificate-rotation/restart-restart 6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b ff4b294dcb57179cad146355dcfdca5fb7f516963b73f25f09171ec6d2fb0954 systemctl /var/lib/rancher/capr restart rke2-server] finished with err: <nil> and exit code: 0"

Server nodes must ALWAYS be rotated before agents, and the etcd nodes should be rotated before control-plane nodes. The docs at https://docs.rke2.io/advanced#certificate-rotation cover the server->agent node order requirement, although they do not express a preference for etcd/control-plane in the case of split-role clusters. It is also rotating the certificates then restarting the service, which is not what the docs say to do. The service should be stopped, the certificates rotated, then the service started again. I don't know that we have extensively validated simply rotating then restarting.

I will also note that the v2prov tests are using an unsupported configuration with a single etcd node and a single control-plane node. In this configuration the restart of the single etcd node frequently takes down both the control-plane and agent nodes, which is why the test takes so long to complete.

K3s is faster because k3s does everything faster, and the control-plane node doesn't have time to crash and take down the agent, as etcd comes back quickly enough to avoid an outage that causes the controller lease to expire.

I don't think it's worth investigating why the 1.27 kubelet is faster to reconcile the static pods than 1.28; I suspect it has to do with upstream changes to the sync loop that were intended to fix issues with volume attachment. If we can get v2prov changed to observe the correct order of operations, and the tests changed to use a supported cluster configuration, and we still see issues... then we can try to optimize the timing.

brandond commented 4 months ago

Created rancher/rancher issue for tracking: https://github.com/rancher/rancher/issues/46034