Attempts to replicate the upgrade issue using a Workload Cluster have failed.
Using as similar values as possible, I've tried the same upgrade from 0.18.0 to 0.20.2 that caused the issues last week with the MC and noticed a couple of differences:

This makes me think the issue may be related to some TTL expiring or similar, so I have created several test WCs to leave for a few days before attempting the same.
EDIT: The worker nodes rolling is because I mistakenly upgraded to 0.20.2 (which includes the change to the Ubuntu version), whereas last week I performed an upgrade from 0.18.0 -> 0.20.1. Performing the same upgrade on a WC does just roll the control plane nodes, as expected.
Some implementation details discovered while debugging:
- A controller in `kube-controller-manager` acts on ServiceAccounts and secrets - watching for deletions of each and cleaning up the references in the other. This does not manage generating / refreshing of tokens.
- The kubelet handles the pod side (resolving `.spec.serviceAccountName` and handling the token volume mount).
- Tokens for the `serviceAccountToken` projected volume set up by the kubelet are requested via the TokenRequest API. This API creates tokens that expire either when the pod is deleted or after a defined lifespan that is defaulted to 1 hour.
- The same API backs the `kubectl create token` command.

Based on the above, I believe the issue lies in the kubelet. Unfortunately we don't have logs persisted from the kubelet so can't investigate until we can replicate the issue. When we next perform an MC upgrade we should tail the kubelet logs on each of the (old and new) control plane nodes to look out for relevant error messages.
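As a concrete reference for the token behaviour described above, here is a minimal sketch of exercising it from the client side (the `default` service account and `<pod-name>` are placeholders, not taken from the affected clusters):

```bash
# Mint a token via the TokenRequest API - the same mechanism the kubelet uses
# for serviceAccountToken projected volumes - with an explicit 1h lifetime.
kubectl create token default --namespace kube-system --duration=1h

# Inspect the token a running pod actually has mounted; this is the standard
# projected-volume path, so its contents should be refreshed by the kubelet
# well before expiry if token rotation is working correctly.
kubectl exec -n kube-system <pod-name> -- \
  cat /var/run/secrets/kubernetes.io/serviceaccount/token
```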
The current kubelet logs available on both the tornado MC (where the issue presented last week, but which now has new control plane nodes) and the keep1 WC I created yesterday don't show anything out of the ordinary.
keep1:
- Cluster age: 22h
- Action performed: Upgrade App from 0.18.0 to 0.20.1
- Upgraded cleanly: Yes
- Notes: The apiserver logs contained a small number of authentication errors (below) but they were temporary and it doesn't seem to have affected the health of the cluster at all.
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:38.634005 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:41.378151 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:43.636165 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:46.380304 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
While this error is similar to what was seen during the previous MC upgrades, it wasn't printed anywhere near as often, and the error includes an additional note of `service account token has been invalidated`.
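For the next upgrade, a quick way to gauge how widespread these authentication errors are across the apiserver instances - a sketch, assuming the standard kubeadm `component=kube-apiserver` label on the mirror pods:

```bash
# Count recent "Unable to authenticate" errors per kube-apiserver instance.
kubectl logs -n kube-system -l component=kube-apiserver \
  --prefix --since=1h --tail=10000 \
  | grep "Unable to authenticate the request" \
  | cut -d' ' -f1 | sort | uniq -c
```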
keep2:
- Cluster age: 2d14h
- Action performed: Upgrade App from 0.18.0 to 0.20.1
- Upgraded cleanly: Yes
- Notes: Some similar errors in the logs to keep1 but nothing persistent.
kube-apiserver-ip-10-0-80-61.eu-west-2.compute.internal kube-apiserver E1215 08:03:37.736541 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-80-61.eu-west-2.compute.internal kube-apiserver E1215 08:03:42.738910 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-78-60.eu-west-2.compute.internal kube-apiserver E1215 08:03:21.578948 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, context deadline exceeded]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:39.771686 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:44.774180 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:47.800197 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:47.800999 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:47.801661 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
keep2 - take 2:
- Cluster age: 2d14h
- Action performed: Deleted the keep2-sa secret from the MC, then upgraded from 0.20.1 to 0.20.2
- Upgraded cleanly: No
- Notes: This upgrade causes all nodes to cycle, including workers. Some new logs in this situation:
kube-apiserver-ip-10-0-252-80.eu-west-2.compute.internal kube-apiserver E1215 08:16:43.535948 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
kube-apiserver-ip-10-0-252-80.eu-west-2.compute.internal kube-apiserver E1215 08:16:43.535934 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
kube-apiserver-ip-10-0-252-80.eu-west-2.compute.internal kube-apiserver E1215 08:16:47.776249 1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
These logs are repeated a lot.
This causes issues similar to what we've seen during upgrades, but in a very different way and with different logs. I think it's safe to say the `[cluster]-sa` secret didn't change during the faulty upgrades.
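One way to back that up on the next faulty upgrade would be to snapshot the signing material before and after the roll - a sketch; `<cluster>` and `<namespace>` are placeholders, the secret lives on the MC, and the JWKS is served by the cluster being upgraded:

```bash
# On the MC: hash the service-account key secret for the cluster being upgraded.
kubectl get secret <cluster>-sa -n <namespace> -o jsonpath='{.data}' | sha256sum > sa-secret.before

# Against the cluster being upgraded: hash the signing keys the apiserver publishes.
kubectl get --raw /openid/v1/jwks | sha256sum > jwks.before

# Repeat both after the control plane has rolled and compare; any difference
# would mean old tokens fail signature verification ("error in cryptographic
# primitive"), matching the keep2 take-2 behaviour.
kubectl get secret <cluster>-sa -n <namespace> -o jsonpath='{.data}' | sha256sum | diff sa-secret.before -
kubectl get --raw /openid/v1/jwks | sha256sum | diff jwks.before -
```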
We hit this issue again today with the upgrade of goat.

Root cause still unknown, but I managed to rule out some possible reasons:
- The `service-account-token` secrets don't change during the upgrade, so the service accounts still have the correct token assigned to them.
- The `goat-ca` and `goat-sa` secrets didn't change, so the keys used to generate tokens remain the same.

Some differences seen compared to tests performed on workload clusters:
The kubelet logs on the first control plane node being removed show etcd crash looping because it has been removed from the etcd members. I would assume this also happens with WCs, but I believe they are deleted quicker so it doesn't get to this point.
Dec 16 09:25:01 ip-10-224-128-148 kubelet[1401]: I1216 09:25:01.699484 1401 scope.go:110] "RemoveContainer" containerID="84522cbabffe063f63f45729ccd5fd1e841bc29f640678ecad0ebe3ff448ae3a"
Dec 16 09:25:04 ip-10-224-128-148 kubelet[1401]: I1216 09:25:04.707774 1401 scope.go:110] "RemoveContainer" containerID="84522cbabffe063f63f45729ccd5fd1e841bc29f640678ecad0ebe3ff448ae3a"
Dec 16 09:25:04 ip-10-224-128-148 kubelet[1401]: I1216 09:25:04.708189 1401 scope.go:110] "RemoveContainer" containerID="1e5118c36253eb247e1a42db466f5c10854c90410a47ada01c9ee5a7821c3459"
Dec 16 09:25:04 ip-10-224-128-148 kubelet[1401]: E1216 09:25:04.708633 1401 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 10s restarting failed container=etcd pod=etcd-ip-10-224-128-148.eu-north-1.compute.internal_kube-system(8f8e2f81e748ac114db703b7124de587)\"" pod="kube-system/etcd-ip-10-224-128-148.eu-north-1.compute.internal" podUID=8f8e2f81e748ac114db703b7124de587
Dec 16 09:25:05 ip-10-224-128-148 kubelet[1401]: I1216 09:25:05.711469 1401 scope.go:110] "RemoveContainer" containerID="1e5118c36253eb247e1a42db466f5c10854c90410a47ada01c9ee5a7821c3459"
Dec 16 09:25:05 ip-10-224-128-148 kubelet[1401]: E1216 09:25:05.711814 1401 pod_workers.go:951] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"etcd\" with CrashLoopBackOff: \"back-off 10s restarting failed container=etcd pod=etcd-ip-10-224-128-148.eu-north-1.compute.internal_kube-system(8f8e2f81e748ac114db703b7124de587)\"" pod="kube-system/etcd-ip-10-224-128-148.eu-north-1.compute.internal" podUID=8f8e2f81e748ac114db703b7124de587
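If we want to confirm the member removal next time, membership can be listed from one of the surviving etcd pods - a sketch using the standard kubeadm certificate paths; `etcd-<surviving-node>` is a placeholder:

```bash
# List current etcd members from a healthy etcd pod; the node being drained
# should no longer appear, which would explain its local etcd crash looping.
kubectl -n kube-system exec etcd-<surviving-node> -- etcdctl member list -w table \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```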
The syslog from the first new control plane node includes the following:
Dec 16 09:24:34 ip-10-224-148-177 cloud-init[1078]: [2022-12-16 09:24:15] [certs] Generating "apiserver-kubelet-client" certificate and key
Dec 16 09:24:34 ip-10-224-148-177 cloud-init[1078]: [2022-12-16 09:24:16] [kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"
Dec 16 09:24:34 ip-10-224-148-177 cloud-init[1078]: [2022-12-16 09:24:16] [kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"
Dec 16 09:24:34 ip-10-224-148-177 cloud-init[1078]: [2022-12-16 09:24:16] [kubelet-start] Starting the kubelet
Dec 16 09:24:34 ip-10-224-148-177 cloud-init[1078]: [2022-12-16 09:24:16] [kubelet-start] Waiting for the kubelet to perform the TLS Bootstrap...
which is shortly followed by:
Dec 16 09:24:47 ip-10-224-148-177 kubelet[1484]: I1216 09:24:47.034920 1484 kubelet.go:1724] "Trying to delete pod" pod="kube-system/kube-apiserver-ip-10-224-148-177.eu-north-1.compute.internal" podUID=958cc32c-5119-48db-82a8-a03f90f6a89e
Dec 16 09:24:47 ip-10-224-148-177 kubelet[1484]: I1216 09:24:47.060101 1484 kubelet.go:1729] "Deleted mirror pod because it is outdated" pod="kube-system/kube-apiserver-ip-10-224-148-177.eu-north-1.compute.internal"
It looks like kubelet is set up before the apiserver has had the placeholder replaced.
I tried `systemctl restart kubelet` but that didn't seem to help.
There are many errors in the logs of the new control plane node related to not being able to access the api server ELB:
Dec 16 09:25:23 ip-10-224-148-177 kubelet[1484]: E1216 09:25:23.257848 1484 controller.go:187] failed to update lease, error: Put "https://internal-goat-apiserver-18847525199.eu-north-1.elb.amazonaws.com:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/ip-10-224-148-177.eu-north-1.compute.internal?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
This is possibly caused by the ELB including non-ready instances (such as the currently initialising node) which can't respond to API requests.
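To confirm that theory during the next roll, the registered instance health of the apiserver ELB could be checked while the new node is initialising - a sketch; the load balancer name below is a guess based on the DNS name in the error above:

```bash
# Show which registered instances the apiserver ELB considers InService vs
# OutOfService; if the still-initialising node is already receiving traffic,
# that would explain the client timeouts seen from the new kubelet.
aws elb describe-instance-health \
  --load-balancer-name goat-apiserver \
  --region eu-north-1
```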
keep3:
- Cluster age: 3d18h
- Action performed: Upgrade App from 0.18.0 to 0.20.1
- Upgraded cleanly: Yes
- Notes: Checking the last of the possible default TTLs (72 hours). No issues.
keep4:
- Cluster age: 3d20h
- Action performed: Upgrade App from 0.18.0 to 0.20.1
- Upgraded cleanly: Yes
- Notes: Installed Kyverno, kyverno-policies, keep4-kyverno-policies-connectivity & keep4-kyverno-policies-dx. Kyverno didn't cause any issue with the upgrade.
I have been unable to replicate the issue using workload clusters. Whatever is causing the issue seems to only be present on management clusters.

It is also entirely possible that whatever caused the problem is already fixed in the latest versions of cluster-aws and default-apps-aws. We won't know for sure until we next upgrade an MC.
Plan for testing MC upgrades:
- Identify a suitable MC to test on (e.g. grizzly). If no current MCs fit this then we may want to create a new ephemeral MC for testing this issue.
- Prepare a version of cluster-aws that introduces a change to the control plane without changing any functionality. The easiest way I can think of to ensure this is to add a new empty file to the files property of the KubeadmControlPlane (sketched below). This will cause only the control plane nodes to be replaced, but with functionally identical config.
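As a sketch of what that trigger amounts to (the real change would go through the cluster-aws chart; the file path, resource name and namespace here are placeholders, and the patch assumes spec.kubeadmConfigSpec.files already exists as a list):

```bash
# Append a harmless extra file to the KubeadmControlPlane spec; any change to
# the spec rolls the control plane machines while keeping them functionally
# identical, which isolates the node-roll from any real config change.
kubectl patch kubeadmcontrolplane <cluster>-control-plane -n <namespace> --type=json -p '[
  {
    "op": "add",
    "path": "/spec/kubeadmConfigSpec/files/-",
    "value": {
      "path": "/etc/upgrade-test-trigger",
      "permissions": "0644",
      "content": ""
    }
  }
]'
```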
We (@tuladhar and @bdehri) upgraded grizzly and golem, and didn't encounter this issue. Closing it for now, but if we encounter the issue in future we'll re-open and investigate further.
Need to investigate exactly how the `service-account-issuer` flag works in api-server when using a custom URL.

Upgrades to CAPA clusters are very unstable due to pods becoming unauthorized during rolling of control plane nodes. We need to understand why the existing service account tokens become invalid during this process even though the URL used for the issuer doesn't change.
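For reference while investigating, the relevant kube-apiserver flags fit together roughly like this (the issuer URLs are illustrative, the key paths are the kubeadm defaults):

```bash
# --service-account-issuer may be specified multiple times: tokens are minted
# with the first value, but tokens bearing any of the listed issuers are still
# accepted, which is the intended migration path when an issuer URL changes.
kube-apiserver \
  --service-account-issuer=https://irsa.example-mc.example.com \
  --service-account-issuer=https://kubernetes.default.svc.cluster.local \
  --service-account-key-file=/etc/kubernetes/pki/sa.pub \
  --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
```

Given that behaviour, an unchanged issuer URL and unchanged signing key should not invalidate existing tokens, which is exactly why the symptoms above are surprising.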