giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Investigate how service-account-issuer works #1766

Closed AverageMarcus closed 1 year ago

AverageMarcus commented 1 year ago

Need to investigate exactly how the service-account-issuer flag works in the kube-apiserver when using a custom URL.

Upgrades to CAPA clusters are very unstable due to pods becoming unauthorized while the control plane nodes roll. We need to understand why the existing service account tokens become invalid during this process even though the URL used for the issuer doesn't change.
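For reference, the issuer URL configured via the kube-apiserver's `--service-account-issuer` flag is embedded in every issued token as the `iss` claim, alongside `aud` and `exp`. A minimal sketch (Python, stdlib only, not anything from this issue) for decoding a token's claims without verifying the signature, e.g. a token read from `/var/run/secrets/kubernetes.io/serviceaccount/token` inside a pod, could help confirm whether the issuer in the rejected tokens actually matches the configured URL:

```python
# Sketch: print the header and key claims of a service account JWT read from stdin.
# Purely illustrative; no signature verification is performed.
import base64
import json
import sys


def decode_segment(segment: str) -> dict:
    # JWT segments are base64url-encoded without padding; re-add it before decoding.
    padded = segment + "=" * (-len(segment) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))


def inspect_token(token: str) -> None:
    header_b64, payload_b64, _signature = token.strip().split(".")
    header = decode_segment(header_b64)
    payload = decode_segment(payload_b64)
    print("kid:", header.get("kid"))
    print("iss:", payload.get("iss"))
    print("aud:", payload.get("aud"))
    print("exp:", payload.get("exp"))
    print("sub:", payload.get("sub"))


if __name__ == "__main__":
    inspect_token(sys.stdin.read())
```

Usage would be something like `cat /var/run/secrets/kubernetes.io/serviceaccount/token | python3 inspect_token.py` from inside a pod (the script name is a placeholder).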

AverageMarcus commented 1 year ago

Attempts to replicate the upgrade issue using a Workload Cluster have failed.

Using values as similar as possible, I've tried the same upgrade from 0.18.0 to 0.20.2 that caused the issues on the MC last week and noticed a couple of differences:

  1. No "bearer token" errors in the apiserver logs, everything continues working
  2. ~The worker nodes also roll. When I did the upgrades to the MCs last week only the control plane nodes rolled.~

This makes me think the issue may be related to some TTL expiring or similar, so I have created several test WCs to leave for a few days before attempting the same upgrade.


EDIT: The worker nodes rolled because I mistakenly upgraded to 0.20.2 (which includes the change to the Ubuntu version), whereas last week I performed an upgrade from 0.18.0 -> 0.20.1. Performing the same upgrade on a WC does only roll the control plane nodes, as expected.

AverageMarcus commented 1 year ago

Some implementation details discovered while debugging:

Based on the above, I believe the issue lies in the kubelet. Unfortunately we don't have logs persisted from the kubelet, so we can't investigate further until we can replicate the issue. When we next perform an MC upgrade we should tail the kubelet logs on each of the (old and new) control plane nodes and look out for relevant error messages.
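One way to probe the kubelet theory even without persisted kubelet logs: the kubelet is responsible for refreshing the projected service account token mounted into each pod, so a small watcher running in a test pod during the upgrade would show whether token rotation keeps happening while the control plane nodes roll. A rough sketch (Python; the polling interval and output format are arbitrary choices, not anything from this issue):

```python
# Sketch: report when the kubelet rotates the projected service account token
# mounted at the default path inside a pod.
import hashlib
import time

TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def token_digest() -> str:
    with open(TOKEN_PATH, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


def watch(interval_seconds: int = 30) -> None:
    last = token_digest()
    print(time.strftime("%H:%M:%S"), "initial token digest", last[:12])
    while True:
        time.sleep(interval_seconds)
        current = token_digest()
        if current != last:
            print(time.strftime("%H:%M:%S"), "token rotated, new digest", current[:12])
            last = current


if __name__ == "__main__":
    watch()
```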

AverageMarcus commented 1 year ago

The current kubelet logs available on both the tornado MC (where the issue presented last week, but which now has new control plane nodes) and the keep1 WC I created yesterday don't show anything out of the ordinary.

AverageMarcus commented 1 year ago

Test case keep1:

Cluster age: 22h
Action performed: Upgrade App from 0.18.0 to 0.20.1
Upgraded cleanly: Yes

Notes: The apiserver logs contained a small number of authentication errors (below) but they were temporary and don't seem to have affected the health of the cluster at all.

kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:38.634005       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:41.378151       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:43.636165       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-212-164.eu-west-2.compute.internal kube-apiserver E1213 15:29:46.380304       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"

While this error is similar to what was seen during the previous MC upgrades, it wasn't printed anywhere near as often or as many times, and the error includes an additional note: "service account token has been invalidated".

AverageMarcus commented 1 year ago

Test case keep2:

Cluster age: 2d14h
Action performed: Upgrade App from 0.18.0 to 0.20.1
Upgraded cleanly: Yes

Notes:

Some errors in the logs similar to keep1, but nothing persistent.

kube-apiserver-ip-10-0-80-61.eu-west-2.compute.internal kube-apiserver E1215 08:03:37.736541       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-80-61.eu-west-2.compute.internal kube-apiserver E1215 08:03:42.738910       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-78-60.eu-west-2.compute.internal kube-apiserver E1215 08:03:21.578948       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, context deadline exceeded]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:39.771686       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:44.774180       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:47.800197       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:47.800999       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"
kube-apiserver-ip-10-0-228-80.eu-west-2.compute.internal kube-apiserver E1215 08:03:47.801661       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, service account token has been invalidated]"

AverageMarcus commented 1 year ago

Test case keep2 - take 2:

Cluster age: 2d14h
Action performed:

Notes:

This upgrade causes all nodes to cycle, including workers.

Some new logs in this situation:

kube-apiserver-ip-10-0-252-80.eu-west-2.compute.internal kube-apiserver E1215 08:16:43.535948       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
kube-apiserver-ip-10-0-252-80.eu-west-2.compute.internal kube-apiserver E1215 08:16:43.535934       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"
kube-apiserver-ip-10-0-252-80.eu-west-2.compute.internal kube-apiserver E1215 08:16:47.776249       1 authentication.go:63] "Unable to authenticate the request" err="[invalid bearer token, square/go-jose: error in cryptographic primitive]"

These logs are repeated a lot.

This causes issues similar to what we've seen during the MC upgrades, but in a very different way and with different logs.

I think it's safe to say the [cluster]-sa secret didn't change during the faulty upgrades.
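The go-jose "error in cryptographic primitive" message is a signature verification failure, which is what you'd expect if tokens were signed with a key the apiserver no longer trusts. A quick way to separate "signing key changed" from "token invalidated" is to verify a rejected token directly against the service account public key from the [cluster]-sa secret. A hedged sketch using PyJWT (the file names sa.pub and token.jwt are placeholders, not files referenced in this issue):

```python
# Sketch: verify a service account token against the cluster's SA public key.
# Assumes PyJWT with the cryptography extra is installed (pip install "pyjwt[crypto]")
# and that sa.pub / token.jwt are local copies of the public key and a rejected token.
import jwt

with open("sa.pub") as f:
    public_key = f.read()

with open("token.jwt") as f:
    token = f.read().strip()

try:
    claims = jwt.decode(
        token,
        key=public_key,
        algorithms=["RS256"],
        # Skip audience/expiry checks so only the signature is being tested here.
        options={"verify_aud": False, "verify_exp": False},
    )
    print("signature OK, issuer:", claims.get("iss"))
except jwt.InvalidSignatureError:
    print("signature does NOT match this key -> signing key mismatch")
```

If the signature verifies, the signing key didn't change and the rejection must come from something else (e.g. token invalidation or an issuer/audience mismatch).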

AverageMarcus commented 1 year ago

We hit this issue again today with the upgrade of goat.

Root cause is still unknown, but I managed to rule out some possible causes:

Some differences seen compared to tests performed on workload clusters:

AverageMarcus commented 1 year ago

Test case keep3:

Cluster age: 3d18h
Action performed: Upgrade App from 0.18.0 to 0.20.1
Upgraded cleanly: Yes

Notes:

Checking the last of the possible default TTLs (72 hours).

No issues.

AverageMarcus commented 1 year ago

Test case keep4:

Cluster age: 3d20h
Action performed: Upgrade App from 0.18.0 to 0.20.1
Upgraded cleanly: Yes

Notes:

Installed Kyverno, kyverno-policies, keep4-kyverno-policies-connectivity & keep4-kyverno-policies-dx

Kyverno didn't cause any issue with the upgrade.

AverageMarcus commented 1 year ago

I have been unable to replicate the issue using workload clusters. Whatever is causing the issue seems to only be present on management clusters.

It is also entirely possible that whatever caused the problem is already fixed in the latest versions of cluster-aws and default-apps-aws. We won't know for sure until we next upgrade an MC.


Plan for testing MC upgrades:

tuladhar commented 1 year ago

We (@tuladhar and @bdehri) upgraded grizzly and golem, and didn't encounter this issue. Closing it for now, but if we encounter the issue in the future we'll re-open and investigate further.