Closed: milan-stikic-cif closed this issue 1 year ago.
Unfortunately enabling IRSA on a live cluster is disruptive. I will make sure to mention this in our docs.
You need to delete any SA tokens and restart your Pods.
Hey @olemarkus, thanks for the reply! We did try deleting SA tokens, but only those of the cilium pods. Should we try recreating all SAs and Pods after enabling IRSA?
Restarting only the Pods where you are getting authorization errors should be sufficient.
Hi @olemarkus, unfortunately restarting Pods and deleting SA tokens didn't solve our problem. The cilium and cilium-operator pods are deployed as a DaemonSet and a Deployment, respectively. I tried deleting their service accounts along with the SA tokens and then redeploying the DaemonSet and Deployment, but pods that end up on the updated master instance still cannot authenticate to the API server, which sends them into CrashLoopBackOff with the same errors I posted above. As our cluster grows, so does our need for IRSA, so I am guessing we should recreate the whole cluster with kOps with OIDC service discovery enabled from the start?
Did you rotate the entire control plane first? If not, the old KCM may be handing out tokens with the old issuer. Or there may be API servers not trusting the new one.
Hey, we ended up creating an entirely new cluster and no longer see this problem. pod-identity-webhook works as intended and service accounts get the right AWS permissions. To answer your question: while trying to enable IRSA on the running cluster, not even re-creating the complete control plane (by terminating the master instances) helped us.
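When debugging an issuer mismatch like the one suspected above, a quick check is to decode the payload of the token a pod actually holds and look at its `iss` claim. The sketch below builds a dummy unsigned token so it is self-contained; on a real cluster you would instead read `/var/run/secrets/kubernetes.io/serviceaccount/token` from an affected pod (the standard projected-token mount path):

```shell
# Sketch: print the issuer ("iss") claim of a service-account JWT.
# On a real cluster:
#   TOKEN=$(kubectl exec <pod> -- cat /var/run/secrets/kubernetes.io/serviceaccount/token)
# Here we build an unsigned dummy token so the snippet is self-contained.
b64url() { printf '%s' "$1" | base64 | tr '+/' '-_' | tr -d '=\n'; }
TOKEN="$(b64url '{"alg":"none"}').$(b64url '{"iss":"https://oidc.example.com/cluster"}')."

# Extract the payload segment and re-pad it to a multiple of 4 characters.
seg=$(printf '%s' "$TOKEN" | cut -d. -f2)
while [ $(( ${#seg} % 4 )) -ne 0 ]; do seg="${seg}="; done
printf '%s' "$seg" | tr '_-' '/+' | base64 -d; echo
# prints {"iss":"https://oidc.example.com/cluster"}
```

If the printed issuer is the old one, the token was minted before the control plane was fully rotated.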
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Mark this issue or PR as rotten with `/lifecycle rotten`
- Close this issue or PR with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
I see similar behavior in my clusters, but with flannel. Unfortunately, creating a new k8s cluster is not an option for us.
As mentioned above, make sure you rotate the entire control plane and then make sure you restart all your pods (if you are running a recent k8s version).
Hi @olemarkus, is it possible to add a feature to kOps that allows specifying multiple `--service-account-issuer` values? That would be helpful and would reduce the disruption during the adoption of IRSA.
`--service-account-issuer` defines the identifier of the service account token issuer. You can specify the `--service-account-issuer` argument multiple times; this can be useful to enable a non-disruptive change of the issuer. When this flag is specified multiple times, the first is used to generate tokens and all are used to determine which issuers are accepted. You must be running Kubernetes v1.22 or later to be able to specify `--service-account-issuer` multiple times.
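Concretely, a dual-issuer configuration in a kube-apiserver static pod manifest might look like the fragment below. The issuer URLs are hypothetical; what matters is the ordering, since the first issuer signs new tokens while all listed issuers remain accepted during validation:

```yaml
# Fragment of a kube-apiserver static pod manifest (Kubernetes v1.22+).
# Hypothetical issuer URLs for illustration:
- --service-account-issuer=https://new-oidc-store.example.com   # signs new tokens
- --service-account-issuer=https://old-issuer.example.com       # still accepted
```

This is what makes a no-downtime issuer migration possible: old tokens stay valid while new tokens are already minted with the new issuer.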
Yeah. It wouldn't be too hard to implement, I think. Would be happy to review a PR.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
This bot triages issues according to the following rules:
lifecycle/stale
is appliedlifecycle/stale
was applied, lifecycle/rotten
is appliedlifecycle/rotten
was applied, the issue is closedYou can:
/reopen
/remove-lifecycle rotten
Please send feedback to sig-contributor-experience at kubernetes/community.
/close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
The problem we faced was that after deploying pod-identity-webhook at the same time as the OIDC changes, cilium pods won't start, because pod-identity-webhook is not yet available while the masters have not been rolled. You can exclude cilium from the mutating webhook belonging to pod-identity-webhook, but you still need to roll the masters.
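For reference, such a cilium exclusion can be sketched as an `objectSelector` on the mutating webhook. The metadata and webhook names below are assumptions to illustrate the shape; check the actual names in your cluster (cilium pods are conventionally labeled `k8s-app=cilium`):

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: pod-identity-webhook        # assumed name; verify in your cluster
webhooks:
- name: pod-identity-webhook.amazonaws.com   # assumed webhook name
  objectSelector:
    matchExpressions:
    - key: k8s-app
      operator: NotIn
      values: ["cilium"]            # skip pods labeled k8s-app=cilium
  # ...clientConfig, rules, etc. unchanged
```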
Unfortunately, enabling IRSA on a live cluster is disruptive. We have found the non-disruptive two-step process below.

Step 1. Add the following to the cluster spec:

```yaml
serviceAccountIssuerDiscovery:
  discoveryStore: s3://publicly-readable-store
  enableAWSOIDCProvider: false
```

and run `kops update cluster --yes` (you will shortly see pod-identity-webhook pods starting up, but it takes ~10 minutes for them to become fully operational, after which IRSA should be working).

Step 2. Change the spec to:

```yaml
serviceAccountIssuerDiscovery:
  enableAWSOIDCProvider: true
iam:
  useServiceAccountExternalPermissions: true
podIdentityWebhook:
  enabled: true
```

Hope it helps someone who stumbles across the same issue.
@p3rshin, hey Alex, thanks for sharing! What version of kops do you use? I'm getting an error:

```
# Found fields that are not recognized
# + enableAWSOIDCProvider: false
```
I managed to make the change non-disruptive by injecting a secondary `--service-account-issuer` into the kube-apiserver manifest on the master nodes. This uses the kube-apiserver feature that allows for a non-disruptive service-account-issuer change. Docs here.
The solution uses `spec.hooks` in the control plane InstanceGroup:

```yaml
kind: InstanceGroup
spec:
  hooks:
  - name: modify-kube-api-manifest
    before:
    - kubelet.service
    manifest: |
      User=root
      Type=oneshot
      ExecStart=/bin/bash -c "until [ -f /etc/kubernetes/manifests/kube-apiserver.manifest ]; do sleep 5; done; sed -i '/- --service-account-issuer=https:\/\/.*.amazonaws.com/a\ \ \ \ - --service-account-issuer=https:\/\/api.internal.[cluster-name].[domain]' /etc/kubernetes/manifests/kube-apiserver.manifest"
```
I understand that this solution is not great, but it helped us to move forward without modifying the kops code or doing manual interventions.
/kind bug

1. What `kops` version are you running? (The command `kops version` will display this information.) Version 1.23.2
2. What Kubernetes version are you running? (`kubectl version` will print the version if a cluster is running, or provide the Kubernetes version specified as a `kops` flag.) Version 1.21.5
3. What cloud provider are you using? Self-hosted K8s cluster on AWS
4. What commands did you run? What is the simplest way to reproduce this issue? Enabling serviceAccountIssuerDiscovery with enableAWSOIDCProvider: true and setting a bucket for JWKS, then running `kops rolling-update`
5. What happened after the commands executed? kops decided that 3 master nodes (3 in total) needed to be updated. Once it started updating the first master node, it got stuck because the cilium pod running on that master was getting the following errors. This eventually made the other cilium pods go into CrashLoop with the same message. In the end, the whole cluster became unusable as all pods stopped working at some point.
6. What did you expect to happen? That the cluster would start normally after `kops rolling-update` and start using the new serviceAccountIssuer with OIDC, enabling IAM Roles for Service Accounts.
7. Please provide your cluster manifest. (Execute `kops get --name my.example.com -o yaml` to display your cluster manifest. You may want to remove your cluster name and other sensitive information.)
8. Please run the commands with most verbose logging by adding the `-v 10` flag. (Paste the logs into this report, or in a gist and provide the gist link here.) The kubelet logs are the following:

Other cilium pods (the ones that were not on the updated master node) eventually get the following error messages:

9. Anything else do we need to know? I was following https://dev.to/olemarkus/irsa-support-for-kops-1doe and https://dev.to/olemarkus/zero-configuration-irsa-on-kops-1po1 for enabling IRSA on self-hosted k8s clusters. One difference is that we had cert-manager installed prior to trying to enable this, which is why spec.certManager has `managed: false` in our config. Also, the kube-apiserver on the updated master node never gets deployed in the meantime.