Is cert-manager running, and are all the control plane instances up to date?
Pod "indentity webhook" never started, so we tried to rollback.
Are there any error messages on the ReplicaSet or Deployment? I want to know why the pod did not start.
I confirmed this behavior.
kops update cluster
The cert-manager is not deployed because I haven't run kops rolling-update yet. But pod-identity-webhook has been deployed. Of course, the pod is not running because the certificate is not ready yet.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 81s default-scheduler Successfully assigned kube-system/pod-identity-webhook-699644494c-4zddg to ip-172-20-65-113.ap-northeast-1.compute.internal
Warning FailedMount 17s (x8 over 80s) kubelet MountVolume.SetUp failed for volume "cert" : secret "pod-identity-webhook-cert" not found
And the MutatingWebhookConfiguration is already deployed, so new pods cannot run.
$ k get mutatingwebhookconfiguration pod-identity-webhook
NAME WEBHOOKS AGE
pod-identity-webhook 1 8m18s
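For anyone checking the same state, a quick way to confirm it (resource names taken from the events and output above) is to verify that the webhook's certificate secret is missing and that no cert-manager pods exist yet:
$ kubectl -n kube-system get secret pod-identity-webhook-cert
$ kubectl -n cert-manager get pods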
OK, I will fix this issue.
/assign
Good that you can reproduce, but it sounds odd that installing cert-manager requires a rolling update. It shouldn't. So I am guessing something is blocking cert-manager.
Yes, I will investigate why cert-manager is not deployed.
Oh, I understand.
When spec.certManager.managed: false is set, cert-manager is not deployed.
https://github.com/h3poteto/kops/blob/812014788926660e183181955f07d43aedaa0ea8/upup/pkg/fi/cloudup/bootstrapchannelbuilder/bootstrapchannelbuilder.go#L600
@fabioaraujopt Please remove the spec.certManager.managed line, or specify spec.certManager.managed: true. If you do this, cert-manager will be deployed, and pod-identity-webhook will run.
This fixed the problem.
However, as our cluster had the broken pod-identity-webhook, every action on the cluster was broken. We needed to manually delete the MutatingWebhookConfiguration in order to restore the cluster.
Using kubectl get MutatingWebhookConfiguration --all-namespaces and the respective delete command.
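Concretely, the restore step was along these lines (the webhook name matches the output earlier in the thread; note that MutatingWebhookConfiguration is cluster-scoped, so no namespace flag is needed):
$ kubectl get mutatingwebhookconfiguration
$ kubectl delete mutatingwebhookconfiguration pod-identity-webhook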
However, kOps should validate this, right? If someone puts managed=false together with pod_identity_webhook=true, it should not allow the update?
managed=false is a special option.
The following cert-manager configuration allows provisioning cert-manager externally and allows all dependent plugins to be deployed. Please note that addons might run into errors until cert-manager is deployed.
https://kops.sigs.k8s.io/addons/#cert-manager
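The configuration referred to there looks roughly like this (a sketch; see the linked docs page for the authoritative example):
spec:
  certManager:
    enabled: true
    managed: false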
So I think that kops should not validate this behavior.
I agree that managed=false puts you in "know what you are doing" territory. It is not in itself a broken config. If one self-installs cert-manager one may want to use DNS validation, which in turn may need IRSA. So trying to get cert-manager to ignore the hook is not ideal either. But it does place oneself in a chicken/egg situation. Luckily one that is quite easy to get out of (deploy cert-manager, then webhook, then restart cert-manager pods).
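In practice, that recovery order could look roughly like this (a sketch assuming a self-installed cert-manager; the manifest version is a placeholder, and your own install method may differ):
$ kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/vX.Y.Z/cert-manager.yaml   # install cert-manager yourself
$ kops update cluster --yes                            # deploys the pod-identity-webhook addon
$ kubectl -n cert-manager rollout restart deployment   # restart the cert-manager pods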
The AWS Load Balancer Controller's validating webhook also relies on cert-manager's CA injector, so it won't function correctly without cert-manager installed either.
What I haven't tested yet is whether kOps will declare a cluster with the Pod Identity Webhook and the AWS Load Balancer Controller to be usable if we tell kOps to not manage cert-manager, but we haven't installed it yet. I suspect that the cluster won't settle, because the MutatingWebhookConfiguration for the Pod Identity Webhook will intervene in creating many pods. Fortunately, it ignores pods labeled with "kops.k8s.io/managed-by" with a value of "kops."
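For illustration, an exclusion like that is typically expressed through an objectSelector on the webhook; a sketch of what it could look like (not necessarily the exact manifest kOps ships):
webhooks:
- name: pod-identity-webhook.amazonaws.com
  objectSelector:
    matchExpressions:
    - key: kops.k8s.io/managed-by
      operator: NotIn
      values: ["kops"]
  # clientConfig, rules, failurePolicy, etc. omitted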
/kind bug
1. What kops version are you running? The command kops version will display this information.
Version 1.23.0
2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag.
Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
We are trying to deprecate kiam in favor of kOps IRSA, so we made the following changes in configuration. The pod-identity-webhook pod never started, so we tried to roll back.
5. What happened after the commands executed?
The pod-identity-webhook pod appeared in the cluster but failed, unable to mount the volume "certs". Any newly scheduled pod then failed to start with the error:
Error occurred: Internal error occurred: failed calling webhook "pod-identity-webhook.amazonaws.com": Post "https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s": dial tcp 10.0.18.232:443: connect: connection refused
We tried to roll back all changes by resetting all values to defaults and manually deleting all ConfigMaps and pod-identity deployments. The problem was still happening even after the rollback.
6. What did you expect to happen?
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.
9. Anything else do we need to know?
We are trying to deprecate KIAM in favor of kops IRSA. Can KIAM and kops IRSA coexist?