kubernetes / kops

Kubernetes Operations (kOps) - Production Grade k8s Installation, Upgrades and Management
https://kops.sigs.k8s.io/
Apache License 2.0

kOps IRSA failed calling webhook "pod-identity-webhook.amazonaws.com" #13459

Closed. fabioaraujopt closed this issue 2 years ago.

fabioaraujopt commented 2 years ago

/kind bug

1. What kops version are you running? The command kops version will display this information. Version 1.23.0

2. What Kubernetes version are you running? kubectl version will print the version if a cluster is running or provide the Kubernetes version specified as a kops flag. Server Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.8"}

3. What cloud provider are you using? AWS

4. What commands did you run? What is the simplest way to reproduce this issue? We are trying to deprecate kiam in favor of kOps IRSA, so we made the following changes to the configuration:

iam:
  allowContainerRegistry: true
  legacy: false
  serviceAccountExternalPermissions:
  - aws:
      policyARNs:
      - arn:aws:iam::XXX:policy/test-s3-policy
    name: test-s3-serviceaccount
    namespace: test-namespace
  useServiceAccountExternalPermissions: true
serviceAccountIssuerDiscovery:
  discoveryStore: s3://XXX
  enableAWSOIDCProvider: true
podIdentityWebhook:
  enabled: true
certManager:
  enabled: true
  managed: false
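
For context, the intent of serviceAccountExternalPermissions above is that pods using the named ServiceAccount are mutated by the pod-identity webhook to assume an IAM role carrying the listed policy. A minimal sketch of such a consumer, reusing the name and namespace from the config (the image and command are placeholders):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: test-s3-serviceaccount
  namespace: test-namespace
---
apiVersion: v1
kind: Pod
metadata:
  name: s3-client
  namespace: test-namespace
spec:
  serviceAccountName: test-s3-serviceaccount
  containers:
  - name: app
    image: amazon/aws-cli        # placeholder image
    command: ["aws", "s3", "ls"] # exercises the attached S3 policy

When the webhook is healthy, it injects the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables and the projected token volume into such pods.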

Pod "indentity webhook" never started, so we tried to rollback.

5. What happened after the commands executed? The pod-identity-webhook pod appeared in the cluster but failed to mount its "cert" volume, and any newly scheduled pod then failed to start with the error:

Error occurred: Internal error occurred: failed calling webhook "pod-identity-webhook.amazonaws.com": Post "https://pod-identity-webhook.kube-system.svc:443/mutate?timeout=10s": dial tcp 10.0.18.232:443: connect: connection refused
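
The "connection refused" here means the webhook Service in kube-system has no ready backends. A quick way to confirm that (the app label is an assumption about how the addon labels its pods):

kubectl -n kube-system get endpoints pod-identity-webhook
kubectl -n kube-system get pods -l app=pod-identity-webhook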

We tried to roll back all changes by restoring all values to their defaults, and manually deleted all related ConfigMaps and pod-identity-webhook deployments. The problem was still happening even after the rollback.

6. What did you expect to happen?

7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 12
  name: XXXXX
spec:
  api:
    loadBalancer:
      class: Network
      sslCertificate: arn:aws:acm:eu-west-1:XXX:certificate/XXX
      sslPolicy: ELBSecurityPolicy-2016-08
      type: Public
  authorization:
    rbac: {}
  channel: stable
  cloudLabels:
    cost_center: eng
    environment: dev
  cloudProvider: aws
  configBase: s3://XXXX/XXXXX
  dnsZone: XXXX
  etcdClusters:
  - cpuRequest: 200m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: main
  - cpuRequest: 100m
    etcdMembers:
    - instanceGroup: master-eu-west-1a
      name: a
    - instanceGroup: master-eu-west-1b
      name: b
    - instanceGroup: master-eu-west-1c
      name: c
    memoryRequest: 100Mi
    name: events
  externalPolicies:
    master:
    - arn:aws:iam::XXX:policy/XXXX
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    anonymousAuth: false
    oidcClientID: XXXXX
    oidcIssuerURL: https://accounts.google.com
    oidcUsernameClaim: email
  kubelet:
    anonymousAuth: false
    authenticationTokenWebhook: true
    authorizationMode: Webhook
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.20.8
  masterPublicName: XXXXX
  networkCIDR: 10.0.0.0/16
  networkID: XXXX
  networking:
    amazonvpc: {}
  nonMasqueradeCIDR: 10.0.0.0/16
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 10.0.32.0/19
    id: XXX
    name: eu-west-1a
    type: Private
    zone: eu-west-1a
  - cidr: 10.0.64.0/19
    id: XXX
    name: eu-west-1b
    type: Private
    zone: eu-west-1b
  - cidr: 10.0.96.0/19
    id: XXX
    name: eu-west-1c
    type: Private
    zone: eu-west-1c
  - cidr: 10.0.0.0/22
    id: XXX
    name: utility-eu-west-1a
    type: Utility
    zone: eu-west-1a
  - cidr: 10.0.4.0/22
    id: XXX
    name: utility-eu-west-1b
    type: Utility
    zone: eu-west-1b
  - cidr: 10.0.8.0/22
    id: XXX
    name: utility-eu-west-1c
    type: Utility
    zone: eu-west-1c
  topology:
    bastion:
      bastionPublicName: XXXXX
    dns:
      type: Public
    masters: private
    nodes: private

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-12-02T16:10:51Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: XXXXX
  name: apps-spot-2-cpu-8-ram
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXXX
  machineType: t3a.large
  maxSize: 10
  minSize: 0
  mixedInstancesPolicy:
    instances:
    - t3.large
    - m5a.large
    - m5.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
  nodeLabels:
    kops.k8s.io/instancegroup: apps-spot-2-cpu-8-ram
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-12-02T16:10:51Z"
  generation: 5
  labels:
    kops.k8s.io/cluster: XXXXX
  name: apps-spot-4-cpu-16-ram
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXX
  machineType: t3a.xlarge
  maxSize: 5
  minSize: 0
  mixedInstancesPolicy:
    instances:
    - t3.xlarge
    - m5a.xlarge
    - m5.xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
  nodeLabels:
    kops.k8s.io/instancegroup: apps-spot-4-cpu-16-ram
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-12-02T16:10:51Z"
  generation: 3
  labels:
    kops.k8s.io/cluster: XXXXX
  name: apps-spot-8-cpu-32-ram
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXX
  machineType: t3a.2xlarge
  maxSize: 5
  minSize: 0
  mixedInstancesPolicy:
    instances:
    - t3.2xlarge
    - m5a.2xlarge
    - m5.2xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
  nodeLabels:
    kops.k8s.io/instancegroup: apps-spot-8-cpu-32-ram
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-07-27T13:08:07Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: XXXX
  name: bastions
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXX
  machineType: t2.micro
  maxSize: 1
  minSize: 1
  nodeLabels:
    kops.k8s.io/instancegroup: bastions
  role: Bastion
  subnets:
  - utility-eu-west-1a
  - utility-eu-west-1b
  - utility-eu-west-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-07-27T13:08:07Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: XXXX
  name: critical-apps
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXXX
  machineType: t3a.xlarge
  maxSize: 3
  minSize: 3
  nodeLabels:
    CriticalAddonsOnly: "true"
    kops.k8s.io/instancegroup: critical-apps
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  taints:
  - CriticalAddonsOnly=true:NoSchedule

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-29T17:07:55Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: XXXXX
  name: master-eu-west-1a
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXXX
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3.medium
    onDemandAboveBase: 1
    onDemandBase: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1a
  role: Master
  subnets:
  - eu-west-1a

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-29T17:07:55Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: XXXX
  name: master-eu-west-1b
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXX
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3.medium
    onDemandAboveBase: 1
    onDemandBase: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1b
  role: Master
  subnets:
  - eu-west-1b

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-11-29T17:07:55Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: XXXX
  name: master-eu-west-1c
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
  image: XXX
  machineType: t3a.medium
  maxSize: 1
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - t3.medium
    onDemandAboveBase: 1
    onDemandBase: 1
  nodeLabels:
    kops.k8s.io/instancegroup: master-eu-west-1c
  role: Master
  subnets:
  - eu-west-1c

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-12-29T16:43:50Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: XXXX
  name: XXXXX
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
    k8s.io/cluster-autoscaler/node-template/label/XXXX: apps
  image: XXXX
  machineType: t3a.large
  maxSize: 2
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - m5a.large
    - m5.large
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
  nodeLabels:
    kops.k8s.io/instancegroup: XXX
    wsudOnly: "true"
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  taints:
  - wsudOnly=true:NoSchedule

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-12-29T16:43:50Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: XXXX
  name: XXXX
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
    k8s.io/cluster-autoscaler/node-template/label/XXXX: coordinator
  image: XXX
  machineType: t3a.xlarge
  maxSize: 2
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - m5a.xlarge
    - m5.xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
  nodeLabels:
    kops.k8s.io/instancegroup: XXXX
    wsudOnly: "true"
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  taints:
  - wsudOnly=true:NoSchedule

---

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2021-12-29T17:11:06Z"
  generation: 2
  labels:
    kops.k8s.io/cluster: XXXX
  name: XXXX
spec:
  cloudLabels:
    cost_center: eng
    environment: dev
    k8s.io/cluster-autoscaler/node-template/label/XX: worker
  image: XXXX
  machineType: t3a.xlarge
  maxSize: 4
  minSize: 1
  mixedInstancesPolicy:
    instances:
    - m5a.xlarge
    - m5.xlarge
    onDemandAboveBase: 0
    onDemandBase: 0
    spotAllocationStrategy: lowest-price
  nodeLabels:
    kops.k8s.io/instancegroup: XXX
    wsudOnly: "true"
  role: Node
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
  taints:
  - wsudOnly=true:NoSchedule

8. Please run the commands with the most verbose logging by adding the -v 10 flag. Paste the logs into this report, or in a gist and provide the gist link here.

9. Anything else we need to know? We are trying to deprecate kiam in favor of kOps IRSA. Can kiam and kOps IRSA coexist?

olemarkus commented 2 years ago

Is cert-manager running, and are all the control plane instances up to date?

h3poteto commented 2 years ago

Pod "indentity webhook" never started, so we tried to rollback.

Are there any error messages on the ReplicaSet or Deployment? I want to know why the pod did not start.
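
For example, something like this would surface the Deployment's rollout status and any pod events (the label selector is an assumption; adjust it to whatever the addon's pods actually carry):

kubectl -n kube-system describe deployment pod-identity-webhook
kubectl -n kube-system describe pods -l app=pod-identity-webhook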

h3poteto commented 2 years ago

I confirmed this behavior:

  1. Create a new cluster without cert-manager and pod-identity-webhook.
  2. Edit the cluster to add cert-manager and pod-identity-webhook.
  3. Run kops update cluster.

cert-manager is not deployed because I haven't run kops rolling-update yet, but pod-identity-webhook has been deployed. Of course, its pod is not running because the certificate has not been prepared:

Events:
  Type     Reason       Age                From               Message
  ----     ------       ----               ----               -------
  Normal   Scheduled    81s                default-scheduler  Successfully assigned kube-system/pod-identity-webhook-699644494c-4zddg to ip-172-20-65-113.ap-northeast-1.compute.internal
  Warning  FailedMount  17s (x8 over 80s)  kubelet            MountVolume.SetUp failed for volume "cert" : secret "pod-identity-webhook-cert" not found

And the MutatingWebhookConfiguration is deployed, so new pods cannot run:

$ k get mutatingwebhookconfiguration pod-identity-webhook 
NAME                   WEBHOOKS   AGE
pod-identity-webhook   1          8m18s
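
Two things worth checking in this state (the secret name is taken from the event above): whether the serving-cert secret exists yet, and whether the webhook's failurePolicy is Fail, which would explain pod creation being blocked rather than skipped:

kubectl -n kube-system get secret pod-identity-webhook-cert
kubectl get mutatingwebhookconfiguration pod-identity-webhook -o jsonpath='{.webhooks[0].failurePolicy}'
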
h3poteto commented 2 years ago

OK, I will fix this issue.

/assign

olemarkus commented 2 years ago

Good that you can reproduce it, but it sounds odd that installing cert-manager requires a rolling update. It shouldn't. So I am guessing something is blocking cert-manager.

h3poteto commented 2 years ago

Yes, I will investigate why cert-manager is not deployed.

h3poteto commented 2 years ago

Oh, I understand.

When spec.certManager.managed: false is set, cert-manager is not deployed: https://github.com/h3poteto/kops/blob/812014788926660e183181955f07d43aedaa0ea8/upup/pkg/fi/cloudup/bootstrapchannelbuilder/bootstrapchannelbuilder.go#L600

h3poteto commented 2 years ago

@fabioaraujopt Please remove the spec.certManager.managed line, or set spec.certManager.managed: true. If you do this, cert-manager will be deployed and pod-identity-webhook will run.
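
Concretely, the certManager block from the original change becomes (a sketch; omitting managed should be equivalent, since it defaults to kops managing cert-manager):

certManager:
  enabled: true
  managed: true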

fabioaraujopt commented 2 years ago

This fixed the problem. However, as our cluster had the broken pod-identity-webhook, every action on the cluster was broken. We needed to manually delete the MutatingWebhookConfiguration in order to restore the cluster, using kubectl get MutatingWebhookConfiguration --all-namespaces and the respective delete command.
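
For anyone hitting the same state, the recovery is essentially (MutatingWebhookConfiguration is cluster-scoped, so no namespace flag is needed):

kubectl get mutatingwebhookconfiguration
kubectl delete mutatingwebhookconfiguration pod-identity-webhook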

However, shouldn't kops validate this? If someone sets certManager.managed: false together with podIdentityWebhook.enabled: true, shouldn't the update be rejected?

h3poteto commented 2 years ago

managed=false is a special option.

The following cert-manager configuration allows provisioning cert-manager externally and allows all dependent plugins to be deployed. Please note that addons might run into errors until cert-manager is deployed.

https://kops.sigs.k8s.io/addons/#cert-manager

So I think that kops should not validate this behavior.

olemarkus commented 2 years ago

I agree that managed=false puts you in "know what you are doing" territory. It is not in itself a broken config. If one self-installs cert-manager, one may want to use DNS validation, which in turn may need IRSA. So trying to get cert-manager to ignore the hook is not ideal either. But it does place one in a chicken/egg situation. Luckily, it is one that is quite easy to get out of (deploy cert-manager, then the webhook, then restart the cert-manager pods).
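
A sketch of that escape sequence, assuming a self-managed install from the upstream cert-manager release manifest (vX.Y.Z is a placeholder; pick a real release):

# 1. Install cert-manager yourself (the self-managed path implied by managed: false)
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/vX.Y.Z/cert-manager.yaml
# 2. Let kops deploy the pod-identity-webhook addon
kops update cluster --yes
# 3. Restart cert-manager so its pods are recreated through the now-working webhook
kubectl -n cert-manager rollout restart deployment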

seh commented 2 years ago

The AWS Load Balancer Controller's validating webhook also relies on cert-manager's CA injector, so it won't function correctly without cert-manager installed either.

What I haven't tested yet is whether kOps will declare a cluster with the Pod Identity Webhook and the AWS Load Balancer Controller to be usable if we tell kOps not to manage cert-manager but we haven't installed it yet. I suspect that the cluster won't settle, because the MutatingWebhookConfiguration for the Pod Identity Webhook will intervene in creating many pods. Fortunately, it ignores pods labeled with "kops.k8s.io/managed-by" set to "kops".
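
For illustration, that exclusion looks roughly like this inside the MutatingWebhookConfiguration (a sketch of the selector mechanism, not the exact manifest kOps ships):

webhooks:
- name: pod-identity-webhook.amazonaws.com
  objectSelector:
    matchExpressions:
    - key: kops.k8s.io/managed-by
      operator: NotIn
      values: ["kops"]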