projectcapsule / capsule

Multi-tenancy and policy-based framework for Kubernetes.
https://projectcapsule.dev/
Apache License 2.0

capsule-mutating-webhook-configuration prevents new masters from becoming ready #128

Closed d-m closed 4 years ago

d-m commented 4 years ago

Bug description

When rotating my cluster's master ASG, the capsule-mutating-webhook-configuration seemed to prevent new masters from becoming Ready: they did not show as Ready in kubectl get nodes, although they appeared functional otherwise. There were no logs in the capsule pods, but kube-apiserver.log contained the errors described below. Once I deleted capsule-mutating-webhook-configuration, the kube-apiserver errors stopped and the masters showed as Ready.

My cluster is Kubernetes 1.18.10 deployed via Kops 1.8.2 to AWS.

How to reproduce

Steps to reproduce the behavior:

  1. Install capsule using the helm chart with the following values:
    manager:
      image:
        repository: quay.io/clastix/capsule
        pullPolicy: IfNotPresent
        tag: ''
      options:
        logLevel: '4'
        forceTenantPrefix: true
      resources:
        limits:
          cpu: 200m
          memory: 128Mi
        requests:
          cpu: 200m
          memory: 128Mi
    proxy:
      image:
        repository: gcr.io/kubebuilder/kube-rbac-proxy
        pullPolicy: IfNotPresent
        tag: "v0.5.0"
      resources:
        limits:
          cpu: 100m
          memory: 128Mi
        requests:
          cpu: 10m
          memory: 64Mi
    mutatingWebhooksTimeoutSeconds: 30
    imagePullSecrets: []
    serviceAccount:
      create: true
      annotations: {}
      name: "capsule"
    podAnnotations: {}
    priorityClassName: "system-cluster-critical"
    nodeSelector:
      node-role.kubernetes.io/master: ""
    tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
    - effect: NoSchedule
      key: node-role.kubernetes.io/master
    replicaCount: 3
    affinity: {}
    podSecurityPolicy:
      enabled: true
  2. Deploy a tenant with the following manifest:
    apiVersion: capsule.clastix.io/v1alpha1
    kind: Tenant
    metadata:
      labels:
      annotations:
      name: test
    spec:
      owner:
        name: test-admin
        kind: User
      storageClasses:
        allowedRegex: ".*"
        allowed:
      ingressClasses:
        allowedRegex: ""
        allowed:
        - traefik
        - default
      namespaceQuota: 3
      resourceQuotas:
      - hard:
          limits.cpu: "8"
          limits.memory: 16Gi
          requests.cpu: "8"
          requests.memory: 16Gi
        scopes: ["NotTerminating"]
      - hard:
          pods: "10"
          services: "5"
      - spec:
          hard:
            requests.storage: "100Gi"
      limitRanges:
      - limits:
        - type: Pod
          min:
            cpu: "50m"
            memory: "5Mi"
          max:
            cpu: "1"
            memory: "1Gi"
        - type: Container
          defaultRequest:
            cpu: "100m"
            memory: "10Mi"
          default:
            cpu: "200m"
            memory: "100Mi"
          min:
            cpu: "50m"
            memory: "5Mi"
          max:
            cpu: "1"
            memory: "1Gi"
        - type: PersistentVolumeClaim
          min:
            storage: "1Gi"
          max:
            storage: "10Gi"
      networkPolicies:
      - policyTypes:
        - Ingress
        - Egress
        podSelector: {}
        ingress:
        - from:
          - namespaceSelector: {}
          - podSelector: {}
          - ipBlock:
              cidr: 192.168.0.0/16
        egress:
        - to:
          - ipBlock:
              cidr: 0.0.0.0/0
              except:
              - 192.168.0.0/16
  3. Rotate master ASGs

Expected behavior

New masters appear Ready when checked with kubectl get nodes.

Logs

The following error appears once per second in kube-apiserver.log:

...
E1104 15:08:39.615226       1 dispatcher.go:170] failed calling webhook "service.labels.capsule.clastix.io": Post https://capsule-webhook-service.capsule-system.svc:443/mutate-v1-service-labels?timeout=10s: context canceled
I1104 15:08:46.235935       1 trace.go:116] Trace[985615564]: "Call mutating webhook" configuration:capsule-mutating-webhook-configuration,webhook:service.labels.capsule.clastix.io,resource:/v1, Resource=endpoints,subresource:,operation:UPDATE,UID:14443bcd-409d-4393-bd99-2f248345ddf8 (started: 2020-11-04 15:08:36.236768171 +0000 UTC m=+2109.282384989) (total time: 9.99911521s):
W1104 15:08:46.236003       1 dispatcher.go:169] Failed calling webhook, failing open service.labels.capsule.clastix.io: failed calling webhook "service.labels.capsule.clastix.io": Post https://capsule-webhook-service.capsule-system.svc:443/mutate-v1-service-labels?timeout=10s: context canceled
...


MaxFedotov commented 4 years ago

Hello @d-m.

You need to set the mutatingWebhooksTimeoutSeconds parameter in the chart values to 10 seconds. The issue is related to an internal kube-apiserver timeout, and setting mutatingWebhooksTimeoutSeconds to 10 or less will fix it.
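
A minimal sketch of that override, assuming the chart reads mutatingWebhooksTimeoutSeconds as a top-level key (as it appears in the values posted in the report). The file name values-override.yaml is purely illustrative:

```yaml
# values-override.yaml (hypothetical file name)
# Assumption: mutatingWebhooksTimeoutSeconds is a top-level chart value,
# as in the values listed in the bug report (where it was set to 30).
# Keep it at or below 10 seconds, per the advice above.
mutatingWebhooksTimeoutSeconds: 10
```

A file like this can be passed to Helm with the -f flag when installing or upgrading the release; all other values keep whatever they were set to elsewhere.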

Btw, @bsctl @prometherion, shall we make 10 seconds the default value? This seems to be the same issue that I ran into in https://github.com/clastix/capsule-helm-chart/issues/14

prometherion commented 4 years ago

@MaxFedotov honestly, I would start considering handling the Service metadata through a reconciliation loop rather than a webhook.

Do you recall what the blocker was?

MaxFedotov commented 4 years ago

@prometherion according to https://github.com/clastix/capsule/pull/84#pullrequestreview-485207090 we decided to implement it later. Seems like that time has come :)

MaxFedotov commented 4 years ago

@prometherion shall I create a new issue for moving Services from the webhook to a controller, or shall we track the work in this one?

prometherion commented 4 years ago

@MaxFedotov please use this issue as the tracking one.

MaxFedotov commented 4 years ago

Hi @d-m,

We refactored the code responsible for handling Service labels and annotations and moved it from the webhook to a controller. This problem should no longer occur.

d-m commented 4 years ago

@MaxFedotov thanks! I also shortened the timeout, which has worked in the meantime.

MaxFedotov commented 4 years ago

@d-m oh, forgot to mention. This fix is available in https://github.com/clastix/capsule/releases/tag/v0.0.2