d-m commented 4 years ago

Bug description

When rotating my cluster's master ASG, the capsule-mutating-webhook-configuration appeared to prevent new masters from becoming Ready in the output of kubectl get nodes. The masters appeared functional otherwise. There were no logs in the capsule pods; however, there were errors in kube-apiserver.log, described below. Once I deleted capsule-mutating-webhook-configuration, the kube-apiserver errors stopped and the masters showed as Ready.

My cluster is Kubernetes 1.18.10 deployed via Kops 1.8.2 to AWS.

How to reproduce

Steps to reproduce the behavior:

Install capsule using the helm chart with the following values:

manager:
  image:
    repository: quay.io/clastix/capsule
    pullPolicy: IfNotPresent
    tag: ''
  options:
    logLevel: '4'
    forceTenantPrefix: true
  resources:
    limits:
      cpu: 200m
      memory: 128Mi
    requests:
      cpu: 200m
      memory: 128Mi
proxy:
  image:
    repository: gcr.io/kubebuilder/kube-rbac-proxy
    pullPolicy: IfNotPresent
    tag: "v0.5.0"
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 10m
      memory: 64Mi
mutatingWebhooksTimeoutSeconds: 30
imagePullSecrets: []
serviceAccount:
  create: true
  annotations: {}
  name: "capsule"
podAnnotations: {}
priorityClassName: "system-cluster-critical"
nodeSelector:
  node-role.kubernetes.io/master: ""
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
replicaCount: 3
affinity: {}
podSecurityPolicy:
  enabled: true

Deploy a tenant with the following manifest:

apiVersion: capsule.clastix.io/v1alpha1
kind: Tenant
metadata:
  labels:
  annotations:
  name: test
spec:
  owner:
    name: test-admin
    kind: User
  storageClasses:
    allowedRegex: ".*"
    allowed:
  ingressClasses:
    allowedRegex: ""
    allowed:
      - traefik
      - default
  namespaceQuota: 3
  resourceQuotas:
    - hard:
        limits.cpu: "8"
        limits.memory: 16Gi
        requests.cpu: "8"
        requests.memory: 16Gi
      scopes: ["NotTerminating"]
    - hard:
        pods: "10"
        services: "5"
    - spec:
        hard:
          requests.storage: "100Gi"
  limitRanges:
    - limits:
      - type: Pod
        min:
          cpu: "50m"
          memory: "5Mi"
        max:
          cpu: "1"
          memory: "1Gi"
      - type: Container
        defaultRequest:
          cpu: "100m"
          memory: "10Mi"
        default:
          cpu: "200m"
          memory: "100Mi"
        min:
          cpu: "50m"
          memory: "5Mi"
        max:
          cpu: "1"
          memory: "1Gi"
      - type: PersistentVolumeClaim
        min:
          storage: "1Gi"
        max:
          storage: "10Gi"
  networkPolicies:
    - policyTypes:
      - Ingress
      - Egress
      podSelector: {}
      ingress:
      - from:
        - namespaceSelector: {}
        - podSelector: {}
        - ipBlock:
            cidr: 192.168.0.0/16
      egress:
      - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
            - 192.168.0.0/16

Rotate master ASGs

Expected behavior

New masters appear Ready when checked with kubectl get nodes.

Logs

The following error appears once per second in kube-apiserver.log:

...
E1104 15:08:39.615226       1 dispatcher.go:170] failed calling webhook "service.labels.capsule.clastix.io": Post https://capsule-webhook-service.capsule-system.svc:443/mutate-v1-service-labels?timeout=10s: context canceled
I1104 15:08:46.235935       1 trace.go:116] Trace[985615564]: "Call mutating webhook" configuration:capsule-mutating-webhook-configuration,webhook:service.labels.capsule.clastix.io,resource:/v1, Resource=endpoints,subresource:,operation:UPDATE,UID:14443bcd-409d-4393-bd99-2f248345ddf8 (started: 2020-11-04 15:08:36.236768171 +0000 UTC m=+2109.282384989) (total time: 9.99911521s):
W1104 15:08:46.236003       1 dispatcher.go:169] Failed calling webhook, failing open service.labels.capsule.clastix.io: failed calling webhook "service.labels.capsule.clastix.io": Post https://capsule-webhook-service.capsule-system.svc:443/mutate-v1-service-labels?timeout=10s: context canceled
...

Additional context

Capsule version: 0.0.1
Capsule Helm chart pinned to https://github.com/clastix/capsule-helm-chart/commit/73ce37cbc68d73b97719d665aa582ba485c860e9
Kubernetes version: 1.18.10
Kops: 1.8.2
Cloud: AWS

MaxFedotov commented 4 years ago

Hello @d-m.

You need to set the mutatingWebhooksTimeoutSeconds parameter in the chart values to 10 seconds. This issue is related to an internal kube-apiserver timeout, and setting mutatingWebhooksTimeoutSeconds to 10 or less will fix it.
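For illustration, a minimal sketch of that change, assuming mutatingWebhooksTimeoutSeconds is a top-level key in the chart values (as it appears in the values posted in the report) and that every other value stays as it is:

```yaml
# Sketch only: override the webhook timeout reported above as 30.
# Assumes the key sits at the top level of the chart values file.
mutatingWebhooksTimeoutSeconds: 10
```

Any value of 10 or lower should behave the same way with respect to the timeout described above.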

Btw, @bsctl @prometherion, shall we make 10 seconds the default value? This seems like the same issue that I ran into in https://github.com/clastix/capsule-helm-chart/issues/14

prometherion commented 4 years ago

@MaxFedotov honestly, I would start considering handling the Service metadata through a reconciliation loop rather than a webhook.

Do you recall what the blocker was?

MaxFedotov commented 4 years ago

@prometherion according to https://github.com/clastix/capsule/pull/84#pullrequestreview-485207090 we decided to implement it later. Seems like that time has come :)

MaxFedotov commented 4 years ago

@prometherion shall I create a new issue for moving services from the webhook to a controller, or should we keep this work in this one?

prometherion commented 4 years ago

@MaxFedotov please use this issue as the tracking one.

MaxFedotov commented 4 years ago

Hi @d-m,

We refactored the code responsible for handling service labels/annotations and moved it from the webhook to a controller. This problem should no longer appear.

d-m commented 4 years ago

@MaxFedotov thanks! I also shortened the timeout, which has worked in the meantime.

MaxFedotov commented 4 years ago

@d-m oh, I forgot to mention: this fix is available in https://github.com/clastix/capsule/releases/tag/v0.0.2

projectcapsule / capsule

capsule-mutating-webhook-configuration prevents new masters from becoming ready #128
