projectcapsule / capsule

Multi-tenancy and policy-based framework for Kubernetes.
https://capsule.clastix.io
Apache License 2.0

Capsule pods occasionally become unresponsive #143

Closed · d-m closed this issue 3 years ago

d-m commented 3 years ago

Bug description

Occasionally the Capsule pods become unresponsive, which prevents the creation of resources because of the webhook configurations. Deleting the Capsule pods and allowing the deployment to recreate them fixes the issue.

Our cluster does not have any Tenant resources deployed.

How to reproduce

Unfortunately, this issue seems difficult to reproduce. I generally notice it when deployments (via Helm, kubectl, or Terraform) fail due to the webhook.
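For context, the webhook configurations intercepting these requests can be listed directly, which at least confirms which webhook and failurePolicy are in play (a sketch; the configuration names are whatever the chart installed):

$ kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations | grep -i capsule

$ kubectl get mutatingwebhookconfigurations -o yaml | grep -iE 'name:|failurePolicy|timeoutSeconds'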

The cluster currently has no Tenant resources deployed. Capsule itself was installed using the Helm chart with the following values:

manager:
  options:
    logLevel: "4"
    forceTenantPrefix: true
  resources:
    limits:
      cpu: 200m
      memory: 128Mi
    requests:
      cpu: 200m
      memory: 128Mi
proxy:
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
    requests:
      cpu: 10m
      memory: 64Mi
mutatingWebhooksTimeoutSeconds: 10
imagePullSecrets: []
serviceAccount:
  create: true
  annotations: {}
  name: "capsule"
podAnnotations: {}
priorityClassName: "system-cluster-critical"
nodeSelector:
  node-role.kubernetes.io/master: ""
tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
replicaCount: 3
affinity: {}
podSecurityPolicy:
  enabled: true
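For anyone trying to reproduce the setup, the values above can be applied with a standard Helm install along these lines (a sketch; the repository alias/URL and release name are assumptions, not taken from the report):

$ helm repo add clastix https://clastix.github.io/charts

$ helm upgrade --install capsule clastix/capsule \
    --namespace capsule-system --create-namespace \
    --values values.yaml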

Expected behavior

Kubernetes resources should be deployed successfully.

Logs

Before deleting the Capsule pods

Logs from Terraform deployment:

Error: Internal error occurred: failed calling webhook "owner.namespace.capsule.clastix.io": Post https://capsule-webhook-service.capsule-system.svc:443/mutate-v1-namespace-owner-reference?timeout=10s: dial tcp 100.67.156.14:443: connect: connection refused

Logs from curling Capsule service after port-forwarding:

$ kubectl port-forward -n capsule-system svc/capsule-webhook-service 9000:443 

Forwarding from 127.0.0.1:9000 -> 9443
Forwarding from [::1]:9000 -> 9443
Handling connection for 9000
E1118 10:15:38.146961   82444 portforward.go:400] an error occurred forwarding 9000 -> 9443: error forwarding port 9443 to pod c2feb15cef20494dea08213093a2d9dd44b72276e3c64db1094cafde831ea9d3, uid : exit status 1: 2020/11/18 15:15:38 socat[6408] E connect(5, AF=2 127.0.0.1:9443, 16): Connection refused
Handling connection for 9000

Logs from curl command:

$ curl -vk https://localhost:9000/mutate-v1-namespace-owner-reference

*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:9000 
* Closing connection 0
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to localhost:9000 

Capsule manager logs:

$ kubectl logs -l app.kubernetes.io/name=capsule -n capsule-system -c manager

{"level":"info","ts":1605648943.3741796,"logger":"controllers.CA","msg":"Reconciling CA Secret","Request.Namespace":"capsule-system","Request.Name":"capsule-ca"}
{"level":"info","ts":1605648943.3744116,"logger":"controller.clusterrolebinding","msg":"Starting workers","reconcilerGroup":"rbac.authorization.k8s.io","reconcilerKind":"ClusterRoleBinding","worker count":1}
{"level":"info","ts":1605648943.3746004,"logger":"controllers.CA","msg":"Handling CA Secret","Request.Namespace":"capsule-system","Request.Name":"capsule-ca"}
{"level":"info","ts":1605648943.3746245,"logger":"controllers.CA","msg":"Updating CA secret with new PEM and RSA","Request.Namespace":"capsule-system","Request.Name":"capsule-ca"}
{"level":"info","ts":1605648943.3749886,"logger":"controller.endpointslice","msg":"Starting workers","reconcilerGroup":"discovery.k8s.io","reconcilerKind":"EndpointSlice","worker count":1}
{"level":"info","ts":1605648943.3751705,"logger":"controller.clusterrole","msg":"Starting workers","reconcilerGroup":"rbac.authorization.k8s.io","reconcilerKind":"ClusterRole","worker count":1}
{"level":"info","ts":1605648943.3765306,"logger":"controller.service","msg":"Starting workers","reconcilerGroup":"","reconcilerKind":"Service","worker count":1}
{"level":"info","ts":1605648943.3768785,"logger":"controller.endpoints","msg":"Starting workers","reconcilerGroup":"","reconcilerKind":"Endpoints","worker count":1}
{"level":"info","ts":1605648943.377297,"logger":"controller.tenant","msg":"Starting workers","reconcilerGroup":"capsule.clastix.io","reconcilerKind":"Tenant","worker count":1}
{"level":"info","ts":1605648943.5868814,"logger":"controllers.CA","msg":"Reconciliation completed, processing back in 87647h58m36s","Request.Namespace":"capsule-system","Request.Name":"capsule-ca"}

After deleting and recreating the Capsule pods

Logs from curling Capsule service after port-forwarding:

$ kubectl port-forward -n capsule-system svc/capsule-webhook-service 9000:443 

Forwarding from 127.0.0.1:9000 -> 9443
Forwarding from [::1]:9000 -> 9443
Handling connection for 9000

Logs from curl command:

$ curl -vk https://localhost:9000/mutate-v1-namespace-owner-reference

*   Trying ::1...
* TCP_NODELAY set
* Connected to localhost (::1) port 9000 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=UK; ST=; L=London; street=27, Old Gloucester Street; postalCode=WC1N 3AX; O=Clastix
*  start date: Nov 15 15:45:23 2020 GMT
*  expire date: May 15 15:45:05 2021 GMT
*  issuer: C=UK; ST=; L=London; street=27, Old Gloucester Street; postalCode=WC1N 3AX; O=Clastix
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x7fef35809600)
> GET /mutate-v1-namespace-owner-reference HTTP/2
> Host: localhost:9000
> User-Agent: curl/7.64.1
> Accept: */*
> 
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!
< HTTP/2 200 
< content-type: text/plain; charset=utf-8
< content-length: 128
< date: Wed, 18 Nov 2020 15:25:47 GMT
< 
{"response":{"uid":"","allowed":false,"status":{"metadata":{},"message":"contentType=, expected application/json","code":400}}}
* Connection #0 to host localhost left intact
* Closing connection 0
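The 400 above is expected for a plain GET: the handler only accepts AdmissionReview POSTs with a JSON content type, so this response actually confirms the server is reachable again. A hand-rolled probe would look roughly like this (a sketch; the uid and object are placeholders, and older controller-runtime versions may expect admission.k8s.io/v1beta1 instead of v1):

$ curl -vk -X POST https://localhost:9000/mutate-v1-namespace-owner-reference \
    -H 'Content-Type: application/json' \
    -d '{"apiVersion":"admission.k8s.io/v1","kind":"AdmissionReview","request":{"uid":"00000000-0000-0000-0000-000000000000","kind":{"group":"","version":"v1","kind":"Namespace"},"resource":{"group":"","version":"v1","resource":"namespaces"},"operation":"CREATE","object":{"apiVersion":"v1","kind":"Namespace","metadata":{"name":"probe-test"}}}}'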

Capsule manager logs:

$ kubectl logs -l app.kubernetes.io/name=capsule -n capsule-system -c manager

{"level":"debug","ts":1605658060.3148696,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/validating-v1-namespace-tenant-prefix","UID":"15080ec0-4199-40b3-b020-39fd4882699b","kind":"/v1, Kind=Namespace","resource":{"group":"","version":"v1","resource":"namespaces"}}
{"level":"debug","ts":1605658060.3149276,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validating-v1-namespace-tenant-prefix","code":200,"reason":"","UID":"15080ec0-4199-40b3-b020-39fd4882699b","allowed":true}
{"level":"debug","ts":1605658060.315012,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/validate-v1-namespace-quota","UID":"1b17ad59-d2e6-4a12-8589-e401bd6c51b3","kind":"/v1, Kind=Namespace","resource":{"group":"","version":"v1","resource":"namespaces"}}
{"level":"debug","ts":1605658060.3150485,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validate-v1-namespace-quota","code":200,"reason":"","UID":"1b17ad59-d2e6-4a12-8589-e401bd6c51b3","allowed":true}
{"level":"debug","ts":1605658060.4641361,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/validating-v1-pvc","UID":"7e5f4df2-16e1-4b6c-93f3-7d699af6b770","kind":"/v1, Kind=PersistentVolumeClaim","resource":{"group":"","version":"v1","resource":"persistentvolumeclaims"}}
{"level":"debug","ts":1605658060.4643464,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validating-v1-pvc","code":200,"reason":"","UID":"7e5f4df2-16e1-4b6c-93f3-7d699af6b770","allowed":true}
{"level":"debug","ts":1605658060.659953,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/validating-ingress","UID":"f6d2580a-3964-42ba-80f9-b56e1d0d82be","kind":"networking.k8s.io/v1beta1, Kind=Ingress","resource":{"group":"networking.k8s.io","version":"v1beta1","resource":"ingresses"}}
{"level":"debug","ts":1605658060.6603382,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validating-ingress","code":200,"reason":"","UID":"f6d2580a-3964-42ba-80f9-b56e1d0d82be","allowed":true}
{"level":"debug","ts":1605658112.5750766,"logger":"controller-runtime.webhook.webhooks","msg":"received request","webhook":"/validating-ingress","UID":"4af8e034-aecd-4e44-9143-712ff1071110","kind":"extensions/v1beta1, Kind=Ingress","resource":{"group":"extensions","version":"v1beta1","resource":"ingresses"}}
{"level":"debug","ts":1605658112.5760896,"logger":"controller-runtime.webhook.webhooks","msg":"wrote response","webhook":"/validating-ingress","code":200,"reason":"","UID":"4af8e034-aecd-4e44-9143-712ff1071110","allowed":true}

Additional context

prometherion commented 3 years ago

@d-m first of all, thanks for the detailed bug report, really appreciated! We're looking forward to fixing this issue and making Capsule more robust.

Could you increase the verbosity of the logs using the --zap-devel flag? It enables the development logging mode, which provides more detailed stack traces.
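For reference, if the chart doesn't expose that flag directly, it can be added by editing the manager Deployment's container args (a sketch; the Deployment name below is an assumption, use whatever the first command lists):

$ kubectl -n capsule-system get deployments

$ kubectl -n capsule-system edit deployment capsule
  # append "--zap-devel" to the manager container's args and let the pods roll out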

What's odd is that the Pod is not being killed by the readiness/liveness probes. I'd expect these to be set, but it's worth double-checking.

From a resource point of view, do you see increased memory consumption or abnormal CPU usage? A Grafana dashboard screenshot would be great for checking whether we're hitting an OOM kill or something is eating the CPU.
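If it helps narrow things down, the standard cAdvisor and kube-state-metrics series are enough to spot an OOM kill or a pegged CPU. A rough sketch of querying them directly (the Prometheus URL is a placeholder for whatever your monitoring stack exposes, and the last query needs kube-state-metrics):

$ curl -sG http://prometheus.example.local/api/v1/query \
    --data-urlencode 'query=rate(container_cpu_usage_seconds_total{namespace="capsule-system",container="manager"}[5m])'

$ curl -sG http://prometheus.example.local/api/v1/query \
    --data-urlencode 'query=container_memory_working_set_bytes{namespace="capsule-system",container="manager"}'

$ curl -sG http://prometheus.example.local/api/v1/query \
    --data-urlencode 'query=kube_pod_container_status_last_terminated_reason{namespace="capsule-system",reason="OOMKilled"}'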

@bsctl and @gdurifw are also running Capsule in a production environment: do you have any other insights to share that could help debug this?

d-m commented 3 years ago
[Screenshot: Grafana resource dashboard for the Capsule pods, 2020-11-19 9:23 AM]

Here's a screenshot of the dashboard from around the time we recreated the pods. I'll increase the log level and see what we find. Capsule has been pretty stable for the last 24 hours since recreating the pods, though.

prometherion commented 3 years ago

Thanks for sharing; everything seems OK at first glance.

Next time, please try hitting the /readyz and /healthz probe endpoints rather than the TLS-terminated webhook port: we'd also expect the logs to give a detailed report of what's not working there.
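For reference, a quick way to check those probes from outside is to port-forward the manager's health port (a sketch; the Deployment name, probe port, and paths below are assumptions, read the actual values from the Deployment's probe spec first):

$ kubectl -n capsule-system get deployment -o yaml | grep -B2 -A4 -iE 'livenessProbe|readinessProbe'

$ kubectl -n capsule-system port-forward deploy/capsule 8081:8081
  # substitute the Deployment name and port found above

$ curl -s http://localhost:8081/healthz
$ curl -s http://localhost:8081/readyz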

d-m commented 3 years ago

Sounds good. I enabled the --zap-devel flag and will keep an eye on it, and I'll check those probes as well if it happens again. I should note that we have four clusters provisioned identically and noticed this issue on each one. Capsule has been stable on every cluster since restarting the pods yesterday.

prometherion commented 3 years ago

Hey @d-m, any news about this issue?

d-m commented 3 years ago

No, it seems to have been stable since restarting the pods. Feel free to close; now that I have debug logs enabled, I'll reply with more info if it happens again.