@d-m first of all, thanks for the detailed bug report: really appreciated, we're looking forward to fixing this issue and making Capsule more robust!
I'd like to ask if you can increase the verbosity of the logs using the `--zap-devel` flag, which enables the development logging feature and provides more detailed stack traces.
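Something like this should work to enable it on a running install via a JSON patch; the namespace and deployment name here are assumptions based on a default Helm install, so adjust them to your cluster:

```shell
# Append --zap-devel to the manager container's args
# (namespace and deployment name are assumed; adjust to your install)
kubectl -n capsule-system patch deployment capsule-controller-manager \
  --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--zap-devel"}]'
```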
What's weird is that the Pod is not being killed by the readiness/liveness probes; I expect these have been set, but it's better to double-check.
From a resource point of view, do you see an increase in memory consumption or abnormal CPU usage? A Grafana dashboard screenshot would be great to see if we're hitting an OOM kill or something is messing with the CPU.
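If metrics-server is installed, something like this would help spot-check current usage and past OOM kills (the namespace is an assumption from a default install):

```shell
# Spot-check current CPU/memory of the Capsule pods (requires metrics-server)
kubectl -n capsule-system top pods

# Check whether any container was previously OOM-killed
kubectl -n capsule-system get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
```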
@bsctl and @gdurifw are also running Capsule in a production environment: do you have any other insight to share to debug this?
Here's a screenshot of the dashboard from around the time we recreated the pods. I'll increase the log level and see what we find. Capsule has been stable for the last 24 hours since recreating the pods, though.
Thanks for sharing, everything seems ok at first glance.
Please, next time try to hit the `/readyz` and `/healthz` probe endpoints rather than the TLS-terminated webhook port: we'd also expect a detailed report in the logs about what's not working there.
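A sketch of what that probe check could look like; the port (8081) is only an assumption based on the controller-runtime default, so adjust it to whatever the manager's health probe bind address is actually set to:

```shell
# Port-forward to the manager's health probe port (8081 assumed; check the
# manager's flags for the real value) and hit both probe endpoints
kubectl -n capsule-system port-forward deploy/capsule-controller-manager 8081:8081 &

curl -i http://localhost:8081/healthz
curl -i http://localhost:8081/readyz
```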
Sounds good. I enabled the `--zap-devel` flag and will keep an eye on it, and I'll check those probes as well if it happens again. I should note that we have four clusters provisioned identically and noticed this issue on each one. Capsule has been stable on each cluster since restarting the pods yesterday.
Hey @d-m, any news about this issue?
No, it seems to have been stable since restarting the pods. Feel free to close; now that I have debug logs enabled, I'll reply with more info if it happens again.
Bug description
Occasionally the Capsule pods become unresponsive, which prevents the creation of resources due to the webhook configurations. Deleting the Capsule pods and allowing the deployment to recreate them fixes the issue.
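For reference, the same workaround expressed as a rollout restart rather than deleting pods by hand (the namespace and deployment name are assumptions from a default Helm install):

```shell
# Restart the Capsule pods by rolling the deployment
kubectl -n capsule-system rollout restart deployment/capsule-controller-manager
```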
Our cluster does not have any Tenant resources deployed.
How to reproduce
Unfortunately, this issue seems difficult to reproduce. I generally notice it when deployments (via Helm, kubectl, or Terraform) fail due to the webhook.
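When a deployment fails this way, a rough sketch for inspecting the webhook wiring; the configuration name below is an assumption, so list them first if it differs:

```shell
# List webhook configurations and find the Capsule ones
kubectl get validatingwebhookconfigurations
kubectl get mutatingwebhookconfigurations

# Inspect failurePolicy, timeoutSeconds, and the service each webhook points at
# (configuration name assumed; use the names from the list above)
kubectl get validatingwebhookconfigurations capsule-validating-webhook-configuration -o yaml
```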
The cluster currently has no Tenant resources deployed. Capsule itself was installed using the Helm chart with the following manifest:
Expected behavior
Kubernetes resources should be deployed successfully.
Logs
Before deleting Capsule pods
Logs from Terraform deployment:
Logs from curling Capsule service after port-forwarding:
Logs from curl command:
Capsule manager logs:
After deleting and recreating Capsule pods
Logs from curling Capsule service after port-forwarding:
Logs from curl command:
Capsule manager logs:
Additional context