smerle33 closed this issue 1 year ago
TL;DR: incompatibility of the application image with cgroup v2.
The problem started after the Kubernetes upgrade: the cluster was recently upgraded to Kubernetes 1.25.
First step is to check on Datadog: going back 4 months, we can see that the memory issues and pod restarts for plugin-site-backend match a specific date at the beginning of June (around June 8th):
To confirm, we can check in Azure, within Resource health,
from the publick8s cluster, under "Diagnose and solve problems",
and then Node health:
The solution is explained here.
So we need to work on the image underlying plugin-site-backend: https://github.com/jenkins-infra/plugin-site-api
to make sure it uses a patched version of the JDK that is compliant with cgroup v2 (https://github.com/jenkins-infra/plugin-site-api/blob/416279518ff3444904c28b1ef3aa56ca3ff7d38b/Dockerfile#L1), with a parent image like jetty:9-jdk8
or, more specifically, jetty:9.4.52-jdk8-eclipse-temurin.
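To confirm the mismatch, one can check which cgroup version a node exposes. A minimal sketch, assuming a Linux shell on the node (or inside a pod):

```shell
# On a cgroup v2 node, /sys/fs/cgroup is a single cgroup2fs mount;
# on cgroup v1 it is a tmpfs containing per-controller hierarchies.
stat -fc %T /sys/fs/cgroup/
# A JDK without cgroup v2 support reads only the v1 paths, so it never
# sees the container memory limit and sizes its heap from the node's
# total RAM instead, which eventually gets the pod OOM-killed.
```

On a patched JDK, `java -XshowSettings:system -version` should report the detected container memory limit under "Operating System Metrics", which is a quick way to verify the new base image behaves correctly.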
As side work, we may also want to move the build of this image from trusted.ci to infra.ci.
https://github.com/jenkins-infra/helpdesk/issues/3778 is fixed. The new plugin-site-api container image is working as expected, but the currently deployed one keeps being OOM-killed: we need to deploy the latest version with the changes from https://github.com/jenkins-infra/plugin-site-api/pull/119
Helm chart updated, with a successful test of the new memory limit finally enforced: https://github.com/jenkins-infra/helm-charts/pull/855
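For illustration, the enforced memory limit in the Helm chart might look like the following values fragment. This is a hypothetical sketch (names and sizes are illustrative); the actual change is in jenkins-infra/helm-charts pull 855 linked above.

```yaml
# Hypothetical values.yaml fragment for the plugin-site-backend deployment.
# With a cgroup v2-aware JDK, the JVM sizes its heap from this limit
# instead of from the node's total RAM.
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "1Gi"
```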
https://github.com/jenkins-infra/kubernetes-management/pull/4527 deployed the new image to production. No service outage, and the new pod seems to use the expected amount of memory:
Service(s)
plugins.jenkins.io
Summary
While checking deployments on our publick8s Kubernetes cluster, I noticed that the plugin-site-backend pod was being restarted for OOMKilled:
277 restarts as of now.
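A sketch of how the restart count and termination reason can be spotted. Against the live cluster one would run something like `kubectl get pod -l app=plugin-site-backend -o json` (label and namespace assumed); below, a minimal sample of the relevant status fields is inlined instead so the extraction step is reproducible:

```shell
# Inlined sample of the container status fields kubectl would return.
cat > /tmp/pod.json <<'EOF'
{
  "status": {
    "containerStatuses": [
      {
        "name": "plugin-site-backend",
        "restartCount": 277,
        "lastState": { "terminated": { "reason": "OOMKilled", "exitCode": 137 } }
      }
    ]
  }
}
EOF
# exitCode 137 = 128 + SIGKILL(9): the kernel OOM killer terminated the JVM.
grep -o '"reason": "OOMKilled"' /tmp/pod.json
```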
Reproduction steps
No response