jenkins-infra / helpdesk

Open your Infrastructure related issues here for the Jenkins project
https://github.com/jenkins-infra/helpdesk/issues/new/choose

plugin site backend OOM killed #3774

Closed smerle33 closed 1 year ago

smerle33 commented 1 year ago

Service(s)

plugins.jenkins.io

Summary

While checking the deployments on our publick8s Kubernetes cluster, I noticed that the plugin-site backend pod had been restarted repeatedly after being OOMKilled:

NAMESPACE↑   NAME                                   PF  READY  RESTARTS  STATUS   CPU  MEM   %CPU/R  %CPU/L  %MEM/R  %MEM/L  IP            NODE                               AGE
plugin-site  plugin-site-backend-7fcb4c77c8-z294l   ●   1/1    277       Running  9    1940  1       0       189     94      10.100.13.24  aks-x86medium-20522204-vmss00000i  8d
plugin-site  plugin-site-frontend-59bfb957c4-289cl  ●   1/1    0         Running  2    12    2       2       38      38      10.100.4.30   aks-x86medium-20522204-vmss000005  12d
plugin-site  plugin-site-frontend-59bfb957c4-l6gl2  ●   1/1    0         Running  2    26    2       2       82      82      10.100.13.17  aks-x86medium-20522204-vmss00000i  8d
plugin-site  plugin-site-issues-54f968bd64-7cszz    ●   1/1    0         Running  20   123   n/a     n/a     n/a     n/a     10.100.13.18  aks-x86medium-20522204-vmss00000i  8d

277 restarts as of now.
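The backend row above gives enough information to estimate the pod's memory request and limit. The sketch below is only a back-of-the-envelope calculation: it assumes the MEM column is in MiB and that %MEM/R and %MEM/L are usage as a percentage of the request and of the limit, rounded to whole numbers. The function name `estimate_mib` is made up for this illustration.

```python
# Estimate the memory request and limit of plugin-site-backend from the
# figures in the pod listing. Assumption: MEM is in MiB and the two
# percentages are rounded, so the results are approximate.

def estimate_mib(used_mib: float, percent_of_target: float) -> float:
    """Return the request or limit implied by a usage figure and its percentage."""
    return used_mib / (percent_of_target / 100.0)


used_mib = 1940        # MEM column
pct_of_request = 189   # %MEM/R column
pct_of_limit = 94      # %MEM/L column

request_mib = estimate_mib(used_mib, pct_of_request)
limit_mib = estimate_mib(used_mib, pct_of_limit)

print(f"estimated request: {request_mib:.0f} MiB")
print(f"estimated limit:   {limit_mib:.0f} MiB")
```

This gives roughly 1026 MiB for the request and 2064 MiB for the limit, so the pod is using almost twice its request and sits just under its limit. The configured values themselves are not stated in this issue.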

Reproduction steps

No response

github-actions[bot] commented 1 year ago

Take a look at these similar issues to see if there isn't already a response to your problem:

  1. 72% #3696
smerle33 commented 1 year ago

TL;DR: the application image is incompatible with cgroup v2, and the problem started after the Kubernetes upgrade (the cluster was recently upgraded to Kubernetes 1.25).


The first step is to check Datadog: going back 4 months, we can see that the memory issues and pod restarts for plugin-site-backend start on a specific date at the beginning of June (around June 8th):

Screenshot 2023-10-06 at 08 55 06 Screenshot 2023-10-06 at 08 55 23

To confirm, we can check in Azure, under Resource health for the publick8s cluster, in "Diagnose and solve problems" and then "Node health":

Screenshot 2023-10-06 at 09 06 21 Screenshot 2023-10-06 at 09 06 33

The solution is explained here:

Screenshot 2023-10-06 at 09 06 41

So we need to work on the image underlying plugin-site-backend, https://github.com/jenkins-infra/plugin-site-api, to make sure it uses a patched version of the JDK that is compliant with cgroup v2. The base image is set in https://github.com/jenkins-infra/plugin-site-api/blob/416279518ff3444904c28b1ef3aa56ca3ff7d38b/Dockerfile#L1 and could be replaced with a parent image like jetty:9-jdk8, or a more specific one like jetty:9.4.52-jdk8-eclipse-temurin.
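To illustrate why the cgroup version matters, here is a minimal sketch of how a process can look up its memory limit under each layout. It is not how the JDK does it, only an illustration: the function name `read_memory_limit` and the `cgroup_root` parameter are hypothetical, and the file paths are the standard Linux locations (`memory.max` for cgroup v2, `memory/memory.limit_in_bytes` for cgroup v1). A runtime that only knows the v1 path finds nothing on a v2 node and cannot size itself from the container limit.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_memory_limit(cgroup_root: str = "/sys/fs/cgroup") -> Tuple[str, Optional[int]]:
    """Return (cgroup_version, memory_limit_in_bytes) as seen by this process.

    The limit is None when no limit is set or no known file is found.
    """
    root = Path(cgroup_root)

    # cgroup v2 (unified hierarchy): cgroup.controllers exists at the root,
    # and the limit lives in memory.max, where "max" means unlimited.
    if (root / "cgroup.controllers").exists():
        limit_file = root / "memory.max"
        if not limit_file.exists():
            return "v2", None
        raw = limit_file.read_text().strip()
        limit = None if raw == "max" else int(raw)
        return "v2", limit

    # cgroup v1: the memory controller has its own directory.
    v1_file = root / "memory" / "memory.limit_in_bytes"
    if v1_file.exists():
        return "v1", int(v1_file.read_text().strip())

    return "unknown", None


if __name__ == "__main__":
    version, limit = read_memory_limit()
    print(f"cgroup {version}, memory limit: {limit}")
```

Running this inside the container (for example through `kubectl exec`, if Python is available in the image) shows which layout the node exposes and whether the pod's memory limit is visible to the process.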

As a side task, we may also want to move the build of this image from trusted.ci to infra.ci.

smerle33 commented 1 year ago

Blocked by https://github.com/jenkins-infra/helpdesk/issues/3778

dduportal commented 1 year ago

https://github.com/jenkins-infra/helpdesk/issues/3778 is fixed. The new plugin-site-api container image that was deployed works as expected but is still being OOM-killed: we need to deploy the latest version with the changes from https://github.com/jenkins-infra/plugin-site-api/pull/119

dduportal commented 1 year ago

Helm chart update, with a successful test showing the new memory limit finally being enforced: https://github.com/jenkins-infra/helm-charts/pull/855

dduportal commented 1 year ago

https://github.com/jenkins-infra/kubernetes-management/pull/4527 deployed the new image to production. No service outage and the new pod seems to use the expected amount of memory:

Screenshot 2023-10-11 at 17 23 11