timja / jenkins-gh-issues-poc-06-18

0 stars 0 forks source link

[JENKINS-63766] Metaspace memory leak in Pipeline when running on Java 11 (but not 8 or 17) #1217

Closed timja closed 2 years ago

timja commented 4 years ago

Hello,

If the following pipeline script is run as a pipeline job in Jenkins, The instantiated class is not collected by the garbage collector:

class asdf {
  void run () {}
}

new asdf().run()

Even after a garbage collection is run using jcmd, the instantiated class is still there:

$ jcmd  GC.run

# wait for garbage collection to complete

$ jcmd  GC.class_stats | grep asdf
11804    13 0640  48     3344  13       470      2472     1904      5008      6912 asdf

With one class or a few classes, this issue might not lead to any problems. But our pipelines have a lot of classes and we're encountering a steady increase in the MetaSpace and OutOfMemoryExceptions (CompressedClassSpace) after every 2 to 3 weeks. Additionally the Metaspace increases by several GBs per week to about 10 GB in 2 to 3 weeks.

 

Some more information about the environment:


Originally reported by shoelzle, imported from: Metaspace memory leak in Pipeline when running on Java 11 (but not 8 or 17)
  • assignee: basil
  • status: Closed
  • priority: Blocker
  • resolution: Fixed
  • resolved: 2022-05-27T01:41:44+00:00
  • imported: 2022/01/10
timja commented 4 years ago

shoelzle:

With Jenkins 2.204.6, Oracle JDK 8 and the workflow-cps-plugin version 2.83 the problem does not occur.

timja commented 4 years ago

jochenafuerbacher:

I tested this with both official docker images: jenkins/jenkins:jdk11 and jenkins/jenkins:2.259-slim (jdk 8).

While the problem of shoelzle also occures on jenkins:jdk11 (with nothing else installed than workflow-job and workflow-cps), it does not occure on the jdk 8 image (jenkins:2.259-slim).

timja commented 4 years ago

jochenafuerbacher:

Same behavior with AdoptOpenJDK13 and Oracle JDK 11. Also changing the GC via -XX:+UseG1GC or -XX:+UseParallelGC didn't solve the problem.

timja commented 3 years ago

JIRAUSER131417:

Can some take care of this issue? Seems to be a serious bug when not using java 8.

Due to this issue we need to restart our jenksin master instance every couple of days.

timja commented 3 years ago

abdennour:

same issue . memory leak

 

timja commented 3 years ago

JIRAUSER131417:

same circumstances

timja commented 3 years ago

abdennour:

Just quick update. The root cause of memory leak was :

The issue was fixed by

timja commented 3 years ago

kredens:

abdennour which version contains the fixes?

timja commented 3 years ago

shoelzle:

Hallo abdennour,
I don't see how one of your root causes might fit to the example given in the description.
Could you try the example from the description in order to verify that your fixes apply to this issue.

timja commented 2 years ago

basil:

CC jglick markewaite. This could potentially be a blocker.

timja commented 2 years ago

jglick:

Tried to reproduce using MemoryAssert.assertGC without success. If reproducible, someone would need to look at a heap dump and see what is actually being held, and why https://github.com/jenkinsci/workflow-cps-plugin/blob/52d3de6044d0d3373ce513a5ff2f1de15d572d5d/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L1281-L1436 does not take care of it.

timja commented 2 years ago

basil:

I did a clean installation of Jenkins with nothing but the suggested plugins and ran the Pipeline script provided by the reporter. I confirmed (by increasing logging) that https://github.com/jenkinsci/workflow-cps-plugin/blob/52d3de6044d0d3373ce513a5ff2f1de15d572d5d/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java#L1281-L1436 found the asdf class as expected. I then took a heap dump and searched for the asdf class in the heap dump. I found asdf in the list of objects, with a count of 0 and a retained size of 0 bytes. Based on the above, I conclude the original issue as submitted by the reporter is no longer valid. I am closing this bug as Cannot Reproduce.

timja commented 2 years ago

shoelzle:

basil,
Please note that the Metaspace is not part of the Java Heap Memory.
And It's not the Java Heap that grows from that Memory leak, but the Metaspace, more specifically the CompressedClassSpace part of the Metaspace.
So it's plausible for me that you don't see any space blocked in the Heap Memory.

The issue ist still present in Jenkins 2.344 run in a Java 11 JVM.

timja commented 2 years ago

basil:

Thank you shoelzle for the clarification. I can now reproduce the problem. I created a new Pipeline job with

class asdf {
  void run () {}
}

new asdf().run()

When running the above Pipeline under Java 8 with -verbose:class:

[Loaded asdf from file:/groovy/shell]
[Loaded WorkflowScript from file:/groovy/shell]
[…] // jcmd ${PID} GC.run
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor33 0x00000007c0878428]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor32 0x00000007c0878028]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor31 0x00000007c0877c28]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor30 0x00000007c0877828]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor29 0x00000007c0877428]
[Unloading class sun.reflect.GeneratedSerializationConstructorAccessor28 0x00000007c0877028]
[Unloading class WorkflowScript 0x00000007c084b290]
[Unloading class asdf 0x00000007c084b028]
[Unloading class org.eclipse.jetty.server.session.DefaultSessionCache$$Lambda$302/1809148827 0x00000007c07ec828]
[Loaded sun.reflect.GeneratedConstructorAccessor84 from __JVM_DefineClass__]

When running the same Pipeline under Java 11 with -verbose:class:

[10.332s][info][class,load  ] asdf source: file:/groovy/shell
[10.337s][info][class,load  ] WorkflowScript source: file:/groovy/shell
[…] // jcmd ${PID} GC.run
[33.509s][info][class,load  ] jdk.internal.reflect.GeneratedConstructorAccessor81 source: __JVM_DefineClass__
[38.551s][info][class,load  ] jdk.internal.reflect.GeneratedConstructorAccessor82 source: __JVM_DefineClass__
[38.552s][info][class,load  ] jdk.internal.reflect.GeneratedMethodAccessor27 source: __JVM_DefineClass__
[38.553s][info][class,load  ] jdk.internal.reflect.GeneratedMethodAccessor28 source: __JVM_DefineClass__

It can clearly be seen that the asdf class is loaded in both cases but is only unloaded from the Metaspace in Java 8.

timja commented 2 years ago

basil:

The problem appears to be resolved when running under Java 17:

[38.700s][info][class,load] asdf source: file:/groovy/shell
[38.702s][info][class,load] WorkflowScript source: file:/groovy/shell
[…] // jcmd ${PID} GC.run
[63.316s][info][class,load] jdk.internal.reflect.GeneratedMethodAccessor76 source: __ClassDefiner__
[63.317s][info][class,load] jdk.internal.reflect.GeneratedMethodAccessor77 source: __ClassDefiner__
[64.378s][info][class,unload] unloading class jdk.internal.reflect.GeneratedSerializationConstructorAccessor33 0x00000008013b4400
[64.378s][info][class,unload] unloading class jdk.internal.reflect.GeneratedSerializationConstructorAccessor32 0x00000008013b4000
[64.378s][info][class,unload] unloading class jdk.internal.reflect.GeneratedSerializationConstructorAccessor31 0x00000008013b3c00
[64.378s][info][class,unload] unloading class jdk.internal.reflect.GeneratedSerializationConstructorAccessor30 0x00000008013b3800
[64.378s][info][class,unload] unloading class jdk.internal.reflect.GeneratedSerializationConstructorAccessor29 0x00000008013b3400
[64.378s][info][class,unload] unloading class jdk.internal.reflect.GeneratedSerializationConstructorAccessor28 0x00000008013b3000
[64.378s][info][class,unload] unloading class WorkflowScript 0x0000000801394280
[64.378s][info][class,unload] unloading class asdf 0x0000000801394000
timja commented 2 years ago

basil:

I ran the above test 10 times (all on Linux x86_64) and found the following results:

Temurin 1.8.0_332-b09 - PASS
AdoptOpenJDK 9.0.4+11 - FAIL
AdoptOpenJDK 10.0.2+13 - FAIL
Temurin 11.0.15+10 - FAIL
AdoptOpenJDK 12.0.2+10 - FAIL
AdoptOpenJDK 13.0.2+8 - FAIL
AdoptOpenJDK 14.0.2+12 - FAIL
AdoptOpenJDK 15.0.2+7 - FAIL
Temurin 16.0.2+7 - PASS
Temurin 17.0.3+7 - PASS

So Java versions 9 through 15 are affected.

timja commented 2 years ago

basil:

The last failing commit is openjdk/jdk@42a6eadb853. The first passing commit is openjdk/jdk@2ee2b4ae19d. I confirmed that backporting openjdk/jdk@2ee2b4ae19d to jdk11u-dev resolves the problem on Java 11.

timja commented 2 years ago

basil:

The upstream bug is JDK-8231454. As described in the upstream bug, a memory memory leak can occur when java.beans.Introspector#getBeanInfo is invoked against a class, even if java.beans.Introspector#flushFromCaches is later called on that class.

In our case, java.beans.Introspector#getBeanInfo is invoked from the following stack trace:

getBeanInfo:196, Introspector (java.beans)
run:3328, MetaClassImpl$15 (groovy.lang)
doPrivileged:-1, AccessController (java.security)
addProperties:3326, MetaClassImpl (groovy.lang)
initialize:3303, MetaClassImpl (groovy.lang)
getMetaClassUnderLock:289, ClassInfo (org.codehaus.groovy.reflection)
getMetaClass:331, ClassInfo (org.codehaus.groovy.reflection)
getMetaClass:277, MetaClassRegistryImpl (org.codehaus.groovy.runtime.metaclass)
getMetaClass:905, InvokerHelper (org.codehaus.groovy.runtime)
createCallConstructorSite:86, CallSiteArray (org.codehaus.groovy.runtime.callsite)
defaultCallConstructor:59, CallSiteArray (org.codehaus.groovy.runtime.callsite)
callConstructor:238, AbstractCallSite (org.codehaus.groovy.runtime.callsite)
call:208, Checker$3 (org.kohsuke.groovy.sandbox.impl)
onNewInstance:42, GroovyInterceptor (org.kohsuke.groovy.sandbox)
onNewInstance:173, SandboxInterceptor (org.jenkinsci.plugins.scriptsecurity.sandbox.groovy)
call:205, Checker$3 (org.kohsuke.groovy.sandbox.impl)
checkedConstructor:210, Checker (org.kohsuke.groovy.sandbox.impl)
constructorCall:21, SandboxInvoker (com.cloudbees.groovy.cps.sandbox)
dispatchOrArg:97, FunctionCallBlock$ContinuationImpl (com.cloudbees.groovy.cps.impl)
fixName:78, FunctionCallBlock$ContinuationImpl (com.cloudbees.groovy.cps.impl)
invoke0:-1, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:62, NativeMethodAccessorImpl (jdk.internal.reflect)
invoke:43, DelegatingMethodAccessorImpl (jdk.internal.reflect)
invoke:566, Method (java.lang.reflect)
receive:72, ContinuationPtr$ContinuationImpl (com.cloudbees.groovy.cps.impl)
eval:21, ConstantBlock (com.cloudbees.groovy.cps.impl)
step:83, Next (com.cloudbees.groovy.cps)
call:174, Continuable$1 (com.cloudbees.groovy.cps)
call:163, Continuable$1 (com.cloudbees.groovy.cps)
use:136, GroovyCategorySupport$ThreadCategoryInfo (org.codehaus.groovy.runtime)
use:275, GroovyCategorySupport (org.codehaus.groovy.runtime)
run0:163, Continuable (com.cloudbees.groovy.cps)
access$001:18, SandboxContinuable (org.jenkinsci.plugins.workflow.cps)
run0:51, SandboxContinuable (org.jenkinsci.plugins.workflow.cps)
runNextChunk:187, CpsThread (org.jenkinsci.plugins.workflow.cps)
run:420, CpsThreadGroup (org.jenkinsci.plugins.workflow.cps)
access$400:95, CpsThreadGroup (org.jenkinsci.plugins.workflow.cps)
call:330, CpsThreadGroup$2 (org.jenkinsci.plugins.workflow.cps)
call:294, CpsThreadGroup$2 (org.jenkinsci.plugins.workflow.cps)
call:67, CpsVmExecutorService$2 (org.jenkinsci.plugins.workflow.cps)
run:264, FutureTask (java.util.concurrent)
run:139, SingleLaneExecutorService$1 (hudson.remoting)
run:28, ContextResettingExecutorService$1 (jenkins.util)
run:68, ImpersonatingExecutorService$1 (jenkins.security)
call:515, Executors$RunnableAdapter (java.util.concurrent)
run:264, FutureTask (java.util.concurrent)
runWorker:1128, ThreadPoolExecutor (java.util.concurrent)
run:628, ThreadPoolExecutor$Worker (java.util.concurrent)
run:829, Thread (java.lang)

java.beans.Introspector#flushFromCaches is later called on that class, but to no avail:

cleanUpLoader:1300, CpsFlowExecution (org.jenkinsci.plugins.workflow.cps)
cleanUpHeap:1269, CpsFlowExecution (org.jenkinsci.plugins.workflow.cps)
run:464, CpsThreadGroup (org.jenkinsci.plugins.workflow.cps)
access$400:95, CpsThreadGroup (org.jenkinsci.plugins.workflow.cps)
call:330, CpsThreadGroup$2 (org.jenkinsci.plugins.workflow.cps)
call:294, CpsThreadGroup$2 (org.jenkinsci.plugins.workflow.cps)
call:67, CpsVmExecutorService$2 (org.jenkinsci.plugins.workflow.cps)
run:264, FutureTask (java.util.concurrent)
run:139, SingleLaneExecutorService$1 (hudson.remoting)
run:28, ContextResettingExecutorService$1 (jenkins.util)
run:68, ImpersonatingExecutorService$1 (jenkins.security)
call:515, Executors$RunnableAdapter (java.util.concurrent)
run:264, FutureTask (java.util.concurrent)
runWorker:1128, ThreadPoolExecutor (java.util.concurrent)
run:628, ThreadPoolExecutor$Worker (java.util.concurrent)
run:829, Thread (java.lang)
timja commented 2 years ago

basil:

Short-term solution

jenkinsci/workflow-cps-plugin#543 has been released in 2705.v0449852ee36f and jenkinsci/script-security-plugin#420 has been released in 1175.v4b_d517d6db_f0, but upgrading to these releases is not sufficient: you also need jenkinsci/jenkins#6597 (planned for eventual release in weeklies and backport to LTS) or to start Jenkins with --add-opens java.desktop/com.sun.beans.introspect=ALL-UNNAMED in your Java options.

jochenafuerbacher nsewardep phwolf shoelzle Can you please test the short-term solution and confirm that the issue has been resolved? Thanks!

Long-term solution

openjdk/jdk11u-dev#1106 and openjdk/jdk11u-dev#1103 have been proposed to backport the relevant upstream commits from OpenJDK 17 to OpenJDK 11.

timja commented 2 years ago

[Originally duplicated by: JENKINS-66491]

timja commented 2 years ago

[Epic: JENKINS-67688]