jakartaee / jaxb-tck

Other
4 stars 16 forks source link

Seeing OOM failures in agent process during JAXB test run #4

Closed scottmarlow closed 5 years ago

scottmarlow commented 5 years ago

I'm seeing OOM failures in agent process. I'll investigate further, wondering if we are leaking com.sun.jaxb_tck.lib.DirsClassLoader instances.

The first OOM that I see is:

Java.lang.OutOfMemoryError: Compressed class space at java.lang.ClassLoader.defineClass1(Native Method) at java.lang.ClassLoader.defineClass(ClassLoader.java:763) at java.lang.ClassLoader.defineClass(ClassLoader.java:642) at com.sun.jaxb_tck.lib.DirsClassLoader.findClass(DirsClassLoader.java:115) at com.sun.jaxb_tck.lib.DirsClassLoader.loadClass(DirsClassLoader.java:83) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) at com.sun.jaxb_tck.lib.ExecTestSameJVMCmd.run(ExecTestSameJVMCmd.java:82) at com.sun.javatest.agent.Agent$Task$CommandExecutor$2.run(Agent.java:899) at java.lang.Thread.run(Thread.java:748) Exception in thread "Agent0" java.lang.NullPointerException at com.sun.javatest.agent.Agent$Task.sendStatus(Agent.java:1044) at com.sun.javatest.agent.Agent$Task.handleRequest(Agent.java:661) at com.sun.javatest.agent.Agent.handleRequestsUntilClosed(Agent.java:417) at com.sun.javatest.agent.Agent.access$000(Agent.java:61) at com.sun.javatest.agent.Agent$1.run(Agent.java:283) at java.lang.Thread.run(Thread.java:748)

scottmarlow commented 5 years ago

Here is how I am starting the java agent + tests:

java -Xmx4096m -Xms128m -XX:+AggressiveOpts -XX:+UseBiasedLocking -XX:+UseFastAccessorMethods -Djava.endorsed.dirs=$TCK_HOME/endorsed \ -classpath $TCK_HOME/lib/javatest.jar:$TCK_HOME/lib/jtlegacy.jar:$TCK_HOME/classes:$TCK_HOME/endorsed/jaxb-impl.jar:$TCK_HOME/classes:$TCK_HOME/client/jaxb-jxc.jar:$TCK_HOME/client/jaxb-xjc.jar:$TCK_HOME/endorsed/jboss-jaxb-api_2.3_spec.jar:$TCK_HOME/endorsed/relaxngDatatype.jar:$TCK_HOME/endorsed/istack-commons-runtime.jar:$TCK_HOME/endorsed/istack-commons-tools.jar:$TCK_HOME/endorsed/txw2.jar:$TCK_HOME/client/codemodel.jar:$TCK_HOME/client/xsom.jar:$TCK_HOME/client/rngom.jar \ -Djava.security.policy=$TCK_HOME/lib/tck.policy \ com.sun.javatest.agent.AgentMain \ -passive 1>agent.log 2>agent-err.log &

java -Xmx512m -Xms128m -Djava.endorsed.dirs=$TCK_HOME/endorsed -jar $TCK_HOME/lib/javatest.jar \ -verbose:stop,progress -testSuite $TCK_HOME \ -workdir -create $WORK_DIR/work_directory \ -config ./default_configuration.jti \ -concurrency 2 -timeoutFactor 5 \ -runtests

Any suggestions?

scottmarlow commented 5 years ago

Seems that every call in is on a different thread, which is why we cache several classloaders until we run out of memory:

"CommandExecutor executeThread for command args: [-loadDirs, /home/smarlow/work/jakarta/runjaxbtck/JAXB-TCK-2.3/classes:/home/smarlow/work/jakarta/runjaxbtck/work_directory/classes/bind/binder, javasoft.sqe.tests.api.javax.xml.bind.JAXBValidationCheckerTest, -TestURL, file:/home/smarlow/work/jakarta/runjaxbtck/JAXB-TCK-2.3/tests/api/javax_xml/bind/Binder/Binder.xsd.html#Binder.xml, -xml, Binder.xml, -package, javasoft.sqe.tests.bind.binder, -schema, Binder.xsd]@3595" prio=3 tid=0x148 nid=NA runnable java.lang.Thread.State: RUNNABLE at com.sun.jaxb_tck.lib.DirsClassLoader.newInstance(DirsClassLoader.java:60)

  • locked <0x28d> (a java.lang.Class) at com.sun.jaxb_tck.lib.ExecTestSameJVMCmd.run(ExecTestSameJVMCmd.java:62) at com.sun.javatest.agent.Agent$Task$CommandExecutor$2.run(Agent.java:899) at java.lang.Thread.run(Thread.java:748)
scottmarlow commented 5 years ago

For my current test running, it might be a better choice to cache the classloader based on thread name, instead of thread instance, since the thread name never changes (it seems) but the test harness seems to create a new thread for each test.

I wonder if this is due to a test harness change.

scottmarlow commented 5 years ago

Based on my previous comment, since the "DirsClassLoader.threads" cache contents is never reused, we may as well drop the "DirsClassLoader.threads" cache, to avoid the OOM. I'll try a code change for that.

scottmarlow commented 5 years ago

So far, the change is looking good for my local test run (Pass: 2,563 Fail: 0 Error: 0 Not-Run: 22,063)...

Also, will share a suggestion from Brian Stansberry:

If for some reason the caching is actually needed, converting it to a ThreadLocal<WeakReference> might work. If a strong ref cache is needed

edbratt commented 5 years ago

If you haven't already looked, you can use the Jenkins CI as a reference. The most recent run (April. 23) shows 100% success, no failures, no skips: https://ci.eclipse.org/jakartaee-tck/job/jaxb-tck/job/master/24/testReport/ The console output is awfully long, but it may provide some reference data for you to compare against: https://ci.eclipse.org/jakartaee-tck/blue/rest/organizations/jenkins/pipelines/jaxb-tck/branches/master/runs/24/nodes/28/steps/31/log/?start=0

scottmarlow commented 5 years ago

Thanks @edbratt! Thanks @ankitoracle! I verified yesterday that the leak was resolved with this change. If we need to bring caching back in, I'd be happy to help with that or any problems with today's build/run.

I did look through the ^ ci.eclipse.org jaxb-tck job and don't yet see all involved steps (e.g. didn't see where the com.sun.javatest.agent.AgentMain is started). Will close since the pr is merged, thanks again!

scottmarlow commented 5 years ago

@ankitoracle is it okay if I start a https://ci.eclipse.org/jakartaee-tck/job/jaxb-tck/job/master/ run with this change? If that looks good, can I also do a https://ci.eclipse.org/jakartaee-tck/job/jaxb-tck-nightly-build-run run?

scottmarlow commented 5 years ago

Reopening until we see a successful CI job run...

ankitoracle commented 5 years ago

@scottmarlow You can trigger a run.We run the TCK in multi jvm mode without an agent in this jenkins job

scottmarlow commented 5 years ago

The jaxb-tck job run was successful, it took 6 hours 32 minutes (previous runs were: 6 h 26min, 6 h 30min, 6h 23min), so performance seems fine to me. Locally, my Jenkins test run took 6 hours 25 minutes (using the settings mentioned earlier above). Will now close this issue.