ben-manes closed this issue 8 years ago
Thanks for the report, but I really don't know how to fix the issue. Maybe you could take an old snapshot from https://oss.sonatype.org/content/repositories/snapshots/org/testng/testng/6.9.13-SNAPSHOT/, use it as a local dependency, and see whether Travis still complains. Or you could build your own testng jar from previous changesets to find which change broke it.
Let me know if we can help you.
I'll try to narrow it down to a patch release. It took a while to isolate the error to the dependency upgrade. I was hoping a change might jump out at you, so I took a shot and reported it before diving in deeper.
It seems to pass with jitpack.io & 8b56810, but fails on 6.9.13. So the change might have occurred somewhere in between.
What about later commits like a4edeaaf09 ? https://jitpack.io/com/github/cbeust/testng/a4edeaaf09/
I've had a lot of trouble with missing methods in the testng jars, even with the release tags, despite the code being correct. So it's been a bit painful to narrow down due to that quirk.
Of the differences in the range, ab71c9e looks suspicious. I wonder if this is causing the test report and GC to become exhausted?
There are well over 1M test executions due to heavy usage of parameterized testing. A data provider inspects a specification annotation to construct the cartesian set of possible input configurations. After the test completes successfully, a listener validates the internal state and fails the test if it is corrupted. This approach has been very helpful, but it does generate a lot of garbage and report verbosity.
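To illustrate how a handful of options multiplies into that many executions, here is a minimal sketch of a cartesian expansion. It is not the project's actual data provider: `CartesianSketch`, `product`, and the option names are hypothetical, and the real provider reads its dimensions from an annotation rather than from hard-coded lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: expands lists of options into every combination. */
final class CartesianSketch {

  /** Returns one Object[] per combination, taking one option from each dimension. */
  static List<Object[]> product(List<List<?>> dimensions) {
    List<Object[]> rows = new ArrayList<>();
    rows.add(new Object[0]); // start with a single empty combination
    for (List<?> options : dimensions) {
      List<Object[]> next = new ArrayList<>();
      for (Object[] prefix : rows) {
        for (Object option : options) {
          // Extend the existing combination by one more column.
          Object[] row = Arrays.copyOf(prefix, prefix.length + 1);
          row[prefix.length] = option;
          next.add(row);
        }
      }
      rows = next;
    }
    return rows;
  }

  public static void main(String[] args) {
    // Assumed example dimensions, not the real specification.
    List<List<?>> dims = List.of(
        List.of("STRONG", "WEAK"),
        List.of("DISABLED", "SIZE", "WEIGHT"),
        List.of(true, false));
    System.out.println(product(dims).size() + " configurations"); // 2 * 3 * 2 = 12
  }
}
```

Each row is a parameter array for one test invocation, so every row that the framework keeps alive after the run adds to the retained heap.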
As described in #584, the report retains the ITestResult#getParameters(), which prevents them from being GC'd until after the task has completed. To resolve this, the parameters are replaced with their toString() value.
I suspect those changes allowed the reporter to retain the parameters and disallowed my optimization. It would result in the observed GC exhaustion and long pauses, and only be visible in a memory constrained environment like Travis.
Could you try to disable default listeners? https://docs.gradle.org/current/javadoc/org/gradle/api/tasks/testing/testng/TestNGOptions.html#setUseDefaultListeners(boolean)
Default is off and I don't configure them further.
Sorry, I referenced the wrong commit. I meant f92a87243ad299823a343ad9b40b1573563fafcf.
I also disagree with #447 and consider it to be bad code. I think that request should have been rejected instead of making changes to copy and retain the parameters.
Now that I understand what's going on, this is easy to reproduce locally by setting maxHeapSize = '512m' in code_quality.gradle. Then running a task like strongKeysAndStrongValuesAsyncCaffeineTest runs into GC problems. Somehow I need to be able to pass only the params' toString() to the reporter, instead of it retaining all of them in memory.
@ben-manes Could you try with compile 'com.github.juherr:testng:162ae8f923' and let me know if it fixes the problem?
Could you provide me the steps to be able to reproduce the issue on my computer ?
Sure! I can easily show you on my project. It might be hard to write a good integration test since we're trying to force GC exhaustion.
$ git clone git@github.com:ben-manes/caffeine.git
$ cd caffeine
$ git checkout juherr
$ ./gradlew :caffeine:strongKeysAndStrongValuesAsyncCaffeineTest
You'll see it begin to slow down at around 45,000 test executions due to visible GC pauses.
It looks like the replacement of params is still working. The retained reference is not located in the test result or the reporter.
We should dig deeper to understand where the parameters are being retained.
Thanks for looking. A profiler might be able to identify the roots. I have a hook for attaching a debugger using -Ddebug, so I may have used that when I first identified the need to replace the params. When you use that flag, it will build and pause before executing the test task.
A profiler might be able to identify the roots.
That's a bit outside my area of competence, and I don't have enough free time to start learning it. Could you check and tell me if you find something?
I did the following,
That provides the single root of,
this - value: com.github.benmanes.caffeine.cache.testing.CacheContext #8
<- [1] - class: java.lang.Object[], value: com.github.benmanes.caffeine.cache.testing.CacheContext #8
<- m_parameters - class: org.testng.internal.InvokedMethod, value: java.lang.Object[] #46
<- [44843] - class: java.lang.Object[], value: org.testng.internal.InvokedMethod #9
<- elementData - class: java.util.ArrayList, value: java.lang.Object[] #15563
<- m_invokedMethods (Java frame) - class: org.testng.TestRunner, value: java.util.ArrayList #18020
So it appears that InvokedMethod#m_parameters is retaining the results.
@juherr
The issue is now pretty clear if you read from Invoker#invokeMethod. First it passes the params[] to TestResult#setParameters(params), which was modified to copy the incoming array. Then it provides the params[] source array to new InvokedMethod(...). The InvokedMethod is stored on the TestRunner, which retains the information for running one test class.
In previous versions the params[] instance was shared between all consumers. This allowed the listener to modify its contents after execution to eagerly stringify them. Now that different instances are held, this optimization cannot be performed and available memory is exhausted.
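The effect of that copy can be shown without TestNG at all. The following is a minimal sketch under stated assumptions: `SharedArraySketch`, `stringify`, and `retainsOriginals` are hypothetical names, and the two arrays merely stand in for the array held by the test result and the one held by InvokedMethod.

```java
import java.util.Arrays;

/** Hypothetical demo of why copying the params array defeats in-place stringification. */
final class SharedArraySketch {

  /** Replaces each parameter with its string form so the original object can be collected. */
  static void stringify(Object[] params) {
    for (int i = 0; i < params.length; i++) {
      params[i] = String.valueOf(params[i]);
    }
  }

  /** Returns true if the array still references any non-String (original) parameter. */
  static boolean retainsOriginals(Object[] held) {
    return Arrays.stream(held).anyMatch(p -> !(p instanceof String));
  }

  public static void main(String[] args) {
    // Stand-in for the array passed to the invoked-method record.
    Object[] source = { new StringBuilder("heavy context"), 42 };
    // Stand-in for the defensive copy now held by the test result.
    Object[] copied = source.clone();

    // The listener can only reach the result's copy.
    stringify(copied);

    System.out.println("result retains originals: " + retainsOriginals(copied)); // false
    System.out.println("source retains originals: " + retainsOriginals(source)); // true
  }
}
```

When both holders shared one array, the single `stringify` call released the originals everywhere. With a copy in between, the source array keeps every original parameter reachable.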
Ideally TestNG would try to discard the parameters aggressively and replace them with (hopefully) a more compact form. This has to occur after the test results are processed so that they can be inspected.
I would argue that both my expectation and #447 relied on poor assumptions. In #447 the author wants the framework to work around his obviously poor code. The fix of performing a Cloneable#clone() is error prone, as only core JDK objects implement this. This may be a deep or shallow clone, and he makes many invalid assumptions about how mutations can be managed. His issue will still occur if any type is used that is not deeply cloneable. TestNG is now less obvious and predictable, and the issue should have been closed as "won't fix" with feedback to teach the author best practices. It was a newbie mistake.
In my case, the poor assumption is a workaround that I'd be fine removing if this is resolved. It's a stress test of this framework that demonstrates where it fails and tries to fix it to move forward. The build is already split into multiple runs of the test class (via different configurations) to reduce the stress placed on TestNG. This may also no longer be needed, as I had not identified the problem at the time and only saw Travis kill the job for lack of console feedback.
Thanks for the analysis.
The InvokedMethod is stored on the TestRunner which retains the information for running one test class.
If I understand you correctly, the issue will be fixed if we remove the retained reference from InvokedMethod (which is totally useless there!!)?
It appears the m_parameters in InvokedMethod is used only for the toString(), so I agree it's probably useless. It looks like it could pull them from m_testResult instead of holding a direct reference.
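A minimal sketch of that idea follows. It is not TestNG's code and not necessarily the fix that was applied: `DelegatingToStringSketch`, `Result`, and `Invoked` are hypothetical stand-ins for the test result and InvokedMethod, shown only to illustrate reading the parameters through the result.

```java
import java.util.Arrays;

/** Hypothetical sketch: the invoked-method record reads params through the result. */
final class DelegatingToStringSketch {

  /** Stand-in for a test result whose parameters a listener may replace. */
  static final class Result {
    private Object[] parameters;
    Result(Object[] parameters) { this.parameters = parameters; }
    Object[] getParameters() { return parameters; }
    void setParameters(Object[] parameters) { this.parameters = parameters; }
  }

  /** Stand-in for the invoked-method record; it keeps no parameter array of its own. */
  static final class Invoked {
    private final String methodName;
    private final Result result;
    Invoked(String methodName, Result result) {
      this.methodName = methodName;
      this.result = result;
    }
    @Override public String toString() {
      // Always reflects whatever the result currently holds.
      return methodName + Arrays.toString(result.getParameters());
    }
  }

  public static void main(String[] args) {
    Result result = new Result(new Object[] { new StringBuilder("ctx"), 1 });
    Invoked invoked = new Invoked("put", result);
    // A listener swaps in compact string forms after the test has run.
    result.setParameters(new Object[] { "ctx", "1" });
    System.out.println(invoked); // put[ctx, 1]
  }
}
```

Because the record holds only the result, replacing the result's parameters leaves no second reference that keeps the original objects alive.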
Can you make a build and I'll give it a run?
For some reason jitpack sometimes drops ITestNGMethod#getConstructorOrMethod() and fails to build. No idea why, but it causes the compile to fail.
(See results)
Passed :)
Thanks to TestNG's flexibility, the Travis job runs well over 1 million executions via the data provider composing the cartesian set of possibilities on 380 @Test methods. That's a 1h22m job and really helpful for catching mistakes.
Thanks for the help and fixing this issue.
And thank you for looking into this @ben-manes, very much appreciated!
In 6.9.12 the build passes. When the testng dependency is updated to 6.9.13.8 the build fails due to a timeout.
The failure is due to a timeout. In the above an Awaitility assertion forces the tests to complete unsuccessfully. If that condition is removed then Travis will halt due to no console output after 10 minutes.
This failure does not occur locally. That suggests there is a race condition causing some part of the framework to block indefinitely. The common case is a data race, e.g. counting too many items and waiting for a completion notification that won't occur. Or it could be a swallowed exception that never notifies the waiting thread. Unfortunately, without a thread dump it's not clear what is blocked.