6.9.13.8 timeout on Travis

ben-manes commented 8 years ago

In 6.9.12 the build passes. When the testng dependency is updated to 6.9.13.8 the build fails due to a timeout.

The failure is due to a timeout. In the above an Awaitility assertion forces the tests to complete unsuccessfully. If that condition is removed then Travis will halt due to no console output after 10 minutes.

This failure does not occur locally. That indicates that there is a race condition causing some part of the framework to block indefinitely. The common case is a data race, e.g. counting too many items and waiting for a completion notification that won't occur. Or it could be a swallowed exception without notifying the waiting thread. Unfortunately without a thread dump its not clear what is blocked.

juherr commented 8 years ago

Thanks for the report. But I really don't know how to fix the issue. Maybe you could take one old snapshot from https://oss.sonatype.org/content/repositories/snapshots/org/testng/testng/6.9.13-SNAPSHOT/ and use it as local dependency, and try if travis is still complaining. Maybe you could build your own testng jar from previous changesets in order to find which changes broke.

Let me know if we can help you.

ben-manes commented 8 years ago

I'll try to narrow it down to a patch release. It took a while to isolate the error to the dependency upgrade. I was hoping a change might jump out at you, so I took a shot and reported it before diving in deeper.

ben-manes commented 8 years ago

It seems to pass with jitpack.io & 8b56810, but fails on 6.9.13. So might be somewhere between there that the change occurred.

juherr commented 8 years ago

What about later commits like a4edeaaf09 ? https://jitpack.io/com/github/cbeust/testng/a4edeaaf09/

ben-manes commented 8 years ago

I've had a lot of trouble with missing methods in the testng jars, even with the release tags, despite the code being correct. So it's been a bit painful to narrow down due to that quirk.

ben-manes commented 8 years ago

Unfortunately a4edeaa also failed.

ben-manes commented 8 years ago

Of the differences in the range, ab71c9e looks suspicious. I wonder if this is causing the test report and GC to become exhausted?

There are well over 1M test executions due to heavy usage of parameterized testing. A data provider inspects a specification annotation to construct that cartesian set of possible input configurations. Afterward the test completes successfully, a listener validates the internal state and fails the test if corrupted. This approach has been very helpful, but does generate a lot of garbage and report verbosity.

As described in #584, the report retains the ITestResult#getParameters() which avoids GC'ing until after the task has completed. To resolve this the parameters are replaced with the toString() value.

I suspect those changes allowed the reporter to retain the parameters and disallowed my optimization. It would result in the observed GC exhaustion and long pauses, and only be visible in a memory constrained environment like Travis.

juherr commented 8 years ago

Could you try to disable default listeners? https://docs.gradle.org/current/javadoc/org/gradle/api/tasks/testing/testng/TestNGOptions.html#setUseDefaultListeners(boolean)

ben-manes commented 8 years ago

Default is off and I don't configure them further.

ben-manes commented 8 years ago

Sorry, I referenced the wrong commit. I meant f92a87243ad299823a343ad9b40b1573563fafcf.

ben-manes commented 8 years ago

I also disagree with #447 and consider it to be bad code. I think that request should have been rejected instead of making changes to copy and retain the parameters.

ben-manes commented 8 years ago

Now that I understand wha's going on, this is easy to reproduce locally by setting maxHeapSize = '512m' in code_quality.gradle. Then running a task like strongKeysAndStrongValuesAsyncCaffeineTest runs into GC problems. Somehow I need to be able to pass to the reporter only the param toString() instead of it retaining all of them in-memory.

juherr commented 8 years ago

@ben-manes Could you try with compile 'com.github.juherr:testng:162ae8f923' and let me know if the fix the problem?

ben-manes commented 8 years ago

Unfortunately it failed.

juherr commented 8 years ago

Could you provide me the steps to be able to reproduce the issue on my computer ?

ben-manes commented 8 years ago

Sure! I can easily show you on my project. It might be hard to write a good integration test since we're trying to force GC exhaustion.

$ git clone git@github.com:ben-manes/caffeine.git
$ git checkout juherr
$ ./gradlew :caffeine:strongKeysAndStrongValuesAsyncCaffeineTest

You'll see it begin to slow down at around 45,000 test executions due to visible GC pauses.

juherr commented 8 years ago

It looks the replacement of params is still working. The retain is not located in the test result or reporter.

We should go deeper to understand where is the retain.

ben-manes commented 8 years ago

Thanks for looking. A profiler might be able to identify the roots. I have a hook for attaching a debugger using -Ddebug, so I may have used that when I first identified needing to replace the params. When you use that flag then it will build and pause before executing the test task.

juherr commented 8 years ago

A profiler might be able to identify the roots.

It's a bit out of my competencies and I have not enough free time to start learning it. Could you check and tell me if you find something?

ben-manes commented 8 years ago

I did the following,

Ran the gradle task until it began to slow down (~50k)
jmap -dump:live,format=b,file=/tmp/dump.hrpof
Opened the dump in jvisualvm
Found a parameter in class view (CacheContext is always passed in)
Viewed instances
Viewed nearest GC roots

That provides the single root of,

this     - value: com.github.benmanes.caffeine.cache.testing.CacheContext #8
 <- [1]     - class: java.lang.Object[], value: com.github.benmanes.caffeine.cache.testing.CacheContext #8
  <- m_parameters     - class: org.testng.internal.InvokedMethod, value: java.lang.Object[] #46
   <- [44843]     - class: java.lang.Object[], value: org.testng.internal.InvokedMethod #9
    <- elementData     - class: java.util.ArrayList, value: java.lang.Object[] #15563
     <- m_invokedMethods (Java frame)     - class: org.testng.TestRunner, value: java.util.ArrayList #18020

So it appears that InvokedMethod#m_parameters is retaining the results.

ben-manes commented 8 years ago

@juherr

The issue is now pretty clear if you read from Invoker#invokeMethod. First it passes the params[] to TestResult#setParameters(params), which was modified to copy the incoming array. Then it provides the params[] source array to new InvokedMethod(...). The InvokedMethod is stored on the TestRunner which retains the information for running one test class.

In previous versions the params[] instance was shared between all consumers. This allowed the listener to modify its contents after execution to eagerly stringify them. Now that different instances are held, this optimization cannot be performed and available memory is exhausted.

Ideally TestNG would try to discard the parameters aggressively and replace them with (hopefully) a more compact form. This has to occur after the test results are processed so that they can be inspected.

I would argue both my expectation and the #447 was relying on poor assumptions. In #447 the author wants the framework to workaround his obviously poor code. The fix of performing a Cloneable#clone() is error prone, as only core JDK objects implement this. This may be a deep or shallow clone, and he makes many invalid assumptions for how mutations can be managed. His issue will still occur if any type is used that is not deeply cloneable. The TestNG is now less obvious and predictable, and the issue should have been closed as "not fix" with feedback to teach the author best practices. Its was a newbie mistake.

In my case, the poor assumption is a workaround that I'd be fine removing if resolved. Its a stress test of this framework, demonstrates where it fails, and tries to fix it to move forward. The build is already split into multiple runs of the test class (via different configurations) to reduce the stress placed on TestNG. This may also no longer be needed, as I had not then identified the problem and only saw Travis kill the job as no console feedback.

juherr commented 8 years ago

Thanks for the analyze.

The InvokedMethod is stored on the TestRunner which retains the information for running one test class.

If I understand you well, the issue will be fixed if we remove the retain from InvokedMethod (which is totally useless there!!)?

ben-manes commented 8 years ago

It appears the m_parameters in InvokedMethod is used only for the toString(), so I agree its probably useless. It looks like it could instead pull them from m_testResult instead of a direct reference.

Can you make a build and I'll give it a run?

juherr commented 8 years ago

You can try with https://github.com/juherr/testng/commit/5e284802a0c89261058ea020681c9d3ad229e582

ben-manes commented 8 years ago

For some reason jitpack sometimes drops ITestNGMethod#getConstructorOrMethod() and fails to build. No idea why, but causes the compile to fail.

(See results)

ben-manes commented 8 years ago

Passed locally. Now running on travis

ben-manes commented 8 years ago

Passed :)

Thanks to TestNG's flexibility, the travis job runs well over 1 million executions via the data provider composing the cartesian set of possibilities on 380 @Test methods. That's 1h22m job and really helpful to catch mistakes.

Thanks for the help and fixing this issue.

cbeust commented 8 years ago

And thank you for looking into this @ben-manes, very much appreciated!

testng-team / testng

6.9.13.8 timeout on Travis #1173