testng-team / testng

TestNG testing framework
https://testng.org
Apache License 2.0
1.99k stars 1.02k forks source link

Severe thread contention while running large test with parallel methods #772

Closed benlamonica closed 7 years ago

benlamonica commented 9 years ago

While running a large test (600,000+ test methods) I encountered a deadlock situation.

All of the threads are blocked with this stack trace:

pool-102-thread-8 [227] (BLOCKED)
pool-102-thread-7 [226] (BLOCKED)
pool-102-thread-6 [225] (BLOCKED)
pool-102-thread-5 [224] (BLOCKED)
pool-102-thread-4 [223] (BLOCKED)
pool-102-thread-3 [222] (BLOCKED)
pool-102-thread-2 [221] (BLOCKED)
pool-102-thread-16 [235] (BLOCKED)
pool-102-thread-15 [234] (BLOCKED)
pool-102-thread-14 [233] (BLOCKED)
pool-102-thread-13 [232] (BLOCKED)
pool-102-thread-12 [231] (BLOCKED)
pool-102-thread-11 [230] (BLOCKED)
pool-102-thread-10 [229] (BLOCKED)
pool-102-thread-1 [220] (BLOCKED)
pool-102-thread-9 [228] (BLOCKED)
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus line: 85 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.afterExecute line: 94 
   java.util.concurrent.ThreadPoolExecutor.runWorker line: 1150 
   java.util.concurrent.ThreadPoolExecutor$Worker.run line: 617 
   java.lang.Thread.run line: 745 

And the thread that is holding the mutex:

pool-1-thread-1 [23] (RUNNABLE)
   java.util.HashMap$TreeNode.find line: 1865 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.find line: 1861 
   java.util.HashMap$TreeNode.putTreeVal line: 1979 
   java.util.HashMap.putVal line: 637 
   java.util.HashMap.put line: 611 
   java.util.HashSet.add line: 219 
   org.testng.internal.DynamicGraph.setStatus line: 117 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus line: 86 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.runNodes line: 66 
   org.testng.internal.thread.graph.GraphThreadPoolExecutor.run line: 54 
   org.testng.TestRunner.privateRun line: 745 
   org.testng.TestRunner.run line: 621 
   org.testng.SuiteRunner.runTest line: 357 
   org.testng.SuiteRunner.runSequentially line: 352 
   org.testng.SuiteRunner.privateRun line: 310 
   org.testng.SuiteRunner.run line: 259 
   org.testng.SuiteRunnerWorker.runSuite line: 52 
   org.testng.SuiteRunnerWorker.run line: 86 
   org.testng.TestNG.runSuitesSequentially line: 1199 
   org.testng.TestNG.runSuitesLocally line: 1124 
   org.testng.TestNG.run line: 1032 
   org.apache.maven.surefire.testng.TestNGExecutor.run line: 295 
   org.apache.maven.surefire.testng.TestNGXmlTestSuite.execute line: 84 
   org.apache.maven.surefire.testng.TestNGProvider.invoke line: 90 
   sun.reflect.NativeMethodAccessorImpl.invoke0 line: not available [native method]
   sun.reflect.NativeMethodAccessorImpl.invoke line: 62 
   sun.reflect.DelegatingMethodAccessorImpl.invoke line: 43 
   java.lang.reflect.Method.invoke line: 497 
   org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray2 line: 208 
   org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke line: 155 
   org.apache.maven.surefire.booter.ProviderFactory.invokeProvider line: 81 
   org.apache.maven.plugin.surefire.InPluginVMSurefireStarter.runSuitesInProcess line: 82 
   org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider line: 974 
   org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked line: 824 
   org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute line: 722 
   org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo line: 132 
   org.apache.maven.lifecycle.internal.MojoExecutor.execute line: 208 
   org.apache.maven.lifecycle.internal.MojoExecutor.execute line: 153 
   org.apache.maven.lifecycle.internal.MojoExecutor.execute line: 145 
   org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject line: 116 
   org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject line: 80 
   org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build line: 51 
   org.apache.maven.lifecycle.internal.LifecycleStarter.execute line: 120 
   org.apache.maven.DefaultMaven.doExecute line: 347 
   org.apache.maven.DefaultMaven.execute line: 154 
   org.jvnet.hudson.maven3.launcher.Maven32Launcher.main line: 132 
   sun.reflect.NativeMethodAccessorImpl.invoke0 line: not available [native method]
   sun.reflect.NativeMethodAccessorImpl.invoke line: 62 
   sun.reflect.DelegatingMethodAccessorImpl.invoke line: 43 
   java.lang.reflect.Method.invoke line: 497 
   org.codehaus.plexus.classworlds.launcher.Launcher.launchStandard line: 330 
   org.codehaus.plexus.classworlds.launcher.Launcher.launch line: 238 
   jenkins.maven3.agent.Maven32Main.launch line: 181 
   sun.reflect.NativeMethodAccessorImpl.invoke0 line: not available [native method]
   sun.reflect.NativeMethodAccessorImpl.invoke line: 62 
   sun.reflect.DelegatingMethodAccessorImpl.invoke line: 43 
   java.lang.reflect.Method.invoke line: 497 
   hudson.maven.Maven3Builder.call line: 136 
   hudson.maven.Maven3Builder.call line: 71 
   hudson.remoting.UserRequest.perform line: 118 
   hudson.remoting.UserRequest.perform line: 48 
   hudson.remoting.Request$2.run line: 326 
   hudson.remoting.InterceptingExecutorService$1.call line: 72 
   java.util.concurrent.FutureTask.run line: 266 
   java.util.concurrent.ThreadPoolExecutor.runWorker line: 1142 
   java.util.concurrent.ThreadPoolExecutor$Worker.run line: 617 
   java.lang.Thread.run line: 745 
cbeust commented 9 years ago

I took a look at the source and I'm not sure this is the correct interpretation. m_graph is synchronized in three places in this class and one of them is setStatus. As far as I can tell, this method is always called outside of a synchronized block so that if thread A calls it while B is inside the synchronized block in setStatus, this should not result in a deadlock.

Do you have the full thread dump?

Any way you can reproduce this on a small project?

benlamonica commented 9 years ago

I think you're right, it is just taking a very long time for it to complete the test, and the threads are blocked because another thread is busy inserting into the graph.

I looked at the log again and I see that it's only partially done. I'm not sure why it's so slow though, when I ran this in JUnit, the tests execute very quickly. (Full test runs in 1-2 hours, in testng, it's been running for 11+ hours now). 272000 out of 668592 tests run

So this is not a bug, feel free to close the issue. I'm sure it's something with my test.

Thanks for the quick response.

cbeust commented 9 years ago

Thanks for the clarification.

Not sure I want to close this just yet, I'm not convinced yet that this is not a bug in TestNG, especially if you say that the same test runs fast in JUnit. Can you double check this? I'm a bit confused, since if the same tests are running in JUnit, then they are all independent (since JUnit doesn't support dependent tests) which means that the DynamicGraph use in TestNG should be minimal (this graph is only used when you have dependent test methods).

Can you report back on what you find? I want to make sure there's not some hidden slow algorithm in TestNG.

benlamonica commented 9 years ago

Let me start the jvm with a code profiler and run again.

Here are the particulars of the test:

I'll do a little more research and get back to you.

benlamonica commented 9 years ago

Here is the execution profiler.

Stack Trace Sample Count    Percentage(%)
java.lang.Thread.run()  1,037   98.95
   java.util.concurrent.ThreadPoolExecutor$Worker.run() 639 60.973
      java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)  639 60.973
         org.testng.internal.thread.graph.GraphThreadPoolExecutor.afterExecute(Runnable, Throwable) 406 38.74
            org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus(IWorker, DynamicGraph$Status)    406 38.74
               org.testng.internal.DynamicGraph.setStatus(Object, DynamicGraph$Status)  398 37.977
                  java.util.HashSet.add(Object) 298 28.435
                     java.util.HashMap.put(Object, Object)  298 28.435
                        java.util.HashMap.putVal(int, Object, Object, boolean, boolean) 298 28.435
                  org.testng.internal.DynamicGraph.removeNode(Object)   100 9.542
                     java.util.HashSet.remove(Object)   100 9.542
               java.util.ArrayList.remove(Object)   8   0.763
         org.testng.internal.TestMethodWorker.run() 227 21.66
            org.testng.internal.TestMethodWorker.invokeTestMethods(ITestNGMethod, Object, ITestContext) 227 21.66
               org.testng.internal.Invoker.invokeTestMethods(ITestNGMethod, XmlSuite, Map, ConfigurationGroupMethods, Object, ITestContext) 227 21.66
                  org.testng.internal.Invoker.invokeTestMethod(Object, ITestNGMethod, Object[], int, XmlSuite, Map, ITestClass, ITestNGMethod[], ITestNGMethod[], ConfigurationGroupMethods, Invoker$FailureContext)    226 21.565
                     org.testng.internal.Invoker.invokeMethod(Object, ITestNGMethod, Object[], int, XmlSuite, Map, ITestClass, ITestNGMethod[], ITestNGMethod[], ConfigurationGroupMethods, Invoker$FailureContext) 226 21.565
                        org.testng.internal.MethodInvocationHelper.invokeMethod(Method, Object, Object[])   224 21.374
                           java.lang.reflect.Method.invoke(Object, Object[])    224 21.374
                              sun.reflect.DelegatingMethodAccessorImpl.invoke(Object, Object[]) 224 21.374
                                 sun.reflect.GeneratedMethodAccessor135.invoke(Object, Object[])    224 21.374
                                    com.baml.gfx.pte.regression.PTERegression.db()  224 21.374
                                       com.baml.gfx.pte.regression.PTERegression.areIdenticalNodes(String, String, DifferenceListener, ElementQualifier)    224 21.374
                                          org.custommonkey.xmlunit.DetailedDiff.<init>(Diff)    94  8.969
                                          org.custommonkey.xmlunit.Diff.<init>(String, String)  84  8.015
                                          org.custommonkey.xmlunit.Diff.similar()   46  4.389
                        org.testng.internal.Invoker.invokeConfigurations(IClass, ITestNGMethod, ITestNGMethod[], XmlSuite, Map, Object[], Object, ITestResult)  1   0.095
                        org.testng.internal.MethodHelper.findExpectedExceptions(IAnnotationFinder, Method)  1   0.095
                  org.testng.internal.Invoker.filterMethods(IClass, ITestNGMethod[], Invoker$Predicate) 1   0.095

21.66% is spent on executing the test, and 61% is spent inserting into the graph set. (the rest is spent on our caching implementation, which suffered a disconnect to our dev environment and is busy rebuilding it's caches).

Do you know why we would be inserting into that graph?

benlamonica commented 9 years ago

It looks like the hashCode is based on the method, which would be the same for 100,000 entries. So each time it adds to the HashSet, it has to iterate over 100,000 instances in order to find that the object is not in the set? I will experiment with making the hashCode more unique...

cbeust commented 9 years ago

Can you turning off parallelism completely? I'm curious what this will change.

benlamonica commented 9 years ago

I turned off parallelism and now it's running about 1000 tests every 15 seconds. I don't see any of the delays I was seeing before.

rschmitt commented 9 years ago

Doesn't sound like deadlock so much as severe lock contention.

On Tue, Aug 4, 2015 at 5:04 PM, Ben La Monica notifications@github.com wrote:

I turned off parallelism and now it's running about 1000 tests every 15 seconds. I don't see any of the delays I was seeing before.

— Reply to this email directly or view it on GitHub https://github.com/cbeust/testng/issues/772#issuecomment-127800321.

cbeust commented 9 years ago

I turned off parallelism and now it's running about 1000 tests every 15 seconds. I don't see any of the delays I was seeing before.

That was my intuition, glad to see it confirmed. Like @rschmitt is saying, it's most likely a contention problem.

benlamonica commented 9 years ago

Here is a simplified example that illustrates the problem.

import static org.junit.Assert.assertNotNull;

import java.util.concurrent.atomic.AtomicInteger;

import org.testng.annotations.AfterMethod;
import org.testng.annotations.DataProvider;
import org.testng.annotations.Factory;
import org.testng.annotations.Test;

public class ContentionTest {

    private Integer testData;

    @Factory(dataProvider="data")
    public ContentionTest(Integer data) {
        testData = data;
    }

    @DataProvider(name="data", parallel=true)
    public static Object[][] getData() {
        Integer[][] data = new Integer[600000][1];
        for(int i = 0; i < 600000; i++) {
            data[i] = new Integer[] { i };
        }
        return data;
    }
    private static AtomicInteger testCount = new AtomicInteger(0);

    @AfterMethod
    public void countTests() {
        int count = testCount.incrementAndGet();
        if (count % 1000 == 0) {
            System.out.println(count + " tests completed.");
        }
    }

    @Test
    public void one() {
        assertNotNull("obj", testData);
    }

    @Test
    public void two() {
        assertNotNull("obj", testData);
    }

    @Test
    public void three() {
        assertNotNull("obj", testData);
    }

    @Test
    public void four() {
        assertNotNull("obj", testData);
    }

    @Test
    public void five() {
        assertNotNull("obj", testData);
    }

    @Test
    public void six() {
        assertNotNull("obj", testData);
    }
}

And the suite:

<suite name="contention suite">

  <test name="contention tests" parallel="methods" thread-count="6">
    <classes>
      <class name="ContentionTest"/>
    </classes>
  </test>
</suite>
benlamonica commented 9 years ago

Unfortunately I don't have any more time to devote to this issue. I'm going to have to go and find a different solution. Let me know if I can contribute any more info to you. Thanks for the great testing framework!

cbeust commented 9 years ago

Thanks for the PR and your efforts, @benlamonica, I'll take it from there!

testn commented 9 years ago

@cbeust, @juherr the similar fix of #775 has been broken down into #848, #849, #850, #852. Each one of them improves different path of the code path that makes #772 to take such a long time to execute. If you run #772 tests on the code before and after the fix, you will see that the time it takes to execute the tests are significantly reduced.

juherr commented 9 years ago

@testn The problem is we have no proof it or any idea how much we can. Then, it will take time to reproduce and test it on our side. Last time I tried, I had to stop before the end.

The hardest and most important thing with perf issues are not the fix, but the way to reproduce and to measure it easily.

But maybe you can help us?

testn commented 9 years ago

@juherr, just try it with a lower number. In my case, 6000 is already big enough to see the problem. If you want to start validating issues, you probably need to create a lot of test cases for performance testing in which I don't see it much in TestNG.

testn commented 9 years ago

@benlamonica , can you help provide the result of the test?

shaburov commented 7 years ago

v 6.9.13.3 In testNG no mechanisms for forced termination of the child threads. Because of this, the execution of tests in a loop (stability test) simply hangs in the case of deadlock.

Exception in thread "pool-309-thread-5" java.lang.ArrayIndexOutOfBoundsException: 49
        at java.util.ArrayList.fastRemove(ArrayList.java:546)
        at java.util.ArrayList.remove(ArrayList.java:529)
        at org.testng.internal.thread.graph.GraphThreadPoolExecutor.setStatus(GraphThreadPoolExecutor.java:82)
        at org.testng.internal.thread.graph.GraphThreadPoolExecutor.afterExecute(GraphThreadPoolExecutor.java:94)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1150)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

image

juherr commented 7 years ago

@shaburov Do you have a way to reproduce it every time?

shaburov commented 7 years ago

@juherr Absolutely floating defect. But since my tests are performed continuously in the monitoring mode, the reproduced defect in the interval 2-24 hours.

But I can guarantee that it is reproduced with the following settings: testNG.setSuiteThreadPoolSize (< equal to the number launched suites >); XmlSuite.setThreadCount (<5 and more>);

The error is not difficult to find. image

juherr commented 7 years ago

Collections.synchronizedList() on https://github.com/cbeust/testng/blame/master/src/main/java/org/testng/internal/thread/graph/GraphThreadPoolExecutor.java#L29 is smarter IMO.

Is the fix resolving your issue?

You made the most difficult part of the job, why not proposing a pull request then? :)

shaburov commented 7 years ago

@juherr Pull request https://github.com/cbeust/testng/pull/1305