opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.76k stars 1.82k forks source link

[BUG] Fix flaky org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse test #4259

Open reta opened 2 years ago

reta commented 2 years ago

Describe the bug Another flaky test, introduced by https://github.com/opensearch-project/OpenSearch/pull/3921/files

@Bukhtawar could you please take a look?

To Reproduce

REPRODUCE WITH: ./gradlew ':server:test' --tests "org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse" -Dtests.seed=96F3F0BE216E9128 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=zh-Hant-TW -Dtests.timezone=America/Paramaribo -Druntime.java=17

Expected behavior Test must always pass

Plugins Standard

Screenshots

<h3>Error Message</h3><pre>java.lang.AssertionError</pre><h3>Stacktrace</h3><pre>java.lang.AssertionError
    at __randomizedtesting.SeedInfo.seed([96F3F0BE216E9128:D64BFD8616D1599]:0)
    at org.junit.Assert.fail(Assert.java:87)
    at org.junit.Assert.assertTrue(Assert.java:42)
    at org.junit.Assert.assertTrue(Assert.java:53)
    at org.opensearch.search.PitMultiNodeTests$1.onNodeStopped(PitMultiNodeTests.java:99)
    at org.opensearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:1028)
    at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:1931)
    at org.opensearch.test.InternalTestCluster.restartRandomDataNode(InternalTestCluster.java:1879)
    at org.opensearch.search.PitMultiNodeTests.testCreatePitWhileNodeDropWithAllowPartialCreationFalse(PitMultiNodeTests.java:93)</pre>

Host/Environment (please complete the following information):

Additional context See please https://build.ci.opensearch.org/job/gradle-check/1902/testReport/junit/org.opensearch.search/PitMultiNodeTests/testCreatePitWhileNodeDropWithAllowPartialCreationFalse/

bharath-techie commented 1 year ago

https://github.com/opensearch-project/OpenSearch/pull/4632/files -- this has the fix and it was merged already. I was not able to reproduce the issue with 500 iterationss in local.

Post 2.x I don't see this issue being tagged as well. We can close the issue if this issue is not seen in the latest builds.

dbwiddis commented 1 year ago

NOTE: The fix in #4632 was for the initial report in #4089, not for this issue. This bug still exists and is referenced in comments starting here

dbwiddis commented 1 year ago

Copying a summary of comment over from #4089 where several repro logs and other comments were made. See that issue for background.

Have spent several hours digging into this issue.

For the create failure; it's obviously a race condition and close to a Heisenbug. While regularly reproducible at least once in a run of size 500, adding logging extended successful runs with a failure on run 2463.

Commonalities when it fails:

Tracing through the debug logs I added, I'm at the "how did this ever work" phase of debugging. The boolean allowPartialSearchResults is never referred to in the create workflow called as part of this. In particular, the PIT creation failure occurs in this block of code: https://github.com/opensearch-project/OpenSearch/blob/ea1cc9d054b2d7cf0046438ecc57777e29112c2b/server/src/main/java/org/opensearch/action/search/CreatePitController.java#L123-L153

The getter searchRequest.allowPartialSearchResults() is never referred to here; call hierarchy indicates it's called on multi phase requests at the beginning of AbstractSearchAsyncAction.executeNextPhase() but there is no reference to it in SinglePhaseSearchAction which is called here. The SearchRequest variable request isn't even passed to the constructor, just the search task.

Looks like this code was implemented in #3921. @bharath-techie do you have any observations/comments that can further help debug this?

dbwiddis commented 1 year ago

Digging into this, the original commit #3291 did not cause this failure; the unneeded test was added by #4030. I'll back out the line causing the failures.

reta commented 8 months ago
java.lang.AssertionError: expected:<0> but was:<1>
    at __randomizedtesting.SeedInfo.seed([561A7A0B4C9B2562:CD8D356D0C98A1D3]:0)
    at org.junit.Assert.fail(Assert.java:89)
    at org.junit.Assert.failNotEquals(Assert.java:835)
    at org.junit.Assert.assertEquals(Assert.java:647)
    at org.junit.Assert.assertEquals(Assert.java:633)
    at org.opensearch.search.pit.PitMultiNodeIT.validatePitStats(PitMultiNodeIT.java:363)
    at org.opensearch.search.pit.PitMultiNodeIT$1.onNodeStopped(PitMultiNodeIT.java:118)
    at org.opensearch.test.InternalTestCluster$NodeAndClient.closeForRestart(InternalTestCluster.java:1072)
    at org.opensearch.test.InternalTestCluster.restartNode(InternalTestCluster.java:2032)
    at org.opensearch.test.InternalTestCluster.restartRandomDataNode(InternalTestCluster.java:1980)
    at org.opensearch.search.pit.PitMultiNodeIT.testCreatePitWhileNodeDropWithAllowPartialCreationFalse(PitMultiNodeIT.java:112)
    at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
    at java.base/java.lang.reflect.Method.invoke(Method.java:580)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
    at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:369)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
    at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
    at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
    at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
    at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
    at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
    at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
    at org.junit.rules.RunRules.evaluate(RunRules.java:20)
    at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
    at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
    at java.base/java.lang.Thread.run(Thread.java:1583)

The issue is back: