mikemccand / stargazers-migration-test

Testing Lucene's Jira -> GitHub issues migration
0 stars 0 forks source link

disable java.util.HashMap assertions to avoid spurious vailures due to JDK-8205399 [LUCENE-8991] #988

Closed mikemccand closed 4 years ago

mikemccand commented 5 years ago

An incredibly common class of jenkins failure (at least in Solr tests) stems from triggering assertion failures in java.util.HashMap – evidently triggering bug JDK-8205399, first introduced in java-10, and fixed in java-12, but has never been backported to any java-10 or java-11 bug fix release...

https://bugs.openjdk.java.net/browse/JDK-8205399

SOLR-13653 tracks how this bug can affect Solr users, but I think it would make sense to disable java.util.HashMap in our build system to reduce the confusing failures when users/jenkins runs tests, since there is nothing we can do to work around this when testing with java-11 (or java-10 on branch_8x)


Legacy Jira details

LUCENE-8991 by Chris M. Hostetter on Sep 26 2019, resolved Oct 07 2019 Attachments: LUCENE-8991.patch (versions: 2) Linked issues:

mikemccand commented 5 years ago

FYI: I tried asking Rory about the open-jdk backporting process and why the fix for JDK-8205399 had never been backported for inclusion in 11.0.2 or 11.0.3 (or at this point 11.0.4) given how long the issue is been known relative to when those releases came out, but never got a response...

http://mail-archives.apache.org/mod_mbox/lucene-dev/201907.mbox/%3calpine.DEB.2.11.1907251029100.10893@tray%3e


An example of how this type of bug can manifest in our tests, from a recent jenkins failure...

   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] ERROR   13.4s J1 | TestCloudJSONFacetSKG.testRandom <<<
   [junit4]    > Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:43673/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection: Expected mime type application/octet-stream but got text/html. <html>
   [junit4]    > <head>
   [junit4]    > <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
   [junit4]    > <title>Error 500 Server Error</title>
   [junit4]    > </head>
   [junit4]    > <body><h2>HTTP ERROR 500</h2>
   [junit4]    > <p>Problem accessing /solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection/select. Reason:
   [junit4]    > <pre>    Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.AssertionError
   [junit4]    >        at java.base/java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1896)
   [junit4]    >        at java.base/java.util.HashMap$TreeNode.putTreeVal(HashMap.java:2061)
   [junit4]    >        at java.base/java.util.HashMap.putVal(HashMap.java:633)
   [junit4]    >        at java.base/java.util.HashMap.put(HashMap.java:607)
   [junit4]    >        at org.apache.solr.search.LRUCache.putCacheValue(LRUCache.java:295)
   [junit4]    >        at org.apache.solr.search.LRUCache.put(LRUCache.java:268)
   [junit4]    >        at org.apache.solr.search.SolrCacheHolder.put(SolrCacheHolder.java:92)

TestCloudJSONFacetSKG seems to trigger this assertion bug a lot, but I've also seen it pop up in other tests. I haven't really dug into the details of JDK-8205399, but I suspect the size/balancing/rebalancing of the HashMap is what tickles the affected code path, so (i guess) tests that result in largeish HashMaps seem more likely to trigger it?


The attached path is a simple change to lucene/common-build.xml to modify our existing use of "\-ea \-esa" into "\-ea \-esa \-da:java.util.HashMap"

Any objections?

[Legacy Jira: Chris M. Hostetter on Sep 26 2019]

mikemccand commented 5 years ago

Anything that improves test stability is welcome of course....

 

May I ask that you add the fact that it's been fixed in Java 12 to the comment in the code? Just so 3 years from now when someone looks at it after we go to JDK12+ as a minimum requirement it's obvious that it can be removed.

[Legacy Jira: Erick Erickson (@ErickErickson) on Sep 26 2019]

mikemccand commented 5 years ago
-1 overall
Vote Subsystem Runtime Comment
Prechecks
-1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
master Compile Tests
+1 compile 0m 0s master passed
Patch Compile Tests
+1 compile 0m 0s the patch passed
+1 javac 0m 0s the patch passed
+1 Release audit (RAT) 0m 0s the patch passed
+1 Validate source patterns 0m 0s the patch passed
Other Tests
0m 53s
Subsystem Report/Notes
JIRA Issue LUCENE-8991
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12981459/LUCENE-8991.patch
Optional Tests compile javac unit ratsources validatesourcepatterns
uname Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / ec9780c8aad
ant version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019
Default Java LTS
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/207/testReport/
modules C: lucene U: lucene
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/207/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 26 2019]

mikemccand commented 5 years ago

I'll do you one better Erick...

In the updated patch, -da:java.util.HashMap is used if and only if this appears to be openjdk based JVM with spec version 10 or 11 (same logic as used in the existing documentation-lint.supported condition) and tests.asserts.hashmap is "false" (or unset) ... so people can override it this logic with a single -Dtests.asserts.hashmap=true.

which means on trunk today:

1) this line fails fairly reliably for me using openjdk 11.0.4...

ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8

..but with the patch applied it passes reliably for me, indicating that the disabling the assertions on HashMap prevented the failure.

2) with the patch applied and the override prop added...

ant test  -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.asserts.hashmap=true

...it starts to fail reliably for me again, indicating that the override works

[Legacy Jira: Chris M. Hostetter on Sep 27 2019]

mikemccand commented 5 years ago
-1 overall
Vote Subsystem Runtime Comment
Prechecks
-1 test4tests 0m 0s The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
master Compile Tests
+1 compile 0m 0s master passed
Patch Compile Tests
+1 compile 0m 0s the patch passed
+1 javac 0m 0s the patch passed
+1 Release audit (RAT) 0m 0s the patch passed
+1 Validate source patterns 0m 0s the patch passed
Other Tests
0m 38s
Subsystem Report/Notes
JIRA Issue LUCENE-8991
JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12981506/LUCENE-8991.patch
Optional Tests compile javac unit ratsources validatesourcepatterns
uname Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Build tool ant
Personality /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh
git revision master / a9cf5f6
ant version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018
Default Java LTS
Test Results https://builds.apache.org/job/PreCommit-LUCENE-Build/208/testReport/
modules C: lucene U: lucene
Console output https://builds.apache.org/job/PreCommit-LUCENE-Build/208/console
Powered by Apache Yetus 0.7.0 http://yetus.apache.org

This message was automatically generated.

[Legacy Jira: Lucene/Solr QA on Sep 27 2019]

mikemccand commented 5 years ago

Commit 10da07a396777e3e7cfb091c5dec826b6df11284 in lucene-solr's branch refs/heads/master from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=10da07a

LUCENE-8991: disable java.util.HashMap assertions to avoid spurious vailures due to JDK-8205399

[Legacy Jira: ASF subversion and git services on Oct 02 2019]

mikemccand commented 5 years ago

Commit 60b9ec0866e6223afb269fa377203f731cca2973 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=60b9ec0

LUCENE-8991: disable java.util.HashMap assertions to avoid spurious vailures due to JDK-8205399

(cherry picked from commit 10da07a396777e3e7cfb091c5dec826b6df11284)

[Legacy Jira: ASF subversion and git services on Oct 03 2019]

mikemccand commented 4 years ago

This just popped up again in thetaphi_Lucene-Solr-8.x-Linux_1352.log.txt, using 1.8.0_201 which doesn't jive w/my original understanding of the issue...

But going back down the rabbit hole of JDK-8205399 (where the assertion failure is discussed/fixed) I see now that even though that issue says the problem started in java10, looking at the linked issue JDK-8186171 (where the change was made that actually caused the assertion to start failing) i see it has other linked issues indicating that the (broken) change was actually backported to "9u-open" and "8u201"

So i'm going to revisit this change and fix common-build.xml to check for any java less then 12.

[Legacy Jira: Chris M. Hostetter on Oct 07 2019]

mikemccand commented 4 years ago

Commit bc0652ecc09ba6b82d86a8050b541a96f2a0f888 in lucene-solr's branch refs/heads/master from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc0652e

LUCENE-8991: disable HashMap assertions (by default) on java9 and java1.8 as well

[Legacy Jira: ASF subversion and git services on Oct 07 2019]

mikemccand commented 4 years ago

Commit 48e19ab4c3af42058ac29a92983163cdf8a147f0 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=48e19ab

LUCENE-8991: disable HashMap assertions (by default) on java9 and java1.8 as well

(cherry picked from commit bc0652ecc09ba6b82d86a8050b541a96f2a0f888)

[Legacy Jira: ASF subversion and git services on Oct 07 2019]

mikemccand commented 4 years ago

So i'm going to revisit this change and fix common-build.xml to check for any java less then 12.

ant doesn't make it easy to do "less than" and java8's spec version was "1.8", so an integer comparison wasn't going to work anyway ... so i just added "9" and "1.8" to the conditional "or".

tested again using java11 and java12 on master AND using java8 on branch8x, both with and w/o the new -Dtests.asserts.hashmap=true

[Legacy Jira: Chris M. Hostetter on Oct 07 2019]

mikemccand commented 4 years ago

Commit bc0652ecc09ba6b82d86a8050b541a96f2a0f888 in lucene-solr's branch refs/heads/jira/SOLR-13821 from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc0652e

LUCENE-8991: disable HashMap assertions (by default) on java9 and java1.8 as well

[Legacy Jira: ASF subversion and git services on Oct 08 2019]

mikemccand commented 2 years ago

Closing after the 9.0.0 release

[Legacy Jira: Adrien Grand (@jpountz) on Dec 08 2021]