Closed mikemccand closed 4 years ago
FYI: I tried asking Rory about the open-jdk backporting process and why the fix for JDK-8205399 had never been backported for inclusion in 11.0.2 or 11.0.3 (or at this point 11.0.4) given how long the issue is been known relative to when those releases came out, but never got a response...
An example of how this type of bug can manifest in our tests, from a recent jenkins failure...
[junit4] 2> NOTE: reproduce with: ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
[junit4] ERROR 13.4s J1 | TestCloudJSONFacetSKG.testRandom <<<
[junit4] > Throwable #1: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at https://127.0.0.1:43673/solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection: Expected mime type application/octet-stream but got text/html. <html>
[junit4] > <head>
[junit4] > <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
[junit4] > <title>Error 500 Server Error</title>
[junit4] > </head>
[junit4] > <body><h2>HTTP ERROR 500</h2>
[junit4] > <p>Problem accessing /solr/org.apache.solr.search.facet.TestCloudJSONFacetSKG_collection/select. Reason:
[junit4] > <pre> Server Error</pre></p><h3>Caused by:</h3><pre>java.lang.AssertionError
[junit4] > at java.base/java.util.HashMap$TreeNode.moveRootToFront(HashMap.java:1896)
[junit4] > at java.base/java.util.HashMap$TreeNode.putTreeVal(HashMap.java:2061)
[junit4] > at java.base/java.util.HashMap.putVal(HashMap.java:633)
[junit4] > at java.base/java.util.HashMap.put(HashMap.java:607)
[junit4] > at org.apache.solr.search.LRUCache.putCacheValue(LRUCache.java:295)
[junit4] > at org.apache.solr.search.LRUCache.put(LRUCache.java:268)
[junit4] > at org.apache.solr.search.SolrCacheHolder.put(SolrCacheHolder.java:92)
TestCloudJSONFacetSKG seems to trigger this assertion bug a lot, but I've also seen it pop up in other tests. I haven't really dug into the details of JDK-8205399, but I suspect the size/balancing/rebalancing of the HashMap is what tickles the affected code path, so (i guess) tests that result in largeish HashMaps seem more likely to trigger it?
The attached path is a simple change to lucene/common-build.xml to modify our existing use of "\-ea \-esa"
into "\-ea \-esa \-da:java.util.HashMap"
Any objections?
[Legacy Jira: Chris M. Hostetter on Sep 26 2019]
Anything that improves test stability is welcome of course....
May I ask that you add the fact that it's been fixed in Java 12 to the comment in the code? Just so 3 years from now when someone looks at it after we go to JDK12+ as a minimum requirement it's obvious that it can be removed.
[Legacy Jira: Erick Erickson (@ErickErickson) on Sep 26 2019]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
master Compile Tests | |||
+1 | compile | 0m 0s | master passed |
Patch Compile Tests | |||
+1 | compile | 0m 0s | the patch passed |
+1 | javac | 0m 0s | the patch passed |
+1 | Release audit (RAT) | 0m 0s | the patch passed |
+1 | Validate source patterns | 0m 0s | the patch passed |
Other Tests | |||
0m 53s |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8991 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12981459/LUCENE-8991.patch |
Optional Tests | compile javac unit ratsources validatesourcepatterns |
uname | Linux lucene1-us-west 4.15.0-54-generic #58-Ubuntu SMP Mon Jun 24 10:55:24 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / ec9780c8aad |
ant | version: Apache Ant(TM) version 1.10.5 compiled on March 28 2019 |
Default Java | LTS |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/207/testReport/ |
modules | C: lucene U: lucene |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/207/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Sep 26 2019]
I'll do you one better Erick...
In the updated patch, -da:java.util.HashMap
is used if and only if this appears to be openjdk based JVM with spec version 10 or 11 (same logic as used in the existing documentation-lint.supported
condition) and tests.asserts.hashmap
is "false" (or unset) ... so people can override it this logic with a single -Dtests.asserts.hashmap=true
.
which means on trunk today:
1) this line fails fairly reliably for me using openjdk 11.0.4...
ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8
..but with the patch applied it passes reliably for me, indicating that the disabling the assertions on HashMap prevented the failure.
2) with the patch applied and the override prop added...
ant test -Dtestcase=TestCloudJSONFacetSKG -Dtests.method=testRandom -Dtests.seed=3136E77C0EDA0575 -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es-SV -Dtests.timezone=PST8PDT -Dtests.asserts=true -Dtests.file.encoding=UTF-8 -Dtests.asserts.hashmap=true
...it starts to fail reliably for me again, indicating that the override works
[Legacy Jira: Chris M. Hostetter on Sep 27 2019]
❌ -1 overall |
---|
Vote | Subsystem | Runtime | Comment |
---|---|---|---|
Prechecks | |||
-1 | test4tests | 0m 0s | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
master Compile Tests | |||
+1 | compile | 0m 0s | master passed |
Patch Compile Tests | |||
+1 | compile | 0m 0s | the patch passed |
+1 | javac | 0m 0s | the patch passed |
+1 | Release audit (RAT) | 0m 0s | the patch passed |
+1 | Validate source patterns | 0m 0s | the patch passed |
Other Tests | |||
0m 38s |
Subsystem | Report/Notes |
---|---|
JIRA Issue | LUCENE-8991 |
JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12981506/LUCENE-8991.patch |
Optional Tests | compile javac unit ratsources validatesourcepatterns |
uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
Build tool | ant |
Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
git revision | master / a9cf5f6 |
ant | version: Apache Ant(TM) version 1.9.6 compiled on July 20 2018 |
Default Java | LTS |
Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/208/testReport/ |
modules | C: lucene U: lucene |
Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/208/console |
Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |
This message was automatically generated.
[Legacy Jira: Lucene/Solr QA on Sep 27 2019]
Commit 10da07a396777e3e7cfb091c5dec826b6df11284 in lucene-solr's branch refs/heads/master from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=10da07a
LUCENE-8991: disable java.util.HashMap assertions to avoid spurious vailures due to JDK-8205399
[Legacy Jira: ASF subversion and git services on Oct 02 2019]
Commit 60b9ec0866e6223afb269fa377203f731cca2973 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=60b9ec0
LUCENE-8991: disable java.util.HashMap assertions to avoid spurious vailures due to JDK-8205399
(cherry picked from commit 10da07a396777e3e7cfb091c5dec826b6df11284)
[Legacy Jira: ASF subversion and git services on Oct 03 2019]
This just popped up again in thetaphi_Lucene-Solr-8.x-Linux_1352.log.txt, using 1.8.0_201 which doesn't jive w/my original understanding of the issue...
But going back down the rabbit hole of JDK-8205399 (where the assertion failure is discussed/fixed) I see now that even though that issue says the problem started in java10, looking at the linked issue JDK-8186171 (where the change was made that actually caused the assertion to start failing) i see it has other linked issues indicating that the (broken) change was actually backported to "9u-open" and "8u201"
So i'm going to revisit this change and fix common-build.xml to check for any java less then 12.
[Legacy Jira: Chris M. Hostetter on Oct 07 2019]
Commit bc0652ecc09ba6b82d86a8050b541a96f2a0f888 in lucene-solr's branch refs/heads/master from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc0652e
LUCENE-8991: disable HashMap assertions (by default) on java9 and java1.8 as well
[Legacy Jira: ASF subversion and git services on Oct 07 2019]
Commit 48e19ab4c3af42058ac29a92983163cdf8a147f0 in lucene-solr's branch refs/heads/branch_8x from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=48e19ab
LUCENE-8991: disable HashMap assertions (by default) on java9 and java1.8 as well
(cherry picked from commit bc0652ecc09ba6b82d86a8050b541a96f2a0f888)
[Legacy Jira: ASF subversion and git services on Oct 07 2019]
So i'm going to revisit this change and fix common-build.xml to check for any java less then 12.
ant doesn't make it easy to do "less than" and java8's spec version was "1.8", so an integer comparison wasn't going to work anyway ... so i just added "9" and "1.8" to the conditional "or".
tested again using java11 and java12 on master AND using java8 on branch8x, both with and w/o the new -Dtests.asserts.hashmap=true
[Legacy Jira: Chris M. Hostetter on Oct 07 2019]
Commit bc0652ecc09ba6b82d86a8050b541a96f2a0f888 in lucene-solr's branch refs/heads/jira/SOLR-13821 from Chris M. Hostetter https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=bc0652e
LUCENE-8991: disable HashMap assertions (by default) on java9 and java1.8 as well
[Legacy Jira: ASF subversion and git services on Oct 08 2019]
Closing after the 9.0.0 release
[Legacy Jira: Adrien Grand (@jpountz) on Dec 08 2021]
An incredibly common class of jenkins failure (at least in Solr tests) stems from triggering assertion failures in java.util.HashMap – evidently triggering bug JDK-8205399, first introduced in java-10, and fixed in java-12, but has never been backported to any java-10 or java-11 bug fix release...
https://bugs.openjdk.java.net/browse/JDK-8205399
SOLR-13653 tracks how this bug can affect Solr users, but I think it would make sense to disable java.util.HashMap in our build system to reduce the confusing failures when users/jenkins runs tests, since there is nothing we can do to work around this when testing with java-11 (or java-10 on branch_8x)
Legacy Jira details
LUCENE-8991 by Chris M. Hostetter on Sep 26 2019, resolved Oct 07 2019 Attachments: LUCENE-8991.patch (versions: 2) Linked issues: