Closed owaiskazi19 closed 10 months ago
Tests were failing on JDK21 before that PR was even submitted: https://github.com/opensearch-project/flow-framework/actions/workflows/test_security.yml
See for example this run prior to the #421 PR: https://github.com/opensearch-project/flow-framework/actions/runs/7589890744
First of the recent failures was on the initial PR for this bug fix: https://github.com/opensearch-project/flow-framework/actions/runs/7555230555
But it passed later from that same PR: https://github.com/opensearch-project/flow-framework/actions/runs/7561855053
And then started failing again for this PR that didn't even change the code: https://github.com/opensearch-project/flow-framework/actions/runs/7577073358
So it's flaky, but don't assume it's caused by a recent change.
Have investigated multiple possibilities:
Logs appear to show the root cause is authentication:
? WARN ][o.o.d.FileBasedSeedHostsProvider] [integTest-0] expected, but did not find, a dynamic hosts list at [/__w/flow-framework/flow-framework/build/testclusters/integTest-0/config/unicast_hosts.txt]
? WARN ][o.o.s.a.BackendRegistry ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55912
? WARN ][o.o.s.a.BackendRegistry ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55914
? WARN ][o.o.s.a.BackendRegistry ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55922
<many more>
One big clue: tests went from all green to flaky to all failing, only on JDK21. Investigated JDK version on the failed runners:
Will try a different distro....
Ran a matrix with temurin and zulu, 21.0.1 and 21.0.2. Both distros passed on 21.0.1, both distros failed on 21.0.2.
Possible bug in JDK 21.0.2. Tagging @sormuras to see if he has any idea what this could be.
Near term fix: update security integ test matrix to specify 21.0.1 version instead of just 21. @owaiskazi19 can you include this in one of your open PRs?
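For comparing runs, it may help to have each test job log the exact runtime patch version. The following is only a minimal sketch: `JdkPatchCheck` and `isSuspect` are hypothetical names, not part of this repo, and flagging only 21.0.2 is an assumption based on the matrix result above.

```java
final class JdkPatchCheck {
    // Assumption: only 21.0.2 is flagged, because that is the patch
    // version that failed in the temurin/zulu matrix described above.
    static boolean isSuspect(Runtime.Version v) {
        return v.feature() == 21 && v.interim() == 0 && v.update() == 2;
    }

    public static void main(String[] args) {
        // Runtime.version() reports the version of the running JVM.
        Runtime.Version current = Runtime.version();
        System.out.println("Running on JDK " + current + ", suspect=" + isSuspect(current));
    }
}
```

Printing this at the start of a test run would make it easier to tell which patch version a runner actually picked up when the matrix only asks for `21`.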
Here's the 21.0.2 release notes: https://www.oracle.com/java/technologies/javase/21-0-2-relnotes.html
Some java.nio ones might be suspicious, but I'm way out of my depth here...
@dbwiddis Would it be possible to provide instructions on how to reproduce this? Is it just a matter of cloning this repo and running the tests? Asking so that folks working on the JDK can quickly see if this is a JDK bug or not. Maybe you could come to the OpenJDK net-dev mailing list with instructions so this potential regression can be diagnosed? It would also be useful to know which operating system this is on and whether it duplicates with JDK 22 and JDK 23 EA builds.
Skimmed the issue tracker for 21.0.2 for socket entries: https://bugs.openjdk.org/issues/?jql=project%20%3D%20JDK%20AND%20fixVersion%20%3D%2021.0.2%20AND%20text%20~%20%22socket%22 ... seems unrelated to me.
@dblock @reta @nknize Just an FYI, nothing actionable yet
If you start to see this pop up elsewhere, wanted you to be aware.
We're using performRequest() via Rest High Level Client as part of our integ tests, and something broke with the latest JDK patch version (released Jan 16, has made its way to GHA runners by now), only on the security-enabled tests, where those client requests are failing auth.
Thanks @dbwiddis, the issue was brought up and listed as a risk here: https://github.com/opensearch-project/OpenSearch/issues/11906. We haven't made any updates in our codebase yet, but you are very right, the GA may already use the latest version. We probably should do 2 things:
21.0.1 as the version (GA supports that, large change but not much we can do)
21.0.2 from OpenSearch runtimes (we do have mechanism implemented, I will follow up with the issue)
@owaiskazi19 @joshpalis @amitgalitz @ohltyler @jackiehanyang FYI:
JDK issue in question: https://bugs.openjdk.org/browse/JDK-8323659 (technically correct but not backwards compatible behavior)
@dbwiddis Attempts to duplicate this so far have failed. Would it be possible to come to the OpenJDK net-dev mailing list? If there is a regression in 21.0.2 then we'd like to track it down.
@AlanBateman we believe this is an impact of https://bugs.openjdk.org/browse/JDK-8323659
@AlanBateman Failure has occurred independently from this repo in another repo's integration tests: https://github.com/opensearch-project/skills/actions/runs/7617744945/job/20747440932?pr=140
The commonality in the logs is an entry
[2024-01-22T14:59:53,046][INFO ][o.o.i.NeuralSparseSearchToolIT] [testNeuralSparseSearchToolInFlowAgent] There are still tasks running after this test that might break subsequent tests [indices:data/read/search, indices:data/read/search[phase/query], indices:data/write/bulk, indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index, indices:data/write/update, indices:data/write/update[s]].
It occurs with both fixed and scaling thread pools when the queue is invoked, which is a LinkedTransferQueue(), which matches https://bugs.openjdk.org/browse/JDK-8323659
So I'm fairly certain no further debugging is required. I'll leave it up to you to assess the impact and when to release the fix.
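For anyone who wants to check a given JDK build without running the full integ tests, here is a small probe. It is only a sketch of my understanding of JDK-8323659 (that `add()` and `put()` on `LinkedTransferQueue` end up calling the overridable `offer()` on affected builds). `PutOfferProbe` and `RefusingQueue` are hypothetical names made up for this example and are not the actual OpenSearch thread pool queue.

```java
import java.util.concurrent.LinkedTransferQueue;
import java.util.concurrent.atomic.AtomicInteger;

final class PutOfferProbe {
    /** Queue whose offer() always refuses, loosely modeled on executor queues that override offer(). */
    static final class RefusingQueue<E> extends LinkedTransferQueue<E> {
        final AtomicInteger offerCalls = new AtomicInteger();

        @Override
        public boolean offer(E e) {
            offerCalls.incrementAndGet();
            return false; // never enqueue through offer()
        }
    }

    /** Returns {queue size after put(), number of offer() calls observed}. */
    static int[] probe() {
        RefusingQueue<String> q = new RefusingQueue<>();
        q.put("task"); // LinkedTransferQueue is unbounded, so put() does not block
        return new int[] { q.size(), q.offerCalls.get() };
    }

    public static void main(String[] args) {
        int[] r = probe();
        System.out.println("JDK " + Runtime.version() + ": size=" + r[0] + ", offerCalls=" + r[1]);
    }
}
```

If `put()` enqueues directly, the size should be 1 with no `offer()` calls; if `put()` goes through the overridden `offer()`, the element is silently dropped and the call counter is non-zero. Running it on 21.0.1 and 21.0.2 side by side should show whether the two builds differ in this respect.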
What is the bug?
Tests are failing after https://github.com/opensearch-project/flow-framework/pull/421 got merged.
See the instances for more logs:
How can one reproduce the bug?
./gradlew integTest -Dsecurity.enabled=true
What is the expected behavior?
Tests should pass.
What is your host/environment?
ubuntu-latest