opensearch-project / flow-framework

OpenSearch plugin that enables builders to innovate AI apps on OpenSearch
Apache License 2.0
33 stars 36 forks source link

[BUG] Failing tests for JDK 21.0.2 #426

Closed owaiskazi19 closed 10 months ago

owaiskazi19 commented 10 months ago

What is the bug?

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.flowframework.rest.FlowFrameworkSecureRestApiIT.testCreateProvisionDeprovisionWorkflowWithFullAccess" -Dtests.seed=37D0F3650C59B32F -Dtests.security.manager=false -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Indian/Kerguelen -Druntime.java=21
org.opensearch.flowframework.rest.FlowFrameworkSecureRestApiIT > testCreateProvisionDeprovisionWorkflowWithFullAccess FAILED
    java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
        at __randomizedtesting.SeedInfo.seed([37D0F3650C59B32F:BE62CDF6F65B5600]:0)
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:937)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:333)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:116)
        at org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:89)
        at org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:78)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.deprovisionWorkflow(FlowFrameworkRestTestCase.java:351)
        at org.opensearch.flowframework.rest.FlowFrameworkSecureRestApiIT.testCreateProvisionDeprovisionWorkflowWithFullAccess(FlowFrameworkSecureRestApiIT.java:127)

        Caused by:
        java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
            at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
            at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
            at java.base/java.lang.Thread.run(Thread.java:1583)

Tests are failing after https://github.com/opensearch-project/flow-framework/pull/421 got merged.

See the instances for more logs:

  1. https://github.com/opensearch-project/flow-framework/actions/runs/7590930291/job/20678264507?pr=425
  2. https://github.com/opensearch-project/flow-framework/actions/runs/7590847338/job/20678051272

How can one reproduce the bug?

  1. Run security enabled integ tests on JDK 21. ./gradlew integTest -Dsecurity.enabled=true

What is the expected behavior?

Test should pass

What is your host/environment?

ubuntu-latest

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

Add any other context about the problem.

dbwiddis commented 10 months ago

Tests were failing on JDK21 before that PR was even submitted: https://github.com/opensearch-project/flow-framework/actions/workflows/test_security.yml

See for example this run prior to the #421 PR: https://github.com/opensearch-project/flow-framework/actions/runs/7589890744

First of the recent failures was on the initial PR for this bug fix: https://github.com/opensearch-project/flow-framework/actions/runs/7555230555

But it passed later from that same PR: https://github.com/opensearch-project/flow-framework/actions/runs/7561855053

And then started failng again for this PR that didn't even change the code: https://github.com/opensearch-project/flow-framework/actions/runs/7577073358

So it's flaky but don't assume it's a recent change.

dbwiddis commented 10 months ago

Have investigated multiple possibilities:

Logs appear to show the root cause is authentication:

? WARN ][o.o.d.FileBasedSeedHostsProvider] [integTest-0] expected, but did not find, a dynamic hosts list at [/__w/flow-framework/flow-framework/build/testclusters/integTest-0/config/unicast_hosts.txt]
? WARN ][o.o.s.a.BackendRegistry  ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55912
? WARN ][o.o.s.a.BackendRegistry  ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55914
? WARN ][o.o.s.a.BackendRegistry  ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55922
<many more>

One big clue: tests went from all green to flaky to all failing, only on JDK21. Investigated JDK version on the failed runners:

Will try a different distro....

dbwiddis commented 10 months ago

Ran a matrix with temurin and zulu, 21.0.1 and 21.0.2. Both distros passed on 21.0.1, both distros failed on 21.0.2.

Possible bug in JDK 21.0.2. Tagging @sormuras to see if he has any idea what this could be.

Near term fix: update security integ test matrix to specify 21.0.1 version instead of just 21. @owaiskazi19 can you include this in one of your open PRs?

dbwiddis commented 10 months ago

Here's the 21.0.2 release notes: https://www.oracle.com/java/technologies/javase/21-0-2-relnotes.html

Some java.nio ones might be suspicious, but I'm way out of my depth here...

AlanBateman commented 10 months ago

@dbwiddis Would it be possible to provide instructions on how to reproduce this? It is just a matter of cloning this repo and running the tests? Asking so that folks working on the JDK can quickly see if this is a JDK bug or not. Maybe you could come to the OpenJDK net-dev mailing list with instructions so this potential regression can be diagnosed? It would also be useful to know which operating system and whether it duplicates with JDK 22 and JDK 23 EA builds.

sormuras commented 10 months ago

Skimmed the issue tracker for 21.0.2 for socket entries: https://bugs.openjdk.org/issues/?jql=project%20%3D%20JDK%20AND%20fixVersion%20%3D%2021.0.2%20AND%20text%20~%20%22socket%22 ... seems unrelated to me.

dbwiddis commented 10 months ago

@dblock @reta @nknize Just an FYI, nothing actionable yet

If you start to see this pop up elsewhere, wanted you to be aware.

We're using performRequest() via Rest High Level Client as part of our integ tests, and something broke with the latest JDK patch version (released Jan 16, has made its way to GHA runners by now), only on the security-enabled tests, where those client requests are failing auth.

reta commented 10 months ago

@dblock @reta @nknize Just an FYI, nothing actionable yet

Thanks @dbwiddis , the issue was brought up and listed as risk here https://github.com/opensearch-project/OpenSearch/issues/11906, we haven't done any updates yet in our codebase but you are very right, the GA may already use the latest version, we probably should do 2 things:

  1. Use 21.0.1 as the version (GA supports that, large change but not much we can do)
  2. Ban 21.0.2 from OpenSearch runtimes (we do have mechanism implemented, I will follow up with the issue)
dbwiddis commented 10 months ago

@owaiskazi19 @joshpalis @amitgalitz @ohltyler @jackiehanyang FYI:

JDK issue in question: https://bugs.openjdk.org/browse/JDK-8323659 (technically correct but not backwards compatible behavior)

AlanBateman commented 10 months ago

@dbwiddis Attempts to duplicate this so far have failed. Would it be possible to come to the OpenJDK net-dev mailing list? If there is a regression in 21.0.2 then we'd like to track it down.

dbwiddis commented 10 months ago

@AlanBateman we believe this is an impact of https://bugs.openjdk.org/browse/JDK-8323659

dbwiddis commented 10 months ago

@AlanBateman Failure has occurred independently from this repo in another repo's integration tests: https://github.com/opensearch-project/skills/actions/runs/7617744945/job/20747440932?pr=140

The commonality in the logs is an entry

[2024-01-22T14:59:53,046][INFO ][o.o.i.NeuralSparseSearchToolIT] [testNeuralSparseSearchToolInFlowAgent] There are still tasks running after this test that might break subsequent tests [indices:data/read/search, indices:data/read/search[phase/query], indices:data/write/bulk, indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index, indices:data/write/update, indices:data/write/update[s]].

It occurs with both fixed and scaling thread pools when the queue is invoked, which is a LinkedTransferQueue() which matches https://bugs.openjdk.org/browse/JDK-8323659

So I'm fairly certain no further debugging is required. I'll leave it up to you to assess the impact and when to release the fix.