[BUG] Index Management Plugin Integration Tests Are Failing After Core Change #1473

Closed · downsrob closed this issue 2 years ago

downsrob commented 2 years ago

Parallel issue in opensearch-build: https://github.com/opensearch-project/opensearch-build/issues/953

Describe the bug
All tests under MetadataRegressionIT fail for Index Management main with the following stack trace:

java.lang.IllegalStateException: Message not fully read (request) for requestId [77], action [cluster:monitor/nodes/stats], available [9]; resetting
        at __randomizedtesting.SeedInfo.seed([FF2514888B92F199:59182A48218AD972]:0)
        at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:215)
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:120)
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102)
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:713)
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:155)
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:130)
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:95)
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:87)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:832)

This same failure occurs locally and on the GitHub CI runner. Additionally, it occurs in any Index Management tests which extend OpenSearchIntegTestCase; currently MetadataRegressionIT is the only class which does, which is why the failure is isolated to it. The tests themselves behave normally but then fail during cleanup. Here is a failing build run.

To Reproduce
Steps to reproduce the behavior: on Index Management main, run:

./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.*"

Expected behavior
The tests should pass and clean up without issue.

Additional context
Currently, all Index Management integration tests which extend OpenSearchIntegTestCase are failing after this shard indexing pressure PR was backported to 1.x. Initial debugging suggested that our tests were not always initializing the cluster service and that the shard indexing pressure change had an implicit dependency on that behavior. A fix was introduced to remove the dependency by including the shard indexing pressure stats whenever the version is on or after 1.2; however, this change did not fix the test failures. The failure can be tracked down to the afterInternal function called from cleanUpCluster at the end of OpenSearchIntegTestCase.java. That function calls cluster().assertAfterTest(), which, because our integration tests use the external test cluster, triggers ensureEstimatedStats() in ExternalTestCluster.java. When the node stats are requested there to confirm that the cluster cleaned up successfully, the request triggers the streaming error that is failing the tests.
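
For reference, here is a minimal sketch of what that cleanup check boils down to (a paraphrase of the behavior described above, assuming a test Client handle named client; not the verbatim test-framework source):

import org.opensearch.action.admin.cluster.node.stats.NodesStatsResponse;
import org.opensearch.client.Client;

public class NodeStatsCheckSketch {

    // Roughly the call ensureEstimatedStats() makes after each test:
    // request stats from every node and deserialize the response off
    // the transport stream. Reading that NodesStatsResponse is where the
    // "Message not fully read" / EOFException failures surface.
    static NodesStatsResponse requestNodeStats(Client client) {
        return client.admin().cluster().prepareNodesStats().get();
    }
}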

Removing the additions to the NodeStats stream in NodeStats.java and CommonStatsFlags.java from this recent PR enables these tests to pass.

Changing the test distribution in the build.gradle file to INTEG_TEST instead of ARCHIVE also enables the tests to pass.
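
For illustration, that change would look roughly like the snippet below (a sketch assuming the standard opensearch.testclusters Gradle DSL; the cluster name may differ per plugin):

testClusters {
    integTest {
        // ARCHIVE resolves the full distribution artifact from the store;
        // INTEG_TEST uses the minimal, freshly built test distribution.
        testDistribution = 'INTEG_TEST'
    }
}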

After running the integration tests in Index Management, if I look through the opensearch-1.2.0-SNAPSHOT.jar in build/testclusters/integTest-0/distro/1.2.0-Archive/lib, I do not see the NodeStats and CommonStatsFlags changes, so an out-of-date OpenSearch 1.2.0 archive snapshot may be causing a discrepancy in the NodeStats API signature.

adnapibar commented 2 years ago

@downsrob is this still an issue? I tried running the tests and everything's passing, can you confirm?

adnapibar commented 2 years ago

> @downsrob is this still an issue? I tried running the tests and everything's passing, can you confirm?

The tests are currently suppressed with awaitsFix to get CI working. The tests still fail otherwise.

adnapibar commented 2 years ago

Stack trace of one of the failing tests

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test move metadata service" -Dtests.seed=E194FF25655D5338 -Dtests.security.manager=false -Dtests.locale=ga-IE -Dtests.timezone=BET -Druntime.java=14

org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT > test move metadata service FAILED
    TransportSerializationException[Failed to deserialize response from handler [org.opensearch.transport.TransportService$ContextRestoreResponseHandler/org.opensearch.action.ActionListenerResponseHandler@14f300dd/org.opensearch.client.transport.TransportClientNodesService$RetryListener@58571732]]; nested: EOFException;
        at __randomizedtesting.SeedInfo.seed([E194FF25655D5338:43E7FA64742871E1]:0)
        at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:301)
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:154)
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:108)
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:759)
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:170)
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:145)
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110)
        at org.opensearch.transport.nio.MockNioTransport$MockTcpReadWriteHandler.consumeReads(MockNioTransport.java:341)
        at org.opensearch.nio.SocketChannelContext.handleReadBytes(SocketChannelContext.java:246)
        at org.opensearch.nio.BytesChannelContext.read(BytesChannelContext.java:59)
        at org.opensearch.nio.EventHandler.handleRead(EventHandler.java:152)
        at org.opensearch.transport.nio.TestEventHandler.handleRead(TestEventHandler.java:167)
        at org.opensearch.nio.NioSelector.handleRead(NioSelector.java:438)
        at org.opensearch.nio.NioSelector.processKey(NioSelector.java:264)
        at org.opensearch.nio.NioSelector.singleLoop(NioSelector.java:191)
        at org.opensearch.nio.NioSelector.runLoop(NioSelector.java:148)
        at java.base/java.lang.Thread.run(Thread.java:832)

        Caused by:
        java.io.EOFException
            at org.opensearch.common.io.stream.InputStreamStreamInput.readByte(InputStreamStreamInput.java:71)
            at org.opensearch.common.io.stream.FilterStreamInput.readByte(FilterStreamInput.java:53)
            at org.opensearch.common.io.stream.StreamInput.readVInt(StreamInput.java:223)
            at org.opensearch.common.io.stream.StreamInput.readArraySize(StreamInput.java:1298)
            at org.opensearch.common.io.stream.StreamInput.readCollection(StreamInput.java:1227)
            at org.opensearch.common.io.stream.StreamInput.readList(StreamInput.java:1184)
            at org.opensearch.action.support.nodes.BaseNodesResponse.<init>(BaseNodesResponse.java:58)
            at org.opensearch.action.admin.cluster.node.stats.NodesStatsResponse.<init>(NodesStatsResponse.java:51)
            at org.opensearch.action.ActionListenerResponseHandler.read(ActionListenerResponseHandler.java:82)
            at org.opensearch.action.ActionListenerResponseHandler.read(ActionListenerResponseHandler.java:49)
            at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1338)
            at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.read(TransportService.java:1325)
            at org.opensearch.transport.InboundHandler.handleResponse(InboundHandler.java:298)
            ... 16 more
getsaurabh02 commented 2 years ago

The Node Stats request parameters are preserved under the version check, so this should not directly be the cause of the issue. These serialization constructs are common practice and are already well tested by the OpenSearch integration and backward compatibility tests, which ensure safe node communication for both homogeneous and heterogeneous clusters. Here it seems to be related to the test setup.

The error shows an exception while parsing the NodesStatsResponse (not the request). It throws an EOFException because the node stats response is not as complete as the receiver expects: some fields, essentially the new stats fields, are missing from the stream. Wondering if this is related to the problem stated by @downsrob in the description:

> so it may be the case that an out of date Opensearch 1.2.0 archive snapshot could be causing a discrepancy in the NodeStats API signature.
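
To make the version-check discussion concrete, here is a minimal sketch of the version-gated serialization pattern involved (hypothetical class and field names, not the actual NodeStats or CommonStatsFlags code):

import java.io.IOException;

import org.opensearch.Version;
import org.opensearch.common.io.stream.StreamInput;
import org.opensearch.common.io.stream.StreamOutput;
import org.opensearch.common.io.stream.Writeable;

// Hypothetical stats section showing the version-gated pattern under discussion.
public class ExampleStats implements Writeable {

    private final long baseCounter;       // present in all 1.x versions
    private final Long newPressureStat;   // added behind a 1.2.0 gate

    public ExampleStats(long baseCounter, Long newPressureStat) {
        this.baseCounter = baseCounter;
        this.newPressureStat = newPressureStat;
    }

    public ExampleStats(StreamInput in) throws IOException {
        baseCounter = in.readVLong();
        if (in.getVersion().onOrAfter(Version.V_1_2_0)) {
            // A jar built before this gate was added has no such branch,
            // so these bytes stay unread and the transport layer reports
            // "Message not fully read" or an EOFException.
            newPressureStat = in.readOptionalVLong();
        } else {
            newPressureStat = null;
        }
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        out.writeVLong(baseCounter);
        if (out.getVersion().onOrAfter(Version.V_1_2_0)) {
            out.writeOptionalVLong(newPressureStat);
        }
    }
}
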
anasalkouz commented 2 years ago

Hi @downsrob, do you have any update on the test failures? Is this a blocker for the 1.2 release?

downsrob commented 2 years ago

@anasalkouz These tests are still failing, though we have added the AwaitsFix annotation to enable our CI to pass. At this point it seems like a test failure only, and is not a blocker for 1.2.

nknize commented 2 years ago

> Wondering if this is related to the problem as stated by @downsrob in the description:

I believe SNAPSHOT binaries are built every evening? (@dblock or @peterzhuamazon can confirm.) I also think the plugin picks up the latest snapshot when running the tests. This is definitely a serialization mismatch somewhere in the transport layer. If it's the result of a PR not being backported, we need to find that PR and backport it. Once the fix is merged, the plugin tests should pass after the new SNAPSHOT binaries are built.

> is this a blocker for 1.2 release?

Absolutely. It's a shame it wasn't labeled as such instead of just muting the test 12 days ago and not following up to fix the issue.

downsrob commented 2 years ago

This error only seems to occur when the test distribution is set to ARCHIVE and the gradle task is integTest. If the test distribution is set to INTEG_TEST, then the normal snapshots are used and the tests pass.

I have been able to replicate this error on multiple plugins: Index Management, Alerting, and Anomaly Detection. Here is the code for an empty test to replicate it:

import org.opensearch.test.OpenSearchIntegTestCase;

public class EmptyTestsIT extends OpenSearchIntegTestCase {

    public void testEmpty() {
        System.out.println("This does print, we fail in cleanup");
    }
}

Testing it with "./gradlew integTest --tests " will result in the same failure as documented above. The failure occurs during OpenSearchIntegTestCase cleanup, which is why an empty test can trigger it.

However, if the test file is renamed to just EmptyTests and run with the 'test' task rather than the 'integTest' task, meaning "./gradlew test --tests ", then it passes. So the failure seems to be an interaction between the test setup for the integTest task and the OpenSearchIntegTestCase cleanup running against the ExternalTestCluster, specifically with the archive test distribution.

getsaurabh02 commented 2 years ago

Thanks for sharing the details @downsrob. I was, however, able to successfully run the empty test case below, as you described, in the OpenSearch code base. It did not run into any issues related to the cleanup. Other ITs and bwc tests in OpenSearch which depend on the abstract offerings of OpenSearchIntegTestCase are also running through. So I am still wondering what is specific to the plugin setup that results in the above cleanup error for empty tests.

package org.opensearch.index;

import org.opensearch.test.OpenSearchIntegTestCase;

public class EmptyTestIT extends OpenSearchIntegTestCase {

    public void testEmpty() {
        System.out.println("This does print, we fail in cleanup");
    }
}

Just to understand the difference between the distribution being set to ARCHIVE (where it fails) and INTEG_TEST (where it succeeds): for INTEG_TEST, the normal snapshots are used. Does "normal" here mean current or local, while the archive comes from the store? I am trying to isolate the problem to the specific setup where we are seeing this, to help identify the cause.

adnapibar commented 2 years ago

Instead of an EOF, I'm now seeing a "message not fully read" error.

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test move metadata service" -Dtests.seed=5C0621489E2AEF86 -Dtests.security.manager=false -Dtests.locale=ko -Dtests.timezone=CNT -Druntime.java=14

org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT > test move metadata service FAILED
    java.lang.IllegalStateException: Message not fully read (request) for requestId [77], action [cluster:monitor/nodes/stats], available [9]; resetting
        at __randomizedtesting.SeedInfo.seed([5C0621489E2AEF86:FE7524098F5FCD5F]:0)
        at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:215)
        at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:120)
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:102)
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:713)
        at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:155)
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:130)
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:95)
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:87)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:271)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:719)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:620)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:583)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:834)

adnapibar commented 2 years ago

@downsrob and I looked into the distribution that is being downloaded for integration tests and it is out of date. The deserialization failure is due to the new fields missing.

downsrob commented 2 years ago

@getsaurabh02 A lot of this is guessing, but from what I have pieced together, when the test distribution is set to ARCHIVE it uses a different artifact from the store. These artifacts have additional modules installed which plugins need. They can also be used for backwards compatibility testing, so they are meant to be stable: we don't build them nightly, but maintain them for future backwards compatibility tests. For plugins, when we use the archive test distribution on version 1.2, I would guess that we get an out-of-date distribution which is still technically version 1.2. The additional fields added in the breaking PR are conditionally serialized when the stream version is on or after 1.2, so the new fields would be written, but the out-of-date archive would not be able to read them.
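
A hedged illustration of why the version gate does not help here (a hypothetical snippet building on the ExampleStats sketch earlier in this thread): the gate keys on the negotiated wire version, which both sides report as 1.2.0, so the writer includes the new field even though the stale 1.2.0 archive jar predates the matching read code.

import java.io.IOException;

import org.opensearch.Version;
import org.opensearch.common.io.stream.StreamOutput;

// Both nodes advertise 1.2.0, so the writer's version gate passes and the
// gated field goes onto the wire; the stale archive's reader, built before
// the field existed, never consumes those bytes, which the transport layer
// then reports as "Message not fully read".
final class StaleArchiveMismatchSketch {

    static void writeToStaleArchiveNode(StreamOutput out) throws IOException {
        assert out.getVersion().onOrAfter(Version.V_1_2_0); // negotiated wire version
        new ExampleStats(1L, 42L).writeTo(out);             // writes the gated field
    }
}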

peterzhuamazon commented 2 years ago

This is now fixed by https://github.com/opensearch-project/opensearch-infra/pull/131. Thanks.

adnapibar commented 2 years ago

Closing this issue as no fix is required in OpenSearch.

dblock commented 2 years ago

How do we prevent this problem in the future? Distribution downloader could maybe throw warnings on snapshot builds for artifacts older than X?
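
One sketch of the kind of guard this suggests (entirely hypothetical; not an existing OpenSearch build-tools API): compare the downloaded snapshot artifact's timestamp against a maximum age and warn when it is stale.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical staleness guard for downloaded SNAPSHOT distributions.
final class SnapshotStalenessCheck {

    private static final Duration MAX_AGE = Duration.ofDays(2); // assumed threshold ("X")

    static void warnIfStale(Path artifact) throws IOException {
        Instant modified = Files.getLastModifiedTime(artifact).toInstant();
        if (Instant.now().minus(MAX_AGE).isAfter(modified)) {
            System.err.printf(
                "WARNING: SNAPSHOT artifact %s is older than %s; "
                    + "it may not match the current branch.%n",
                artifact, MAX_AGE);
        }
    }
}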