opensearch-project / opensearch-sdk-py

OpenSearch Extensions SDK for Python.
https://opensearch.org/blog/introducing-extensions-for-opensearch/
Apache License 2.0
8 stars 7 forks source link

[BUG] Unable to register extension with OpenSearch version 2.15.0 #166

Closed mzainab24 closed 23 hours ago

mzainab24 commented 4 days ago

What is the bug?

I have a running distribution of OpenSearch version 2.15.0 which was installed using the method on the following link. https://opensearch.org/docs/latest/install-and-configure/install-opensearch/tar/ I enabled the experimental extension feature in the opensearch.yml file and then run the distribution. Then while trying to register the extension, it gave me version incompatibility error so I changed the versions in hello.json extension sample file.

Screenshot 2024-07-01 134227

After that, when I again tried to register the extension, it gave the following errors on the running OpenSearch and the registration was unsuccessful, as I could not get a response from the sample extension.

java.lang.IllegalStateException: Received message from unsupported version: [7.9.99] minimal compatible version is: [7.10.0]
        at org.opensearch.transport.TransportHandshaker$HandshakeResponseHandler.handleResponse(TransportHandshaker.java:189)
        at org.opensearch.transport.TransportHandshaker$HandshakeResponseHandler.handleResponse(TransportHandshaker.java:161)
        at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:427)
        at org.opensearch.transport.NativeMessageHandler.handleResponse(NativeMessageHandler.java:419)
        at org.opensearch.transport.NativeMessageHandler.handleMessage(NativeMessageHandler.java:174)
        at org.opensearch.transport.NativeMessageHandler.messageReceived(NativeMessageHandler.java:126)
        at org.opensearch.transport.InboundHandler.messageReceivedFromPipeline(InboundHandler.java:121)
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:113)
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:800)
        at org.opensearch.transport.nativeprotocol.NativeInboundBytesHandler.forwardFragments(NativeInboundBytesHandler.java:157)
        at org.opensearch.transport.nativeprotocol.NativeInboundBytesHandler.doHandleBytes(NativeInboundBytesHandler.java:94)
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:143)
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:119)
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Thread.java:1583) 

How can one reproduce the bug?

  1. Install OpenSearch as mentioned in the following link. https://opensearch.org/docs/latest/install-and-configure/install-opensearch/tar/
  2. Open opensearch.yml. Search for opensearch.experimental.feature.extensions.enabled, uncomment it, and set it to true.
  3. Run the opensearch binary.
  4. Run the sample hello extension as explained in this guide. https://github.com/opensearch-project/opensearch-sdk-py/tree/main/samples/hello
  5. Register the extension.

What is the expected behavior?

Error logs on OpenSearch.

java.lang.IllegalStateException: Received message from unsupported version: [7.9.99] minimal compatible version is: [7.10.0]
        at org.opensearch.transport.TransportHandshaker$HandshakeResponseHandler.handleResponse(TransportHandshaker.java:189)
        at org.opensearch.transport.TransportHandshaker$HandshakeResponseHandler.handleResponse(TransportHandshaker.java:161)
        at org.opensearch.transport.NativeMessageHandler.doHandleResponse(NativeMessageHandler.java:427)
        at org.opensearch.transport.NativeMessageHandler.handleResponse(NativeMessageHandler.java:419)
        at org.opensearch.transport.NativeMessageHandler.handleMessage(NativeMessageHandler.java:174)
        at org.opensearch.transport.NativeMessageHandler.messageReceived(NativeMessageHandler.java:126)
        at org.opensearch.transport.InboundHandler.messageReceivedFromPipeline(InboundHandler.java:121)
        at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:113)
        at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:800)
        at org.opensearch.transport.nativeprotocol.NativeInboundBytesHandler.forwardFragments(NativeInboundBytesHandler.java:157)
        at org.opensearch.transport.nativeprotocol.NativeInboundBytesHandler.doHandleBytes(NativeInboundBytesHandler.java:94)
        at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:143)
        at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:119)
        at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1407)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:918)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:994)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.base/java.lang.Thread.run(Thread.java:1583) 

What is your host/environment?

Distributor ID: Ubuntu Description: Ubuntu 22.04.4 LTS Release: 22.04 Architecture: x86-64

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

Add any other context about the problem.

dbwiddis commented 4 days ago

Hey @mzainab24 thanks for this report. It's been a while since I mucked with this code, but it looks like the transport handshakes use the 3.0.0 version: https://github.com/opensearch-project/opensearch-sdk-py/blob/29d98e603fc32b48fcc7b5fba0e091bce2b9f12e/src/opensearch_sdk_py/transport/version.py#L11

I took a quick look at the Version Class in OpenSearch and it's a bit confusing to me how all that math works out.

But I'd suggest as a first try, set that Version constant in this extension to 2150000 and see if that resolves the issue.

dbwiddis commented 4 days ago

OK, spent a little bit more time looking into this. I think this is indeed a bug, specifically: this extension uses the same version in the response as it received in the request: https://github.com/opensearch-project/opensearch-sdk-py/blob/29d98e603fc32b48fcc7b5fba0e091bce2b9f12e/src/opensearch_sdk_py/actions/internal/transport_handshake_request_handler.py#L49

The version being sent, however, is a "minCompatVersion" rather than the current one.

            } else if (version.onOrAfter(Version.V_2_0_0)) {
                minCompatVersion = Version.fromId(7099999);
            }
            handshakeRequestSender.sendRequest(node, channel, requestId, minCompatVersion);

This runs afoul of 2.x version checking which requires 7.10.0 minimum. That version checking happened to work on 3.x but doesn't work on 2.x.

TLDR, we should not use the inbound "min compatibility" in our response.

I think if we pass None for the version in the above request handler code, it will default to the current version (3.0.0) and should work.

dblock commented 4 days ago

@dbwiddis Isn't there an easier workaround for @mzainab24 in the extension itself?

dbwiddis commented 4 days ago

Isn't there an easier workaround for @mzainab24 in the extension itself?

Nothing really configurable, this is our hardcoded TCP handshake. We just sent back the same version which worked on 3.x so I never really dug into why it worked and assumed it would. It's clearly not the same thing.

I believe the bugfix is probably to change this line to something else, probably Version.CURRENT https://github.com/opensearch-project/opensearch-sdk-py/blob/29d98e603fc32b48fcc7b5fba0e091bce2b9f12e/src/opensearch_sdk_py/actions/internal/transport_handshake_request_handler.py#L49

I haven't had any time to set it up to test yet.

mzainab24 commented 3 days ago

@dbwiddis and @dblock thank you for the insights. I am currently using the opensearch-sdk-py as a Python library and building my extension on top of it. Could I inquire if there are any upcoming commits that might address this issue?

Another thing to mention is that the extension registration works perfectly fine when I clone the whole code of OpenSearch, the latest version 2.15.0 from https://github.com/opensearch-project/OpenSearch.git, and in that way I do not have to change the below version in the json file. "opensearchVersion":"3.0.0", "minimumCompatibleVersion":"3.0.0" to 2.15.0 and it works. But when I install OpenSearch from a tarball, using the Advanced Packaging Tool (APT) package manager, or using docker to configure and manage OpenSearch, then I receive the above mentioned error. The OpenSearch version numbers differ for all of the above: Clone the code from github: version number = 3.0.0-SNAPSHOT tar, apt and docker = 2.15.0 I want to avoid cloning the code and running the distribution from it to simplify my deployments.

dbwiddis commented 3 days ago

@mzainab24 I can certainly make a PR to change that version number, but as mentioned I haven't had time to test it so it's an informed guess that it may fix the problem. I'll make that commit now.

dbwiddis commented 1 day ago

@dblock @mzainab24 FYI, I just spent some time digging through all the related Java code (again) and it's a bit more complex. The proposed PR fix might work as a quick unblocker but is not the right long term fix, which I'm still trying to wrap my head around.

Key point is that there are two different version numbers being exchanged in these handshakes.

Here's the inbound bytes from the handshake header from a 3.0.0 cluster:

#b'ES\x00\x00\x001\x00\x00\x00\x00\x00\x00\x00\x08\x08\x08 \x0b\x83\x00\x00\x00\x1a\x00\x00\x00\x16internal:tcp/handshake\x00\x04\xa3\x8e\xb7A'

It's easier to read with Netty debug. Here's just the header part.

         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 45 53 00 00 00 31 00 00 00 00 00 00 00 08 08 08 |ES...1..........|
|00000010| 20 0b 83 00 00 00 1a                            |......          |
+--------+-------------------------------------------------+----------------+

The 08 20 0b 83 starting at byte 15, after stripping the mask bit, is 0x200b83 or 2100099. This is the "minimum compatibility version". The handshake was sent from a 3.x cluster, so that was the latest minor version (2.10.0) at the time that was logged.

Here's the full handshake:

         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 45 53 00 00 00 31 00 00 00 00 00 00 00 08 08 08 |ES...1..........|
|00000010| 20 0b 83 00 00 00 1a 00 00 00 16 69 6e 74 65 72 | ..........inter|
|00000020| 6e 61 6c 3a 74 63 70 2f 68 61 6e 64 73 68 61 6b |nal:tcp/handshak|
|00000030| 65 00 04 a3 8e b7 41                            |e.....A         |
+--------+-------------------------------------------------+----------------+

The a3 8e b7 41 at the end is OpenSearch Vint type, so is read in little-endian order 7 bits at a time. After stripping the mask bit, it's 0x2DC723 or 3000099. That's the version of OpenSearch sending the request.

As an extension we are not supposed to care about compatibility, and since the transport layer hasn't changed, we don't have to.

So we can send back the same version (which we tried to do) as long as we distinguish between the two types.

The bug here is that we were sending back the minimum compatible version in both fields. This happened to work on 3.x because the minimum compatible version being sent was the lastest 2.x version (2.10.0) in this case so the extension was responding "Hey, I'm running version 2.10.0 and that's my minimum compatible!" Except when we connect from a 2.x cluster, the min compat is (ES) 7.9 and we're reporting that as our minimum compatible (OK) and our current version (not OK, as OpenSearch 2.x is only compatible with the latest (7.10) previous minor version.

TLDR: the PR fix as it is, is wrong.

  1. The version argument to OutboundMessageRequest should be the min compat version (which in the current code is ES 7.9 and it's ok to leave it like that)
  2. The message argument to OutboundMessageRequest is a TransportHandshakerHandshakeRequest which should contain the current version, or the current version that sent the request, either will work.

So the only change needed is

            TransportHandshakerHandshakeResponse(Version(Version.CURRENT)),

but we can also hardcode the min compat version to 7.9 if we want, which I think I'll do...

I'll build in some more tests and docs this weekend.