Node fails to perform data write on replica shards while upgrading to OS 2.12

Dhruvan1217 commented 6 months ago

What is the bug?

While upgrading the Opensearch cluster from 1.3.9 to 2.12, the shards are failing to recover. The data is unable to replicate to the replicas with following exception in the logs.

NOTE: This issue was also showing up in 2.11.x and was expected to be fixed in 2.12.0, Reference: https://github.com/opensearch-project/OpenSearch/issues/11491

In an on-going in-place upgrade (where certain nodes are on 1.3 and other nodes on 2.12), I have noticed this happening when the primary shard assigned on new/upgraded node is trying to replicate the data to the replica assigned to other upgraded node.

[2024-02-28T06:56:09,731][WARN ][org.opensearch.action.bulk.TransportShardBulkAction] [[Index_x_y_z][5]] failed to perform indices:data/write/bulk[s] on replica [Index_x_y_z][5], node[WGf0Uf3fSBC9Cvu8a0OwFg], [R], s[STARTED], a[id=tkTI2NBcQGKedZRRAi6_Hg]
org.opensearch.transport.RemoteTransportException: [sysdigcloud-elasticsearch-1][p.q.r.s:9300][indices:data/write/bulk[s][r]]
Caused by: org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: -84 is not a valid id
    at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:136) ~[?:?]
    at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]
    at org.opensearch.security.transport.SecurityRequestHandler.messageReceivedDecorate(SecurityRequestHandler.java:187) ~[?:?]
    at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:154) ~[?:?]
    at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:795) ~[?:?]
    at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.12.0.jar:2.12.0]
    at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.12.0.jar:2.12.0]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.106.Final.jar:4.1.106.Final]
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.106.Final.jar:4.1.106.Final]

How can one reproduce the bug? Steps to reproduce the behavior: 1) Setup a 1.3.x cluster (consider 3 node cluster) 2) Ingest some data to indices and let the data continue to flow in (possibly also create new indices in-between node restarts) 3) Set shard allocation to primaries 4) Start in-place upgrade to 2.12.0 by upgrading the nodes one by one 5) Reboot the first node, once it is initialized, let the cluster turn green (by setting allocation to all) before restarting the following nodes. 6) Set allocation to primaries again and Reboot the second node. 7) As soon as the second node is initialised and then the allocation is set to all, if a replica shard comes to this node for which the primary shard is on the first node upgraded, you will see errors in the logs when the first node tries to write/replicate the data to the replica shard.

What is the expected behavior? The data write to shard replicas should be successful.

What is your host/environment?

OS: AMD64

Plugins

$ /usr/share/opensearch/bin/opensearch-plugin list
opensearch-index-management
opensearch-job-scheduler
opensearch-security
repository-gcs
repository-s3

Dhruvan1217 commented 6 months ago

@cwperks Tagging you if you can take a look at this further https://github.com/opensearch-project/OpenSearch/issues/11491

peternied commented 6 months ago

This issue is in performance-analyzer, transferring to that repo

cwperks commented 6 months ago

When the second node is being rebooted, could there be an inflight transport request that gets resumed when the second node is brought back up as a 2.12 node?

I'd have to dig into it further, but this line determines what serialization to use when sending a transport request. It could be the case that the first node that was replaced with a 2.12 node is in the middle of sending a request to the second 1.3 node that is rebooted before replying back to the 2.12 node. When the node comes back online, could it be that the transport request is being replayed?

Dhruvan1217 commented 6 months ago

@peternied We don't have PA enabled, this is happening without that plugin and this seems to be generic security module problem. We should move it back to security repo. Thanks

peternied commented 6 months ago

@opensearch-project/admin Please transfer this issue to the security repo.

stephen-crawford commented 6 months ago

[Triage] Hi @Dhruvan1217, thanks for filing this issue and providing detailed reproduction steps. Someone will take a look and see what the problem is and note steps forward.

Dhruvan1217 commented 6 months ago

Also AFAIU, there were no issues in Opensearch 2.10, so maybe we can take a look at what was changed there after (I believe introduction of custom serializartion/de-serialization)

opensearch-project / security

Node fails to perform data write on replica shards while upgrading to OS 2.12 #4085