Closed. @Jakob3xD closed this issue 1 month ago.
Previous issue linked to https://github.com/opensearch-project/security/issues/3771 which was supposed to have been fixed, but there's at least one other report of this error in 2.14 here: https://github.com/opensearch-project/security/issues/3771#issuecomment-2136941571 CC: @cwperks
Possibly related to https://github.com/opensearch-project/performance-analyzer/issues/606
Still-open PR to fix: https://github.com/opensearch-project/performance-analyzer/pull/609
I'm seeing the same issue upgrading from 2.12 to 2.14
@Jakob3xD It would be helpful for debugging to understand the setup that produced this error. What plugins or features are enabled?
When this error was seen on upgrade from < 2.11 to 2.11, it was because PerformanceAnalyzer had wrapped an instance of a TransportChannel and was not delegating getVersion()
implementation to the wrapped channel. I opened a PR in PA to fix the wrapping issue, but also opened a different PR in the security repo that fixed the issue there without requiring the PA PR to be merged (though the PA PR should still merge, since it's the strategic fix).
Backwards compatibility for serialization requires the transmitting or receiving node of a transport request to know the correct version of the target node: the node the request is being sent to (on the transmitter end), or the node that sent the request (on the receiver end).
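The wrapping bug described above lends itself to a small illustration. This is a hypothetical sketch, not the real OpenSearch TransportChannel API: a delegating wrapper that forgets to forward getVersion() reports its own default instead of the wrapped channel's version, which breaks any version-gated serialization decision downstream.

```java
// Hypothetical sketch (not the real OpenSearch API) of the PA wrapping bug.
public class ChannelWrapperDemo {
    interface VersionedChannel {
        String getVersion();
    }

    // The underlying channel knows the true peer version.
    static class RealChannel implements VersionedChannel {
        @Override public String getVersion() { return "2.12.0"; }
    }

    // Buggy wrapper: reports its own default instead of delegating.
    static class BuggyWrapper implements VersionedChannel {
        final VersionedChannel inner;
        BuggyWrapper(VersionedChannel inner) { this.inner = inner; }
        @Override public String getVersion() { return "2.14.0"; } // wrong: ignores inner
    }

    // Fixed wrapper: forwards the call to the wrapped channel.
    static class FixedWrapper implements VersionedChannel {
        final VersionedChannel inner;
        FixedWrapper(VersionedChannel inner) { this.inner = inner; }
        @Override public String getVersion() { return inner.getVersion(); }
    }

    public static void main(String[] args) {
        VersionedChannel real = new RealChannel();
        System.out.println(new BuggyWrapper(real).getVersion()); // 2.14.0
        System.out.println(new FixedWrapper(real).getVersion()); // 2.12.0
    }
}
```

Any caller that version-gates serialization against the buggy wrapper sees the wrong version and may pick the wrong wire format.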
It would be helpful to know the concrete className for connection
here: https://github.com/opensearch-project/security/blob/9caf5cbaa90c2ccbaf8b0bf00ee8bb1fe9326919/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java#L155
It would also be helpful to know the concrete className for channel
here: https://github.com/opensearch-project/security/blob/9caf5cbaa90c2ccbaf8b0bf00ee8bb1fe9326919/src/main/java/org/opensearch/security/ssl/transport/SecuritySSLRequestHandler.java#L96
We're also seeing this upgrading from 2.12.0 to 2.14.0. We have the PA plugin installed, but it is disabled via the env variable DISABLE_PERFORMANCE_ANALYZER_AGENT_CLI=true. Note, @Jakob3xD didn't seem to have the opensearch-performance-analyzer plugin but still had this error.
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
repository-s3
PA is disabled at startup:
Enabling OpenSearch Security Plugin
Disabling execution of install_demo_configuration.sh for OpenSearch Security Plugin
Disabling execution of /usr/share/opensearch/bin/opensearch-performance-analyzer/performance-analyzer-agent-cli for OpenSearch Performance Analyzer Plugin
The error does reference performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived:
{"type": "console", "timestamp": "2024-06-24T23:42:24,719Z", "level": "WARN", "component": "o.o.i.e.Engine", "cluster.name": "some-cluster", "node.name": "node-1", "message": " [some-index-2024-06-24][1] failed engine [primary shard [[some-index-2024-06-24][1], node[asdf], [P], s[STARTED], a[id=fdsa]] was demoted while failing replica shard]", "cluster.uuid": "ffff-A", "node.id": "asdf" ,
"stacktrace": ["org.opensearch.OpenSearchException: java.lang.IllegalArgumentException: -84 is not a valid id",
"at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:136) ~[?:?]",
"at org.opensearch.security.support.Base64Helper.deserializeObject(Base64Helper.java:46) ~[?:?]",
"at org.opensearch.security.auditlog.impl.AbstractAuditLog.getUser(AbstractAuditLog.java:775) ~[?:?]",
"at org.opensearch.security.auditlog.impl.AbstractAuditLog.logMissingPrivileges(AbstractAuditLog.java:209) ~[?:?]",
"at org.opensearch.security.auditlog.impl.AuditLogImpl.logMissingPrivileges(AuditLogImpl.java:164) ~[?:?]",
"at org.opensearch.security.auditlog.AuditLogSslExceptionHandler.logError(AuditLogSslExceptionHandler.java:72) ~[?:?]",
"at org.opensearch.security.ssl.transport.SecuritySSLRequestHandler.messageReceived(SecuritySSLRequestHandler.java:170) ~[?:?]",
"at org.opensearch.security.OpenSearchSecurityPlugin$6$1.messageReceived(OpenSearchSecurityPlugin.java:795) ~[?:?]",
"at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptor$interceptHandler$1.messageReceived(RollupInterceptor.kt:114) ~[?:?]",
"at org.opensearch.performanceanalyzer.transport.PerformanceAnalyzerTransportRequestHandler.messageReceived(PerformanceAnalyzerTransportRequestHandler.java:43) ~[?:?]",
"at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundHandler.handleRequest(InboundHandler.java:271) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundHandler.messageReceived(InboundHandler.java:144) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundHandler.inboundMessage(InboundHandler.java:127) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.TcpTransport.inboundMessage(TcpTransport.java:770) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:175) ~[opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:150) [opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:115) [opensearch-2.14.0.jar:2.14.0]",
"at org.opensearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:95) [transport-netty4-client-2.14.0.jar:2.14.0]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) ~[netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387) [netty-handler-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529) [netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468) [netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290) [netty-codec-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562) [netty-transport-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997) [netty-common-4.1.109.Final.jar:4.1.109.Final]",
"at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.109.Final.jar:4.1.109.Final]",
"at java.lang.Thread.run(Thread.java:1583) [?:?]",
"Caused by: java.lang.IllegalArgumentException: -84 is not a valid id",
"at org.opensearch.security.support.Base64CustomHelper$CustomSerializationFormat.fromId(Base64CustomHelper.java:63) ~[?:?]",
"at org.opensearch.security.support.Base64CustomHelper.deserializeObject(Base64CustomHelper.java:117) ~[?:?]",
"... 50 more"] }
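A likely reading of the "-84 is not a valid id" message: the first byte of any JDK-serialized stream is the high byte of the STREAM_MAGIC constant 0xACED, and 0xAC is -84 as a signed Java byte. So the receiver appears to be feeding JDK-serialized bytes into the custom-format deserializer, which interprets the first byte as a format id. A minimal demonstration (plain JDK, no OpenSearch code):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// The first byte of any JDK object-serialization stream is 0xAC (the high
// byte of STREAM_MAGIC 0xACED), which is -84 as a signed Java byte -- the
// exact "id" rejected by Base64CustomHelper in the stack trace above.
public class StreamMagicDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject("some-user"); // any serializable payload
        }
        byte firstByte = bos.toByteArray()[0];
        System.out.println(firstByte); // -84
    }
}
```

This strongly suggests the payload was written with JDK serialization but decoded with the custom format.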
@Jakob3xD It would be helpful for debugging to understand the setup that produced this error. What plugins or features are enabled?
The plugins are listed in the issue. PA is completely removed during the image build process via /usr/share/opensearch/bin/opensearch-plugin remove opensearch-performance-analyzer. The only difference to my other clusters is the cluster configuration of remote clusters and the configuration of an awareness attribute.
To give more details on how I encountered the bug: I upgraded all my remote clusters from 2.13 to 2.14 without any issues, and afterwards I wanted to upgrade the main clusters, where all the remote targets are configured. The first two node reboots worked without any issue, but after the third node, things started to fail with the exception named above. I fixed this like the last time, when the upgrade to 2.11 failed: by doing a full cluster upgrade instead of a rolling upgrade. After shutting the cluster down and starting it with 2.14, the exceptions were gone.
It would be helpful to know the concrete className for connection here: https://github.com/opensearch-project/security/blob/9caf5cbaa90c2ccbaf8b0bf00ee8bb1fe9326919/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java#L155
It would also be helpful to know the concrete className for channel here: https://github.com/opensearch-project/security/blob/9caf5cbaa90c2ccbaf8b0bf00ee8bb1fe9326919/src/main/java/org/opensearch/security/ssl/transport/SecuritySSLRequestHandler.java#L96
Not sure how I would get those? I am not familiar with Java.
Based on the stack trace it appears the issue is this:
2.12 node ------- transport request ------> 2.14 node
When the 2.12 node serializes, it opts for JDK serialization, but the 2.14 node tries to deserialize it as custom.
The 2.12 node should be opting to use custom serialization since the target node is >= 2.11: https://github.com/opensearch-project/security/blob/2.12/src/main/java/org/opensearch/security/transport/SecurityInterceptor.java#L153
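The two version gates can be sketched side by side (hypothetical names and integer version ids; the real checks live in SecurityInterceptor and SerializationFormat on the respective branches). In a mixed cluster the two rules disagree, which is exactly the mismatch described above:

```java
// Hypothetical sketch of the two version-gating rules on either side of
// the upgrade boundary. Version ids are modeled as major*10000 + minor*100.
public class FormatChoiceDemo {
    enum Format { JDK, CUSTOM }

    static final int V_2_11_0 = 2_11_00;
    static final int V_2_14_0 = 2_14_00;

    // Rule used by 2.11-2.13 nodes: CUSTOM whenever the peer is >= 2.11.
    static Format senderChoicePre214(int peerVersion) {
        return peerVersion >= V_2_11_0 ? Format.CUSTOM : Format.JDK;
    }

    // Rule used by 2.14+ nodes with custom serialization disabled (the
    // default): always JDK.
    static Format choice214Plus() {
        return Format.JDK;
    }

    public static void main(String[] args) {
        // A 2.12 sender addressing a 2.14 peer would pick CUSTOM...
        System.out.println(senderChoicePre214(V_2_14_0)); // CUSTOM
        // ...while the 2.14 side operates in JDK terms: a mismatch.
        System.out.println(choice214Plus()); // JDK
    }
}
```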
@briend Do you also have cross cluster configured?
@Jakob3xD Is the error seen on the replica cluster?
@briend Do you also have cross cluster configured?
No, we don't have cross-cluster search configured on the cluster that had the issue. I was also able to recover by doing a full upgrade instead of a rolling one (shut all nodes down and start them up with 2.14.0).
@Jakob3xD Is the error seen on the replica cluster?
Nope, it only happened on the main cluster, and they are not replica clusters. They are just configured as remote clusters so we can query all logs via one OpenSearch.
I could not reproduce the error with the security plugin disabled, FWIW
[Triage] Hi @Jakob3xD, thanks for filing this issue. This looks like a good bug to get fixed. Going to mark as triaged.
When this error was seen on upgrade from < 2.11 to 2.11, it was because PerformanceAnalyzer had wrapped an instance of a TransportChannel and was not delegating getVersion() implementation to the wrapped channel.
Just to add: we didn't have PA enabled but still saw this issue happening while upgrading from < 2.10 to 2.11.1, and now the same while upgrading to 2.15. For ref: https://github.com/opensearch-project/security/issues/4085
Reposting with a better summary: in 2.11, custom serialization was introduced with https://github.com/opensearch-project/security/pull/2802
In 2.14, custom serialization was disabled and the plugin moved back to JDK serialization with https://github.com/opensearch-project/security/pull/4264
Hence, I think a cluster upgrading from 2.11/2.12/2.13 to any 2.14+ version can run into this issue due to the different serialization methods. Similarly, a cluster upgrading from < 2.11 to a 2.11-2.13 version can run into the same issue.
In 2.11/2.12/2.13, the serialization method is decided using a version check.
In 2.14+, serialization is decided using a version check in SerializationFormat.
To fix the issue, one of the possible options is to pick the serialization method using the source node version of the transport request. The fix might be similar to https://github.com/opensearch-project/security/pull/3826/
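A minimal sketch of that proposal, assuming custom serialization was only the default for 2.11.x-2.13.x senders (a 2.14+ node can still opt in via a setting, which this sketch ignores): the receiver derives the wire format purely from the source node's version.

```java
// Hypothetical sketch of the proposed fix: decide how to deserialize the
// header from the *source* node's version rather than from the local
// node's own defaults. Version ids modeled as major*10000 + minor*100.
public class SourceVersionFormat {
    enum Format { JDK, CUSTOM }

    static final int V_2_11_0 = 2_11_00;
    static final int V_2_14_0 = 2_14_00;

    // CUSTOM was only the default for 2.11.x-2.13.x senders, so anything
    // outside that window should be treated as JDK-serialized.
    static Format formatFor(int sourceNodeVersion) {
        boolean customWindow = sourceNodeVersion >= V_2_11_0
                && sourceNodeVersion < V_2_14_0;
        return customWindow ? Format.CUSTOM : Format.JDK;
    }

    public static void main(String[] args) {
        System.out.println(formatFor(2_12_00)); // CUSTOM
        System.out.println(formatFor(2_14_00)); // JDK
        System.out.println(formatFor(2_10_00)); // JDK
    }
}
```

With a rule like this, both sides of a mixed-version hop agree on the format because it is a function of the sender's version alone.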
Hey everyone, is anyone working on fixing this? It would be great if someone could pick this up, as it's hampering upgrades to versions newer than 2.10. Or maybe we introduce a variable to pick the serialization technique, as suggested here.
@Dhruvan1217 One thing to consider is jumping to 2.14 and above, which has custom serialization disabled by default. I have not been able to reproduce this issue. Can you share a little more information about the requests that receive this error?
One thing I have been looking into is whether it's possible that there are multiple transport hops happening while serving a request.
i.e.
Client ----> 2.14 Node (Coordinator node) -----> 2.14 node -----> 2.12 node.
The header that is causing problems, which @Jakob3xD pointed out, is first populated on the 2.14 node that receives the transport request from another 2.14 node and uses JDK serialization. If there's another transport hop from the 2.14 node to the 2.12 node, it re-uses the header, which is JDK-serialized. When the 2.12 node receives the request, it assumes it's custom-serialized, because the logic on the 2.12 branch checks whether the transmitter is >= 2.11.
This is all hypothetical though, I haven't been able to replicate.
An example of an API that uses multiple transport hops is indexing a document.
When making a request to insert a document the coordinator node forwards the request to the nodes with the primaries which then forward the request to the nodes containing replicas.
I looked into whether indexing a document would encounter this error, but OpenSearch ensures that replicas are always on the same node version or greater than the primaries. It's not possible to have a primary on a newer node version and a replica on an older node version.
Update: I was able to reproduce it with the scenario from above. It happens when replica shards are on an older node than the primary.
Not sure if I can reproduce it again, but these were the steps I followed:
Start with 3 2.12 nodes
curl -XPUT https://admin:admin@localhost:9200/movies -k -H "Content-Type: application/json" --data '{
"settings": {
"index": {
"number_of_shards": 1,
"number_of_replicas": 2
}
}
}'
curl -XPOST "https://admin:admin@localhost:9201/movies/_bulk?pretty" -k -H 'Content-Type: application/json' -d '
{ "index": { "_index": "movies", "_id": "1" } }
{ "title": "The Matrix", "year": 1999, "genre": "Sci-Fi", "director": "Lana Wachowski, Lilly Wachowski" }
{ "index": { "_index": "movies", "_id": "2" } }
{ "title": "Inception", "year": 2010, "genre": "Action", "director": "Christopher Nolan" }
{ "index": { "_index": "movies", "_id": "3" } }
{ "title": "Interstellar", "year": 2014, "genre": "Adventure", "director": "Christopher Nolan" }
'
curl -XPUT "https://admin:admin@localhost:9200/_cluster/settings" -k -H 'Content-Type: application/json' -d '
{
"persistent": {
"cluster.routing.allocation.node_concurrent_recoveries": "0"
}
}'
Reboot the node with a primary shard as 2.14
curl -XPOST "https://admin:admin@localhost:9200/_cluster/reroute?pretty" -k -H 'Content-Type: application/json' -d '
{
"commands": [
{
"move": {
"index": "movies",
"shard": 0,
"from_node": "<older_node>",
"to_node": "<newer_node>"
}
}
]
}'
curl -XPUT "https://admin:admin@localhost:9200/_cluster/settings" -k -H 'Content-Type: application/json' -d '
{
"persistent": {
"cluster.routing.allocation.node_concurrent_recoveries": "2"
}
}'
Check shard allocation
curl -XGET https://admin:admin@localhost:9200/_cat/shards\?v -k
Send a request to a node that does not have a shard for the movies index allocated to it
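For that last step, a request along these lines would do; the port 9202 here is an assumption for whichever node ended up without a movies shard:

```shell
# Assumption: the node on port 9202 holds no shard of the movies index,
# so the coordinator must forward the search over transport to another
# node, exercising the mixed-version serialization path.
curl -XGET "https://admin:admin@localhost:9202/movies/_search?pretty" -k
```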
@cwperks This situation makes sense to me.
OpenSearch ensures that replicas are always on the same node version or greater than the primaries.
I think during the upgrade, there are chances of Primary being on an updated node and replica on an old node.
Describe the bug
With the rolling upgrade from 2.13.0 to 2.14, one of my OpenSearch clusters started to fail, throwing exceptions. After restarting the third of six hardware nodes and upgrading it from 2.13 to 2.14, some indices went from yellow to red due to shard failures with the following exception:
Related component
Other
To Reproduce
Not known. The only big difference to my other clusters is that the failing cluster is doing cross-cluster search and therefore has remote targets configured.
Expected behavior
The upgrade from 2.13 to 2.14 should happen without such exceptions.
Additional Details
Plugins
Host/Environment (please complete the following information):
Additional context: A similar exception already existed in the past when updating from 2.10 to 2.11. https://github.com/opensearch-project/OpenSearch/issues/11491