Closed Dhruvan1217 closed 4 months ago
@cwperks Tagging you if you can take a look at this further https://github.com/opensearch-project/OpenSearch/issues/11491
This issue is in performance-analyzer, transferring to that repo
When the second node is being rebooted, could there be an inflight transport request that gets resumed when the second node is brought back up as a 2.12 node?
I'd have to dig into it further, but this line determines which serialization to use when sending a transport request. It could be that the first node (already replaced with a 2.12 node) is in the middle of sending a request to the second, still-1.3 node, and that node is rebooted before it replies to the 2.12 node. When the node comes back online, could the transport request be getting replayed?
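To illustrate why a replayed request could break, here is a minimal sketch of version-gated serialization, the general pattern OpenSearch transport code follows (version checks around newer fields in `writeTo`/`readFrom`). This is not the actual OpenSearch or security-plugin code; the class, the field, and the version constants are hypothetical, and `DataOutputStream` stands in for the real `StreamOutput`. The point it shows: if the bytes are written assuming one receiver version but read assuming another, the stream is misaligned.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class VersionGatedSerialization {
    // Hypothetical version ids; real OpenSearch uses the Version class.
    static final int V_1_3_0 = 1030099;
    static final int V_2_12_0 = 2120099;

    // Write the newer field only when the receiver is new enough to read it.
    static byte[] serialize(String user, String newField, int receiverVersion) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeUTF(user);
        if (receiverVersion >= V_2_12_0) {
            out.writeUTF(newField); // only 2.12+ readers expect this field
        }
        out.flush();
        return bytes.toByteArray();
    }

    // Read back using the version the sender was assumed to target.
    static String deserialize(byte[] payload, int senderVersion) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(payload));
        String user = in.readUTF();
        if (senderVersion >= V_2_12_0) {
            in.readUTF(); // consume the newer field
        }
        return user;
    }

    public static void main(String[] args) throws IOException {
        // Both sides agree the channel is 1.3-level: round-trips cleanly.
        byte[] old = serialize("admin", "roles", V_1_3_0);
        System.out.println(deserialize(old, V_1_3_0));

        // Mismatch: written for a 2.12 receiver, but read as if from 1.3.
        // The extra field's bytes are left dangling in the stream -- the
        // kind of corruption a resumed/replayed request could trigger.
        byte[] mixed = serialize("admin", "roles", V_2_12_0);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(mixed));
        in.readUTF();
        System.out.println("leftover bytes: " + in.available());
    }
}
```

If the channel's negotiated version changes across a node restart while a request is in flight, the writer and reader can disagree in exactly this way.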
@peternied We don't have PA enabled; this is happening without that plugin, and it seems to be a generic security module problem. We should move it back to the security repo. Thanks
@opensearch-project/admin Please transfer this issue to the security repo.
[Triage] Hi @Dhruvan1217, thanks for filing this issue and providing detailed reproduction steps. Someone will take a look and see what the problem is and note steps forward.
Also, AFAIU there were no such issues in OpenSearch 2.10, so maybe we can take a look at what was changed after that (I believe the introduction of custom serialization/deserialization).
What is the bug?
While upgrading the OpenSearch cluster from 1.3.9 to 2.12, shards are failing to recover. Data cannot be replicated to the replicas, with the following exception in the logs.
NOTE: This issue was also showing up in 2.11.x and was expected to be fixed in 2.12.0, Reference: https://github.com/opensearch-project/OpenSearch/issues/11491
How can one reproduce the bug? Steps to reproduce the behavior:
1) Set up a 1.3.x cluster (consider a 3-node cluster).
2) Ingest some data into indices and let the data continue to flow in (possibly also create new indices between node restarts).
3) Set shard allocation to `primaries`.
4) Start the in-place upgrade to 2.12.0 by upgrading the nodes one by one.
5) Reboot the first node; once it has initialized, let the cluster turn green (by setting allocation to `all`) before restarting the following nodes.
6) Set allocation to `primaries` again and reboot the second node.
7) As soon as the second node has initialized and allocation is set to `all`, if a replica shard is assigned to this node while its primary shard is on the first upgraded node, you will see errors in the logs when the first node tries to write/replicate the data to the replica shard.

What is the expected behavior? The data write to shard replicas should be successful.
What is your host/environment?