Open sandervandegeijn opened 2 months ago
@sandervandegeijn Thanks for filing this issue, please feel free to submit a pull request.
2.18: problem still exists.
Scenario: rolling upgrade from 2.17.1 to 2.18. All nodes are upgraded, but the cluster stays yellow.
Workaround (dirty but....)
solved.
I mean no offense at all. I'm seeing a lot of effort being directed at performance and the tiered caching while these kinds of bugs persist. Of course we are happy with everything that Amazon is contributing and we can't look a gift horse in the mouth. Still, I'm a bit confused by the priorities. Performance isn't bad at all, so more performance is a (really!) nice-to-have, but not a must-have.
Bugs that break upgrades and production environments should take precedence in my very humble opinion.
Describe the bug
We have encountered this bug multiple times, also before 2.16.0.
When cluster nodes are already above the low watermark, so that new indices are distributed to other nodes, the cluster can go yellow. The cause seems to be the default policy on system indices, auto_expand_replicas: "1-all". It tries to allocate replicas to nodes that cannot accept more data because of the watermark situation.
This seems to happen when Kubernetes is reallocating OpenSearch nodes to different k8s compute nodes.
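To make the suspected mechanism concrete, here is a minimal sketch of the arithmetic. It is a simplified model of what this issue describes, not the actual OpenSearch allocator: the function names, the node names, and the disk usage figures are hypothetical, and the 85% threshold assumes the default value of `cluster.routing.allocation.disk.watermark.low`.

```python
from typing import Dict, Set

# Assumption: default value of cluster.routing.allocation.disk.watermark.low
LOW_WATERMARK_PCT = 85.0


def desired_replicas(auto_expand: str, data_nodes: int) -> int:
    """Simplified model of auto_expand_replicas ("min-max", max may be "all")."""
    low, high = auto_expand.split("-")
    min_replicas = int(low)
    # "all" means one copy per data node, i.e. data_nodes - 1 replicas.
    max_replicas = data_nodes - 1 if high == "all" else min(int(high), data_nodes - 1)
    # Aim for one copy per data node, clamped into [min, max].
    return max(min_replicas, min(data_nodes - 1, max_replicas))


def unassigned_replicas(
    auto_expand: str, disk_used_pct: Dict[str, float], holders: Set[str]
) -> int:
    """Count shard copies that cannot be placed on any node below the low watermark."""
    wanted_copies = 1 + desired_replicas(auto_expand, len(disk_used_pct))
    missing = max(0, wanted_copies - len(holders))
    # Only nodes without a copy and below the low watermark accept a new copy.
    candidates = [
        node
        for node, used in disk_used_pct.items()
        if node not in holders and used < LOW_WATERMARK_PCT
    ]
    return max(0, missing - len(candidates))


# Hypothetical 12-node cluster: 8 nodes at 70% disk usage, 4 nodes at 90%.
usage = {f"node-{i}": (70.0 if i < 8 else 90.0) for i in range(12)}
# node-11 was restarted and no longer holds its copy of the system index shard.
holders = {f"node-{i}" for i in range(11)}

stuck_with_all = unassigned_replicas("1-all", usage, holders)
stuck_with_0_1 = unassigned_replicas("0-1", usage, holders)
```

In this model, "1-all" wants a copy on every node, so the restarted node above the watermark leaves one replica unassigned and the cluster yellow, while a bounded range such as "0-1" is already satisfied by the existing copies.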
Cluster state:
It tries to allocate the replicas:
So if we have 12 nodes, it tries to allocate 11 replicas when a node restarts. But that seems to fail because several nodes are above the low watermark (why not distribute the free space more evenly?). The only solutions seem to be to lower the auto expand setting or to manually redistribute shards across the nodes to even out the disk space usage.
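For the first option, lowering the auto expand setting, a request against the index settings API could look roughly like the sketch below. The base URL, the index name `my-system-index`, and the value "0-1" are placeholders chosen for illustration, not values taken from this cluster. The helper only builds the request; authentication and TLS setup are left out, and system indices may need elevated privileges to modify.

```python
import json
import urllib.request


def build_auto_expand_request(
    base_url: str, index: str, value: str
) -> urllib.request.Request:
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"auto_expand_replicas": value}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Placeholder host and index name; replace with real values before use.
req = build_auto_expand_request("https://localhost:9200", "my-system-index", "0-1")

# Sending is left to the caller, for example:
# urllib.request.urlopen(req)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything touches a production cluster.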
Cluster storage state:
Related component
Storage
To Reproduce
1. Cluster is nearing capacity (good from a storage cost perspective)
2. Cluster gets rebooted or individual nodes get rebooted
3. Cluster goes to yellow state
Expected behavior
- Rebalance shards proactively based on storage usage of nodes
- System indices might take priority, ignoring the low/high watermark until cluster disk usage really becomes critical
Additional Details
Plugins Default
Screenshots N/A
Host/Environment (please complete the following information): Default 2.16.0 docker images
Additional context N/A