opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] 2.16.0 Auto-expand replicas causes cluster yellow state when cluster nodes are above low watermark #15919

Open sandervandegeijn opened 2 weeks ago

sandervandegeijn commented 2 weeks ago

Describe the bug

We have encountered this bug multiple times, also on versions before 2.16.0.

When cluster nodes are already above the low watermark, causing new indices to be distributed to other nodes, the cluster can end up in a yellow state. The cause seems to be the default setting on system indices, auto_expand_replicas: "1-all", which tries to allocate replicas to nodes that cannot accept more data because of the watermark situation.

This seems to happen when Kubernetes reschedules OpenSearch nodes onto different k8s compute nodes.
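
For reference, the effective setting on the security index can be checked with the index settings API (flat_settings is optional, only for readability):

GET .opendistro_security/_settings?flat_settings=true

The response should contain "index.auto_expand_replicas": "1-all".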

Cluster state:

{
  "cluster_name": "xxxxx",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 17,
  "number_of_data_nodes": 12,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 2631,
  "active_shards": 3140,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 99.90454979319122
}
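
The state above is the output of the cluster health API:

GET _cluster/health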

It tries to allocate the replicas, but every node rejects the allocation:

{
  "index": ".opendistro_security",
  "shard": 0,
  "primary": false,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "CLUSTER_RECOVERED",
    "at": "2024-09-12T14:45:27.211Z",
    "last_allocation_status": "no_attempt"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "02CeBVQKTa2lD1Qx0GAS3Q",
      "node_name": "opensearch-data-nodes-hot-6",
      "transport_address": "10.244.33.33:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [8.175061087167675%]"
        }
      ]
    },
    {
      "node_id": "Balhhxf2T2uNpUP6rq88Ag",
      "node_name": "opensearch-data-nodes-hot-2",
      "transport_address": "10.244.86.36:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [9.615515861288957%]"
        }
      ]
    },
    {
      "node_id": "DppvPjxgR0u8CVQVyAX0UA",
      "node_name": "opensearch-data-nodes-hot-7",
      "transport_address": "10.244.97.29:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[DppvPjxgR0u8CVQVyAX0UA], [R], s[STARTED], a[id=Q9PoLV1wRGumidM22EKveQ]]"
        },
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [12.463841799195983%]"
        }
      ]
    },
    {
      "node_id": "LQSYXzHbTfqowAOj3nrU3w",
      "node_name": "opensearch-data-nodes-hot-4",
      "transport_address": "10.244.70.30:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [7.916677463242952%]"
        }
      ]
    },
    {
      "node_id": "Ls8ptyo7ROGtFeO8hY5c5Q",
      "node_name": "opensearch-data-nodes-hot-9",
      "transport_address": "10.244.54.37:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[Ls8ptyo7ROGtFeO8hY5c5Q], [R], s[STARTED], a[id=j_FrjkN7R0aCEokKa4tjCA]]"
        }
      ]
    },
    {
      "node_id": "O_CCkTbmRtiuJU3cV93EaA",
      "node_name": "opensearch-data-nodes-hot-1",
      "transport_address": "10.244.83.46:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [8.445263138130201%]"
        }
      ]
    },
    {
      "node_id": "OfBmEaQsSsuJtJ4TKadLnQ",
      "node_name": "opensearch-data-nodes-hot-10",
      "transport_address": "10.244.37.46:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [11.538695394244522%]"
        }
      ]
    },
    {
      "node_id": "RC5KMwpWRMCVrGaF_7oGBA",
      "node_name": "opensearch-data-nodes-hot-0",
      "transport_address": "10.244.99.67:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [12.185368398769644%]"
        }
      ]
    },
    {
      "node_id": "S_fk2yqhQQuby8HM4hJXVA",
      "node_name": "opensearch-data-nodes-hot-8",
      "transport_address": "10.244.45.64:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [10.432421573093784%]"
        }
      ]
    },
    {
      "node_id": "_vxbOtloQmapzz0DbXBsjA",
      "node_name": "opensearch-data-nodes-hot-5",
      "transport_address": "10.244.79.58:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[_vxbOtloQmapzz0DbXBsjA], [P], s[STARTED], a[id=hY9WcHR-S_6TN3kTj4NZJA]]"
        }
      ]
    },
    {
      "node_id": "pP5muAyTSA2Z45yO8Ws0VA",
      "node_name": "opensearch-data-nodes-hot-3",
      "transport_address": "10.244.101.66:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [9.424146099675534%]"
        }
      ]
    },
    {
      "node_id": "zRdO9ndKSbuJ97t77-OLLw",
      "node_name": "opensearch-data-nodes-hot-11",
      "transport_address": "10.244.113.26:9300",
      "node_attributes": {
        "temp": "hot",
        "shard_indexing_pressure_enabled": "true"
      },
      "node_decision": "no",
      "deciders": [
        {
          "decider": "same_shard",
          "decision": "NO",
          "explanation": "a copy of this shard is already allocated to this node [[.opendistro_security][0], node[zRdO9ndKSbuJ97t77-OLLw], [R], s[STARTED], a[id=O7z4RvkiQXGMcfhRSPm8lQ]]"
        },
        {
          "decider": "disk_threshold",
          "decision": "NO",
          "explanation": "the node is above the low watermark cluster setting [cluster.routing.allocation.disk.watermark.low=87%], using more disk space than the maximum allowed [87.0%], actual free: [11.883587901703455%]"
        }
      ]
    }
  ]
}
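
This explanation can be reproduced with the allocation explain API, pointing it at the unassigned replica of the security index:

GET _cluster/allocation/explain
{
  "index": ".opendistro_security",
  "shard": 0,
  "primary": false
}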

So with 12 data nodes it tries to allocate 11 replicas when a node restarts, but that seems to fail because several nodes are above the low watermark (why not distribute the free space more evenly?). The only solutions seem to be lowering the auto-expand setting or manually redistributing shards across the nodes to even out disk usage.
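
Both workarounds go through standard APIs; a minimal sketch (the upper bound of 5, the index name, and the node names are only example values, the node names taken from the table below):

PUT .opendistro_security/_settings
{
  "index": {
    "auto_expand_replicas": "1-5"
  }
}

POST _cluster/reroute
{
  "commands": [
    {
      "move": {
        "index": "my-large-index",
        "shard": 0,
        "from_node": "opensearch-data-nodes-hot-4",
        "to_node": "opensearch-data-nodes-hot-9"
      }
    }
  ]
}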

Cluster storage state:

n                            id   v      r rp      dt      du   dup hp load_1m load_5m load_15m
opensearch-master-nodes-0    twM5 2.16.0 m 60   9.5gb 518.1mb  5.32 56    1.74    1.38     1.17
opensearch-data-nodes-hot-5  _vxb 2.16.0 d 96 960.1gb 649.6gb 67.66 41    1.14    1.14     1.10
opensearch-master-nodes-2    nQD7 2.16.0 m 59   9.5gb 518.1mb  5.32 37    1.15    1.06     1.09
opensearch-data-nodes-hot-11 zRdO 2.16.0 d 92 960.1gb   859gb 89.47 31    2.33    3.13     3.62
opensearch-data-nodes-hot-6  02Ce 2.16.0 d 90 960.1gb 848.5gb 88.38 62    1.40    1.40     1.60
opensearch-data-nodes-hot-4  LQSY 2.16.0 d 95 960.1gb 886.5gb 92.33 35    2.33    2.40     2.56
opensearch-data-nodes-hot-10 OfBm 2.16.0 d 96 960.1gb 861.7gb 89.75 58    3.69    4.27     4.21
opensearch-ingest-nodes-0    bx4Z 2.16.0 i 65    19gb  1016mb  5.21 73    2.31    2.60     2.54
opensearch-data-nodes-hot-3  pP5m 2.16.0 d 61 960.1gb 869.6gb 90.58 35    1.71    1.64     1.89
opensearch-data-nodes-hot-9  Ls8p 2.16.0 d 95 960.1gb 643.2gb 66.99 27    0.72    1.00     1.02
opensearch-data-nodes-hot-7  Dppv 2.16.0 d 91 960.1gb 842.4gb 87.74 53    1.29    1.87     1.74
opensearch-data-nodes-hot-2  Balh 2.16.0 d 63 960.1gb 867.8gb 90.38 31    1.93    1.73     1.45
opensearch-data-nodes-hot-8  S_fk 2.16.0 d 64 960.1gb 859.9gb 89.57 42    0.66    0.66     0.71
opensearch-data-nodes-hot-1  O_CC 2.16.0 d 89 960.1gb 884.9gb 92.17 11    1.53    1.48     1.33
opensearch-data-nodes-hot-0  RC5K 2.16.0 d 85 960.1gb 844.8gb 87.99 62    0.77    0.90     1.10
opensearch-master-nodes-1    r70_ 2.16.0 m 58   9.5gb 518.1mb  5.32 58    0.76    0.88     1.05
opensearch-ingest-nodes-1    NX1N 2.16.0 i 61    19gb  1016mb  5.21 17    0.49    1.12     1.77
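
The table above comes from the cat nodes API; a roughly equivalent request (the exact column list is reconstructed from the abbreviated headers) is:

GET _cat/nodes?v&h=name,id,version,node.role,ram.percent,disk.total,disk.used,disk.used_percent,heap.percent,load_1m,load_5m,load_15m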

Related component

Storage

To Reproduce

1. Cluster is nearing capacity (good from a storage cost perspective)
2. Cluster gets rebooted, or individual nodes get rebooted
3. Cluster goes to yellow state

Expected behavior

Rebalance shards proactively based on the storage usage of nodes.
System indices might take priority, ignoring the low/high watermark until cluster disk usage really becomes critical.
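
For comparison, the only disk-based rebalancing that happens today is relocation away from nodes that cross the high watermark; the relevant cluster settings (shown here with their stock default values, purely as an illustration) are:

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}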

Additional Details

Plugins: Default

Screenshots: N/A

Host/Environment: Default 2.16.0 Docker images

Additional context: N/A

ashking94 commented 1 week ago

@sandervandegeijn Thanks for filing this issue; please feel free to submit a pull request.