opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator

[BUG] Scaling down nodePool doesn't reassign all shards #870

Open pbagona opened 2 weeks ago

pbagona commented 2 weeks ago

What is the bug?

When scaling down a nodePool, the operator logs messages about draining the removed node, but after the drain finishes the cluster health status is red and some shards remain unassigned.

How can one reproduce the bug?

My current setup has 4 nodePools: master with 3 replicas (role master), nodes with 2 replicas and 300Gi storage each (roles data+ingest), ingests with 3 replicas and 100Gi storage each (role ingest), and data with 5 replicas and 1Ti storage each (role data). Scaling down the nodes nodePool introduces issues with shard allocation and the cluster health status.
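
For reference, the nodePools portion of the manifest looks roughly like the following (a simplified sketch, not the exact manifest; field names follow the operator's OpenSearchCluster CRD as I understand it, and resource requests, persistence, and security settings are omitted):

```sh
# Rough sketch of the cluster spec described above (simplified, assumed fields).
# Scaling down means lowering "replicas" for the "nodes" pool and re-applying.
cat <<'EOF' | kubectl apply -n int2-opensearch -f -
apiVersion: opensearch.opster.io/v1
kind: OpenSearchCluster
metadata:
  name: int2-opensearch
spec:
  general:
    serviceName: int2-opensearch
    version: "1.3.16"
  nodePools:
    - component: master
      replicas: 3
      roles: ["master"]
    - component: nodes          # the pool being scaled down / removed
      replicas: 2
      diskSize: 300Gi
      roles: ["data", "ingest"]
    - component: ingests
      replicas: 3
      diskSize: 100Gi
      roles: ["ingest"]
    - component: data
      replicas: 5
      diskSize: 1Ti
      roles: ["data"]
EOF
```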

What is the expected behavior?

The expected behavior is that after the operator drains a node and decommissions it, the cluster health status is green.

What is your host/environment?

k8s v1.27.13, OpenSearch k8s operator 2.5.1, OpenSearch cluster 1.3.16

Do you have any screenshots?

Yes, screenshots are posted below.

Do you have any additional context?

The nodes and data nodePools existed first, then ingests was added, and the goal now is to remove the old nodes nodePool.

I used this same setup with OpenSearch cluster version 2.x on a different k8s cluster and it worked as expected: when a nodePool was removed, the operator drained the nodes of that nodePool one by one and removed them, there was no interruption to service, and after it finished the cluster health status remained green.

When performing the same steps on OpenSearch cluster version 1.3.16, the cluster health status goes RED and some shards cannot be allocated. Sometimes 1 shard remains unallocated, sometimes more.

I tried removing the nodePool from the manifest all at once, and I also tried scaling it down by just one replica, but got the same outcome.

In the operator logs, I see that it correctly waits for the node to drain and then decommissions it, but at that very moment the cluster goes into a RED state and I see allocation errors.

When I add the removed nodePool/replica back to the manifest, once the pod is up and running the cluster status goes back to green and everything behaves normally.

I tried this several times and got one of a few allocation errors every time.

Also, as seen in the screenshots below, before scaling down the nodes show 12.3gb of used storage under disk.indices. When one of the nodes in the nodePool gets removed, the number of shards seems to be redistributed, but the disk.indices value stays the same for all nodes, or changes only minimally, and does not account for the 12.3gb that should have been relocated to the remaining nodes. When the nodePool is scaled back up to its original size and the removed pod mounts its old PV, everything is back to a normal green state.
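
The disk.indices values in the screenshots can be checked with the _cat/allocation API, roughly like this (endpoint and credentials are placeholders for whatever the cluster uses):

```sh
# Per-node shard count and disk.indices; disk.indices is the value that stays
# (almost) unchanged after the scale-down even though shard counts change.
curl -sk -u admin:admin \
  "https://int2-opensearch.int2-opensearch.svc.cluster.local:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail"
```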

{"level":"debug","ts":"2024-09-11T17:39:26.493Z","logger":"events","msg":"Start to Exclude int2-opensearch/int2-opensearch","type":"Normal","object":{"kind":"OpenSearchCluster","namespace":"int2-opensearch","name":"int2-opensearch","uid":"4b093d1c-5644-411a-a010-af0c78faf969","apiVersion":"opensearch.opster.io/v1","resourceVersion":"173437926"},"reason":"Scaler"}
{"level":"info","ts":"2024-09-11T17:39:26.531Z","msg":"Group: nodes, Node int2-opensearch-nodes-1 is drained","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"int2-opensearch","namespace":"int2-opensearch"},"namespace":"int2-opensearch","name":"int2-opensearch","reconcileID":"b99653cd-8ca2-46c7-ba4a-b558f966345a"}
{"level":"info","ts":"2024-09-11T17:39:26.546Z","msg":"Reconciling OpenSearchCluster","controller":"opensearchcluster","controllerGroup":"opensearch.opster.io","controllerKind":"OpenSearchCluster","OpenSearchCluster":{"name":"int2-opensearch","namespace":"int2-opensearch"},"namespace":"int2-opensearch","name":"int2-opensearch","reconcileID":"93813285-fb36-4c9e-b2d1-1b2d08e28df5","cluster":{"name":"int2-opensearch","namespace":"int2-opensearch"}}
...
...
{"level":"debug","ts":"2024-09-11T17:39:26.637Z","logger":"events","msg":"Start to Drain int2-opensearch/int2-opensearch","type":"Normal","object":{"kind":"OpenSearchCluster","namespace":"int2-opensearch","name":"int2-opensearch","uid":"4b093d1c-5644-411a-a010-af0c78faf969","apiVersion":"opensearch.opster.io/v1","resourceVersion":"173438036"},"reason":"Scaler"}
{"level":"debug","ts":"2024-09-11T17:39:26.637Z","logger":"events","msg":"Start to decreaseing node int2-opensearch-nodes-1 on nodes ","type":"Normal","object":{"kind":"OpenSearchCluster","namespace":"int2-opensearch","name":"int2-opensearch","uid":"4b093d1c-5644-411a-a010-af0c78faf969","apiVersion":"opensearch.opster.io/v1","resourceVersion":"173438036"},"reason":"Scaler"}

Cluster health status

{
  "cluster_name" : "int2-opensearch",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 12,
  "number_of_data_nodes" : 6,
  "discovered_master" : true,
  "active_primary_shards" : 92,
  "active_shards" : 183,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 1,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 99.45652173913044
}
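
For completeness, the health output above is from the standard cluster health API, e.g. (endpoint and credentials are placeholders):

```sh
curl -sk -u admin:admin \
  "https://int2-opensearch.int2-opensearch.svc.cluster.local:9200/_cluster/health?pretty"
```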

Allocation before change

(screenshot)

Example of allocation after change

(screenshots)

Example of unallocated shard explanation

{
  "index" : "***********",
  "shard" : 1,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "NODE_LEFT",
    "at" : "2024-09-11T18:04:19.246Z",
    "details" : "node_left [pDXfgCn9TQuRA5bGR1DKPw]",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because all found copies of the shard are either stale or corrupt",
  "node_allocation_decisions" : [

EDIT: I tried it again to collect more information and noticed that when I scale down the nodes nodePool from 2 to 1, the operator goes from int2-opensearch-nodes-0, int2-opensearch-nodes-1 to just int2-opensearch-nodes-0 and drains node int2-opensearch-nodes-1. During this process, it reallocates some shards to the node that is being drained; then the node's pod is terminated and removed from the cluster, and the operator logs are as posted above.

(screenshot)
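
A simple way to observe this while the operator is draining (endpoint and credentials are placeholders; the node name is taken from the logs above) is to list the shards still located on the node being removed:

```sh
# Shards currently located on the node the operator is draining; ideally this
# list should be empty before the pod int2-opensearch-nodes-1 is terminated.
curl -sk -u admin:admin \
  "https://int2-opensearch.int2-opensearch.svc.cluster.local:9200/_cat/shards?v" \
  | grep int2-opensearch-nodes-1
```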

prudhvigodithi commented 1 week ago

[Triage] Thanks @pbagona for the detailed description. I assume this has something to do with the 1.3.16 version of OpenSearch (since, as you mentioned, it works with 2.x). Also, since 1.3.x is only in maintenance mode, I would recommend using the latest 2.x version of OpenSearch.

Also, when the state is RED, have you tried scaling down to zero and then scaling back up (or a fresh restart)?
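
For example, just as an illustration of the "fresh restart" idea (pod name taken from the logs above):

```sh
# Delete one pod of the affected pool; the operator-managed StatefulSet
# recreates it and the node rejoins the cluster with its existing PV.
kubectl -n int2-opensearch delete pod int2-opensearch-nodes-0
```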

Thank you @swoehrl-mw @getsaurabh02

swoehrl-mw commented 1 week ago

I concur with @prudhvigodithi here. This looks like a problem with OpenSearch itself. From your description, OpenSearch is not able to correctly recover some shards if one of the replicas is removed. Since the 1.x version is no longer actively developed, it does not make sense to implement special logic for this in the operator.