[Enhancement][Opensearch] Readiness Probes

ic-ruben-burgue commented 1 year ago

Hi :). Hopefully this is not a duplicate issue.

Is your feature request related to a problem? Please describe. During rolling restarts of data nodes, the readiness probes (https://github.com/opensearch-project/helm-charts/blob/main/charts/opensearch/values.yaml#L351) do not wait for the cluster to be green again before marking the data node as Ready. As a result, nodes are restarted in quick succession, which can easily push the cluster into RED state (depending on the number of nodes, shards, replicas...).

It can also break the migration of .kibana indices, showing the following message: [resource_already_exists_exception]: index [.kibana_Y/XXXXXXXXX_XXXXXXXX] already exists

Describe the solution you'd like It would be great to have a solution similar to the one Elasticsearch uses in their charts. They wait for the cluster to be green again before marking the node as Ready.

Describe alternatives you've considered As a temporary fix, I increased initialDelaySeconds so the cluster has time to load the node and find the shards again. But with a large amount of data this takes several minutes, and it's hard to find a value that works.

Divyaasm commented 1 year ago

@prudhvigodithi Could you please look into this issue? Thanks

prudhvigodithi commented 1 year ago

Hey @ic-ruben-burgue, thanks for bringing this up. This looks similar to https://github.com/opensearch-project/helm-charts/issues/307. The ideal way is to use the cluster health endpoint, which is unauthenticated, for the probe.

 curl -XGET "http://localhost:9200/_cluster/health?pretty"   
{
  "cluster_name" : "opensearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
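
As a rough illustration of how a probe could use this response, here is a minimal sketch, not the chart's actual implementation. It assumes `curl` and `sed` are available in the container; the `HEALTH_URL` variable and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical readiness check: succeed only when cluster health is "green".
# HEALTH_URL is an assumed variable; adjust scheme, host and port as needed.
HEALTH_URL="${HEALTH_URL:-http://localhost:9200/_cluster/health}"

# Print the value of the "status" field from a cluster health JSON
# document read on stdin (prints nothing if the field is absent).
health_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# Return 0 only when the given status counts as ready.
is_ready() {
  [ "$1" = "green" ]
}

# Fetch cluster health and return 0 only if the cluster is green.
# A failed or timed-out request yields an empty status, so it is not ready.
check_cluster() {
  status=$(curl --silent --max-time 5 "$HEALTH_URL" | health_status)
  is_ready "$status"
}
```

A probe script would call `check_cluster` as its last command, so the function's exit code becomes the probe result. Whether yellow should also count as ready (for example on single-node clusters) is a design choice this sketch leaves out.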

@TheAlgo @bbarani @peterzhuamazon WDYT?

prudhvigodithi commented 1 year ago

Hey, I just found one more hurdle: the endpoint is no longer unauthenticated as soon as the security plugin kicks in. Example as follows:

curl -XGET "https://localhost:9200/_cluster/health?pretty"  -k
Unauthorized

I tested the earlier command before the security plugin was installed, where it works fine. Ideally this endpoint should be open without authentication.
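
Until such an endpoint exists, one possible workaround is to pass credentials to the health check. The sketch below is only an illustration under assumptions: the environment variables `OPENSEARCH_USER`, `OPENSEARCH_PASSWORD` and `HEALTH_URL` are hypothetical names, not something the chart is known to provide.

```shell
#!/bin/sh
# Sketch: query cluster health with HTTP basic auth when credentials are set.
# OPENSEARCH_USER, OPENSEARCH_PASSWORD and HEALTH_URL are hypothetical names.
HEALTH_URL="${HEALTH_URL:-https://localhost:9200/_cluster/health}"

# Print the cluster health response, authenticating only if a user is set.
fetch_health() {
  if [ -n "${OPENSEARCH_USER:-}" ]; then
    # -u sends basic-auth credentials; -k skips TLS verification,
    # mirroring the -k flag used in the command above.
    curl --silent --max-time 5 -k \
      -u "${OPENSEARCH_USER}:${OPENSEARCH_PASSWORD:-}" "$HEALTH_URL"
  else
    curl --silent --max-time 5 -k "$HEALTH_URL"
  fi
}

# Return 0 only if the response reports a green cluster.
# An "Unauthorized" body contains no status field, so it is not green.
is_green() {
  fetch_health | grep -q '"status"[[:space:]]*:[[:space:]]*"green"'
}
```

This keeps a password in the pod environment, which is exactly the dependency an unauthenticated health endpoint would avoid.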

@nknize @dblock I don't see any unauthenticated endpoint for checking cluster health once the security plugin is installed. We should have one, so that the health of the cluster can be tested without authentication before routing traffic. This would help ensure the cluster is green without needing any user password. WDYT?

Adding @bbarani @CEHENKLE

Thank you

opensearch-project / helm-charts

[Enhancement][Opensearch] Readiness Probes #402