opensearch-project / opensearch-k8s-operator

OpenSearch Kubernetes Operator

Stuck in cluster restart loop #524

Closed dermicus-miclip closed 1 year ago

dermicus-miclip commented 1 year ago

Running 3 master/data nodes with OpenSearch 2.7.0 and Operator 2.3.0.

I observe that the cluster keeps trying to restart and is stuck in a loop it cannot recover from on its own. When running GET _cluster/settings, I see that the operator has set some transient settings during the rolling restart. After reading the code, that seems intentional.

But after the nodes have been restarted, that configuration stays in place and is never removed by the restart controller as it should be. Eventually, the cluster's health turns yellow. The only way I can resolve this is by manually removing the cluster settings that block shard allocation, which causes the restart reconciler to try again.
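For reference, this is roughly how I clear the blocking setting by hand (a minimal sketch; the endpoint and admin credentials are placeholders, adjust for your setup):

# Reset the transient allocation exclude so shards can move to the restarted nodes again.
# Setting a transient setting to null removes it.
curl -k -u admin:admin -X PUT "https://localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.exclude._name": null}}'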

My guess is that somewhere in the process something goes wrong, and in this scenario the operator's reconciler cannot recover from it. Unfortunately, I don't see anything in the logs that points me towards the root cause.

My guess is that it just keeps restarting because it never reaches the end of a full restart. What I also observe: kube-rbac-proxy keeps spitting out errors.

Example of a failed rolling restart:

nodes-0 and nodes-1 have already restarted, but nodes-2 still has to. GET _cluster/settings now returns:

{
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "exclude": {
            "_name": ",monitoring-cluster-masters-0,monitoring-cluster-masters-1"
          }
        }
      }
    }
  }
}

This indicates that the nodes that just restarted are still excluded from shard allocation. In this scenario, cluster health turns yellow, which leaves the reconciler stuck, as it cannot continue with an unhealthy cluster.
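To confirm that the exclude is what keeps shards unassigned, checking health and asking the cluster to explain the allocation is usually enough (a sketch with the same placeholder endpoint and credentials as above):

# Yellow health means some replica shards are unassigned.
curl -k -u admin:admin "https://localhost:9200/_cluster/health?pretty"
# Explains why a shard is not being allocated; the filter decider
# should report the cluster.routing.allocation.exclude rule as the blocker.
curl -k -u admin:admin "https://localhost:9200/_cluster/allocation/explain?pretty"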

dermicus-miclip commented 1 year ago

Seems related to #446. I also have a keystore configuration, which seems to cause the cluster to keep restarting; that diff is always there. I'm not entirely sure this is the root cause, but I will investigate whether this is a duplicate.

Edit: when will this fix be released?

swoehrl-mw commented 1 year ago

Hi @dermicus-miclip. This is a bit of a weird situation, because from the code the operator should remove the exclude from the OpenSearch settings as soon as it has deleted a pod in Kubernetes. Can you check the status field of the cluster object (with kubectl get opensearchcluster yyy -o yaml)? That should give some insight into what the operator is doing. As you mentioned, #446 could be involved; hard to tell.
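For example (a sketch; yyy stands in for your cluster name, and the exact status fields depend on the operator version):

# Full object including the status block the operator maintains.
kubectl get opensearchcluster yyy -o yaml
# Or just the status, if jq is available:
kubectl get opensearchcluster yyy -o json | jq .status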

> kube-rbac-proxy keeps spitting out errors

You can ignore that; it has no effect on operator operations.

dermicus-miclip commented 1 year ago

@swoehrl-mw, thanks for the response. I'll check it out and get back to you with more info. I'll first test with 2.3.0; thanks to you, I can also see what happens with the new bugfix release 🎉

dermicus-miclip commented 1 year ago

@swoehrl-mw, I'm closing this one now. With the new operator fix you released, everything works again. Thanks a lot!