vesoft-inc / nebula-operator

Operation utilities for Nebula Graph
https://vesoft-inc.github.io/nebula-operator
Apache License 2.0

Autofailover Stuck in Unable to Find Pod Status After graphd scale down #529

Closed: kevinliu24 closed this issue 3 weeks ago

kevinliu24 commented 1 month ago

Please check the FAQ documentation before raising an issue

Describe the bug (required)

Snap reported that graphd failed to scale up and start new pods after the nebula autoscaler increased the number of graphd replicas from 2 to 4. The desired replica count shown by kubectl describe for both the autoscaler and the nebula cluster is correct, but no new pods were started. Further investigation revealed the error E1007 18:17:25.249973 1 nebula_cluster_controller.go:196] NebulaCluster [cb/cb] reconcile failed: rebuilt graphd pod [cb/cb-graphd-2] not found, skip in the operator log, which was thrown during auto failover while checking the status of new pods. Also, kubectl get pods reveals only 2 graphd pods. This happened due to the following sequence:

  1. Auto failover was triggered for a graphd pod due to a failure such as a node going down.
  2. A new pod was started but stayed in pending state.
  3. Before the pod reached running state, the nebula autoscaler was triggered to scale down graphd, causing the new pod to be terminated.
  4. However, the new pod was never removed from the auto failover map because it never reached running state (currently auto failover only removes a pod from its map once the pod reaches running state).
  5. As a result, auto failover gets stuck looking for a pod that doesn't exist, and the new graphd pods fail to start because scaling currently happens only after auto failover succeeds.

Solution: Remove the pod from the auto failover map when it's terminated.

Related logs are attached below: Snap-na-describe-output.txt, cb_nc.txt, controller-manager-logs.txt, Snap-nc-pods-output.txt
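For reference, the state above can be inspected with commands along these lines. This is only a sketch: the cb namespace/cluster name comes from this environment, and the na short name for the NebulaAutoscaler resource is an assumption based on the attached file names (verify with kubectl api-resources):

```shell
# Desired replicas reported by the autoscaler and the cluster spec
kubectl describe na cb -n cb
kubectl describe nc cb -n cb

# Actual graphd pods (only 2 in this case)
kubectl get pods -n cb | grep graphd

# Operator log containing the "rebuilt graphd pod ... not found, skip" error
kubectl logs deploy/<nebula-operator-controller-manager> -n <operator-namespace> | grep "reconcile failed"
```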

Your Environments (required)

How To Reproduce (required)

Steps to reproduce the behavior (example commands are sketched after this list):

  1. Start a Kubernetes cluster and deploy NebulaGraph with multiple graphd pods. Make sure auto failover and local PV are turned on.
  2. Deploy the nebula autoscaler with maximum replicas >= the current graphd replicas.
  3. Cordon a node running a graphd pod and wait for the affected graphd pod to go into pending status.
  4. Modify the nebula autoscaler and set maximum replicas to < the current graphd replicas.
  5. Wait for the new pod to be terminated.
  6. Modify the nebula autoscaler again and set maximum replicas to > the current graphd replicas.
  7. New pods will fail to start and the cluster will be stuck in the auto failover state, with the error from the description showing up in the operator log.
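A rough sketch of the commands for steps 3-6, assuming the cluster and autoscaler are both named cb in namespace cb and that na is the NebulaAutoscaler short name (both assumptions; adjust to your environment):

```shell
# Step 3: cordon the node hosting one of the graphd pods
kubectl get pods -n cb -o wide | grep graphd   # find the node name
kubectl cordon <node-name>

# Step 4: lower maxReplicas below the current graphd replica count
kubectl edit na cb -n cb

# Step 5: wait for the rebuilt pod to be terminated
kubectl get pods -n cb -w

# Step 6: raise maxReplicas above the current graphd replica count
kubectl edit na cb -n cb

# Step 7: only the old graphd pods are listed and the operator log shows the reconcile error
kubectl get pods -n cb | grep graphd
```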

Expected behavior

Graphd should scale up and start new pods successfully

Additional context

All related logs and the cluster config are attached

kevinliu24 commented 1 month ago

For now Snap is working around this with the following steps (a scripted equivalent is sketched after the list), but this still needs to be fixed:

  1. Run kubectl edit nc and set Enable Auto Failover to false. This will allow the operator to get out of the loop and do the scaling.
  2. Wait for the new pods to start up.
  3. Then run kubectl edit nc again and set Enable Auto Failover back to true. Auto failover should automatically clear the failed pod since it's now in running state.
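The same toggle can be scripted. A minimal sketch, assuming the NebulaCluster spec field is enableAutoFailover and the cluster is cb in namespace cb (verify the exact field name with kubectl explain nc.spec before relying on it):

```shell
# Step 1: disable auto failover so the operator exits the failover loop and performs the scaling
kubectl patch nc cb -n cb --type merge -p '{"spec":{"enableAutoFailover":false}}'

# Step 2: repeat until all graphd pods are Running
kubectl get pods -n cb | grep graphd

# Step 3: re-enable auto failover; the failed pod entry should clear once the pod is Running
kubectl patch nc cb -n cb --type merge -p '{"spec":{"enableAutoFailover":true}}'
```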