vesoft-inc / nebula-operator

Operation utilities for Nebula Graph
https://vesoft-inc.github.io/nebula-operator
Apache License 2.0

Autofailover Stuck in Unable to Find Pod Status After graphd scale down #529

Closed: kevinliu24 closed this issue 3 weeks ago

kevinliu24 commented 1 month ago

Please check the FAQ documentation before raising an issue

Describe the bug (required)

Snap reported that graphd failed to scale up and start new pods after the nebula autoscaler increased the number of graphd replicas from 2 to 4. The desired replica count shown by kubectl describe for both the autoscaler and the nebula cluster is correct, but no new pods were started. Further investigation revealed the error E1007 18:17:25.249973 1 nebula_cluster_controller.go:196] NebulaCluster [cb/cb] reconcile failed: rebuilt graphd pod [cb/cb-graphd-2] not found, skip in the operator log, which was thrown during auto failover while checking the status of new pods. Also, kubectl get pods reveals only 2 graphd pods. This happened due to the following sequence:

  1. Auto failover was triggered for a graphd pod due to a failure such as a node going down.
  2. A new pod was started but stayed in pending state.
  3. Before the pod reached running state, the nebula autoscaler was triggered to scale down graphd, causing the new pod to be terminated.
  4. However, the new pod was never removed from the auto failover map because it never reached running state (currently auto failover only removes a pod from its map once the pod reaches running state).
  5. As a result, auto failover gets stuck looking for a pod that doesn't exist, and the new graphd pods fail to start because scaling currently happens only after auto failover succeeds.

Solution: Remove the pod from the auto failover map when it's terminated.

Related logs are attached below: Snap-na-describe-output.txt, cb_nc.txt, controller-manager-logs.txt, Snap-nc-pods-output.txt
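For reference, the state above can be inspected with commands along these lines. This is only a sketch: the cb namespace/cluster name comes from this environment, and the na short name for the NebulaAutoscaler resource is an assumption based on the attached file names (verify with kubectl api-resources):

```shell
# Desired replicas reported by the autoscaler and the cluster spec
kubectl describe na cb -n cb
kubectl describe nc cb -n cb

# Actual graphd pods (only 2 in this case)
kubectl get pods -n cb | grep graphd

# Operator log containing the "rebuilt graphd pod ... not found, skip" error
kubectl logs deploy/<nebula-operator-controller-manager> -n <operator-namespace> | grep "reconcile failed"
```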

Your Environments (required)

How To Reproduce (required)

Steps to reproduce the behavior (example commands are sketched after this list):

  1. Start a Kubernetes cluster and deploy NebulaGraph with multiple graphd pods. Make sure auto failover and local PV are turned on.
  2. Deploy the nebula autoscaler with maximum replicas >= the current graphd replicas.
  3. Cordon a node running a graphd pod and wait for the affected graphd pod to go into pending status.
  4. Modify the nebula autoscaler and set maximum replicas to < the current graphd replicas.
  5. Wait for the new pod to be terminated.
  6. Modify the nebula autoscaler again and set maximum replicas to > the current graphd replicas.
  7. New pods will fail to start and the cluster will be stuck in the auto failover state, with the error from the description showing up in the operator log.
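A rough sketch of the commands for steps 3-6, assuming the cluster and autoscaler are both named cb in namespace cb and that na is the NebulaAutoscaler short name (both assumptions; adjust to your environment):

```shell
# Step 3: cordon the node hosting one of the graphd pods
kubectl get pods -n cb -o wide | grep graphd   # find the node name
kubectl cordon <node-name>

# Step 4: lower maxReplicas below the current graphd replica count
kubectl edit na cb -n cb

# Step 5: wait for the rebuilt pod to be terminated
kubectl get pods -n cb -w

# Step 6: raise maxReplicas above the current graphd replica count
kubectl edit na cb -n cb

# Step 7: only the old graphd pods are listed and the operator log shows the reconcile error
kubectl get pods -n cb | grep graphd
```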

Expected behavior

Graphd should scale up and start new pods successfully

Additional context

All related logs and the cluster config are attached

kevinliu24 commented 1 month ago

For now Snap is working around this with the following steps (a scripted equivalent is sketched after the list), but this still needs to be fixed:

  1. Run kubectl edit nc and set Enable Auto Failover to false. This will allow the operator to get out of the loop and do the scaling.
  2. Wait for the new pods to start up.
  3. Then run kubectl edit nc again and set Enable Auto Failover back to true. Auto failover should automatically clear the failed pod since it's now in running state.
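The same toggle can be scripted. A minimal sketch, assuming the NebulaCluster spec field is enableAutoFailover and the cluster is cb in namespace cb (verify the exact field name with kubectl explain nc.spec before relying on it):

```shell
# Step 1: disable auto failover so the operator exits the failover loop and performs the scaling
kubectl patch nc cb -n cb --type merge -p '{"spec":{"enableAutoFailover":false}}'

# Step 2: repeat until all graphd pods are Running
kubectl get pods -n cb | grep graphd

# Step 3: re-enable auto failover; the failed pod entry should clear once the pod is Running
kubectl patch nc cb -n cb --type merge -p '{"spec":{"enableAutoFailover":true}}'
```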