Closed: kevinliu24 closed this issue 3 weeks ago
Snap currently works around this as follows, but it still needs to be fixed: run `kubectl edit nc` and set Enable Auto Failover to `false`. This lets the operator break out of the loop and perform the scaling. Then run `kubectl edit nc` again and set Enable Auto Failover back to `true`. Auto failover should then automatically clear the failed pod, since it is now in the Running state.
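Assuming the toggle is the `spec.enableAutoFailover` flag on the NebulaCluster resource (an assumption; the exact field path and API version may differ by operator release), the workaround amounts to editing this fragment of the spec:

```yaml
# Hypothetical NebulaCluster fragment; the field path
# spec.enableAutoFailover is assumed, check your operator's CRD.
apiVersion: apps.nebula-graph.io/v1alpha1
kind: NebulaCluster
metadata:
  name: cb
  namespace: cb
spec:
  enableAutoFailover: false   # set back to true once scaling completes
```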
Please check the FAQ documentation before raising an issue
Describe the bug (required)
Snap reported that graphd failed to scale up and start new pods after the nebula autoscaler increased the number of graphd replicas from 2 to 4. The number of desired replicas shown by `kubectl describe` for both the autoscaler and the NebulaCluster is correct, but no new pods were started. Further investigation revealed the error

```
E1007 18:17:25.249973 1 nebula_cluster_controller.go:196] NebulaCluster [cb/cb] reconcile failed: rebuilt graphd pod [cb/cb-graphd-2] not found, skip
```

in the operator log, thrown during auto failover when checking the status of the new pods. `kubectl get pods` also reveals only 2 graphd pods. This happened due to the following sequence:

Solution: remove the pod from the auto failover map when it is terminated.
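The proposed solution can be sketched roughly as below. The map type, key format, and method name are hypothetical illustrations of the idea, not the operator's actual code:

```go
package main

import "fmt"

// failoverMap tracks pods that auto failover is waiting to rebuild,
// keyed by "namespace/podName" (hypothetical key format).
type failoverMap map[string]bool

// markTerminated removes a pod from the failover map when it is
// terminated (e.g. deleted during scaling), so later reconcile loops
// don't fail with "rebuilt pod not found" for a pod that no longer exists.
func (m failoverMap) markTerminated(namespace, podName string) {
	delete(m, namespace+"/"+podName)
}

func main() {
	m := failoverMap{"cb/cb-graphd-2": true}
	// Pod cb-graphd-2 is terminated while scaling; clear it from the map.
	m.markTerminated("cb", "cb-graphd-2")
	fmt.Println(len(m)) // 0
}
```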
Related logs are attached below:
- Snap-na-describe-output.txt
- cb_nc.txt
- controller-manager-logs.txt
- Snap-nc-pods-output.txt
Your Environments (required)
How To Reproduce (required)
Steps to reproduce the behavior:
Expected behavior
Graphd should scale up and start new pods successfully.
Additional context
All related logs and the cluster config are attached.