kubernetes-sigs / cloud-provider-equinix-metal

Kubernetes Cloud Provider for Equinix Metal (formerly Packet Cloud Controller Manager)
https://deploy.equinix.com/labs/cloud-provider-equinix-metal
Apache License 2.0
74 stars 26 forks source link

Retry configuring controllers or exit on setup error #140

Closed invidian closed 3 years ago

invidian commented 3 years ago

If API is unavailable during setup and for example we fail to fetch kube-apiserver namespace like in logs below, then CCM is running, but no Service will get IP address assigned until CCM is restarted.

Setup should either be retried or process should exit to avoid getting stuck in broken state.

Logs below confirm it:

Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0128 21:09:41.741041       1 main.go:235] authToken: '<masked>'
I0128 21:09:41.741123       1 main.go:235] projectID: '0361e5d1-8d07-44af-8ad0-f88c8c0578a2'
I0128 21:09:41.741148       1 main.go:235] load balancer config: ''%!s(MISSING)
I0128 21:09:41.741206       1 main.go:235] metallb://
I0128 21:09:41.741257       1 main.go:235] facility: 'ewr1'
I0128 21:09:41.741296       1 main.go:235] peer ASN: '65530'
I0128 21:09:41.741355       1 main.go:235] local ASN: '65000'
I0128 21:09:41.741438       1 main.go:235] Elastic IP Tag: ''
I0128 21:09:41.741463       1 main.go:235] API Server Port: '6443'
I0128 21:09:41.741486       1 main.go:235] BGP Node Selector: ''
Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
Flag --allow-untagged-cloud has been deprecated, This flag is deprecated and will be removed in a future release. A cluster-id will be required on cloud instances.
I0128 21:09:44.575227       1 serving.go:331] Generated self-signed cert in-memory
I0128 21:09:45.652794       1 controllermanager.go:127] Version: v0.0.0-master+$Format:%h$
I0128 21:09:45.663835       1 secure_serving.go:197] Serving securely on [::]:10258
I0128 21:09:45.667055       1 tlsconfig.go:240] Starting DynamicServingCertificateController
E0128 21:09:46.693789       1 cloud.go:110] could not initialize loadbalancer: failed to get kube-system namespace: Get "https://kube-apiserver/api/v1/namespaces/kube-system": dial tcp 100.66.57.80:443: connect: connection refused
I0128 21:09:46.694130       1 node_controller.go:108] Sending events to api server.
W0128 21:09:46.694174       1 cloud.go:152] The Equinix Metal cloud provider does not support InstancesV2

After re-creating CCM pod, Services correctly gets IP assigned.

invidian commented 3 years ago

Well, the dummy fix is the following:

diff --git metal/cloud.go metal/cloud.go
index 6902b27..4fae75c 100644
--- metal/cloud.go
+++ metal/cloud.go
@@ -107,8 +107,7 @@ func (c *cloud) Initialize(clientBuilder cloudprovider.ControllerClientBuilder,
        serviceReconcilers := []serviceReconciler{}
        for _, elm := range c.services() {
                if err := elm.init(clientset); err != nil {
-                       klog.Errorf("could not initialize %s: %v", elm.name(), err)
-                       return
+                       klog.Fatalf("could not initialize %s: %v", elm.name(), err)
                }
                if n := elm.nodeReconciler(); n != nil {
                        nodeReconcilers = append(nodeReconcilers, n)

I think in the long term, reconciler should have access to the namespace informer and fetch it each on each reconcilation loop. This should simplify the whole process.

deitch commented 3 years ago

Part of the problem is that Initialize(), which passed the ControllerClientBuilder, does not return any error. Given that most do ClientOrDie(), I am thinking that your Fatalf()makes sense. In any case, this runs as a Deployment, so it would get restarted by Kubernetes.

For now, please open a PR with the above and I will merge it in.

invidian commented 3 years ago

Created https://github.com/equinix/cloud-provider-equinix-metal/pull/147.

displague commented 3 years ago

Thanks, @invidian. Closing this.

invidian commented 3 years ago

@displague why closing this? The PR I created is barely a workaround IMO. I think code should be restructured, so this operation can be retried and also to support other controllers retry the initialization process.