clusterctl init with harvester InfrastructureProvider causes crashloop in caphv-controller pod

bcdurden commented 6 months ago

What happened: When installing cluster-api resources on to the bootstrap KinD cluster, the resulting caphv-controller pod will crash-loop without showing any particular error.

What did you expect to happen: Expected to see a running cluster-api infrastructure in the bootstrap KinD cluster as shown in the README.

How to reproduce it: Follow the README as described.

Anything else you would like to add: Have tried on KinD latest and 1.26.3. Also have tried on K3D. vSphere provider works fine.

Environment:

rke provider version: default / v0.2.7

OS (e.g. from /etc/os-release):

PRETTY_NAME="Ubuntu 22.04.3 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.3 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

Logs of caphv-controller

2024-03-28T17:25:55Z    INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2024-03-28T17:25:55Z    INFO    setup   starting manager
2024-03-28T17:25:55Z    INFO    Starting server {"kind": "health probe", "addr": "[::]:8081"}
2024-03-28T17:25:55Z    INFO    starting server {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
I0328 17:25:55.701404       1 leaderelection.go:245] attempting to acquire leader lease caphv-system/1e1658d6.cluster.x-k8s.io...
I0328 17:25:55.705868       1 leaderelection.go:255] successfully acquired lease caphv-system/1e1658d6.cluster.x-k8s.io
2024-03-28T17:25:55Z    DEBUG   events  caphv-controller-manager-645bdf8d77-fsvdj_e867e880-4d85-4bc9-8b7e-3bcaabe67677 became leader    {"type": "Normal", "object": {"kind":"Lease","namespace":"caphv-system","name":"1e1658d6.cluster.x-k8s.io","uid":"49415ac9-6a7e-4ce9-9bc4-ae7bd4d2a294","apiVersion":"coordination.k8s.io/v1","resourceVersion":"1172"}, "reason": "LeaderElection"}
2024-03-28T17:25:55Z    INFO    Starting EventSource    {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "source": "kind source: *v1alpha1.HarvesterMachine"}
2024-03-28T17:25:55Z    INFO    Starting EventSource    {"controller": "harvestercluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterCluster", "source": "kind source: *v1alpha1.HarvesterCluster"}
2024-03-28T17:25:55Z    INFO    Starting EventSource    {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "source": "kind source: *v1beta1.Machine"}
2024-03-28T17:25:55Z    INFO    Starting EventSource    {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "source": "kind source: *v1beta1.Cluster"}
2024-03-28T17:25:55Z    INFO    Starting Controller {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine"}
2024-03-28T17:25:55Z    INFO    Starting EventSource    {"controller": "harvestercluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterCluster", "source": "kind source: *v1.Secret"}
2024-03-28T17:25:55Z    INFO    Starting Controller {"controller": "harvestercluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterCluster"}
2024-03-28T17:25:55Z    INFO    Starting workers    {"controller": "harvestercluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterCluster", "worker count": 1}
2024-03-28T17:25:55Z    INFO    Starting workers    {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "worker count": 1}
2024-03-28T17:27:09Z    INFO    Stopping and waiting for non leader election runnables
2024-03-28T17:27:09Z    INFO    shutting down server    {"path": "/metrics", "kind": "metrics", "addr": "127.0.0.1:8080"}
2024-03-28T17:27:09Z    INFO    Stopping and waiting for leader election runnables
2024-03-28T17:27:09Z    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "harvestercluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterCluster"}
2024-03-28T17:27:09Z    INFO    Shutdown signal received, waiting for all workers to finish {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine"}
2024-03-28T17:27:09Z    INFO    All workers finished    {"controller": "harvestercluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterCluster"}
2024-03-28T17:27:09Z    INFO    All workers finished    {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine"}
2024-03-28T17:27:09Z    INFO    Stopping and waiting for caches
2024-03-28T17:27:09Z    INFO    Stopping and waiting for webhooks
2024-03-28T17:27:09Z    INFO    Wait completed, proceeding to shutdown the manager

bcdurden commented 5 months ago

I wanted to provide some extra info after a colleague asked me to run a test.

It seems I did not realize there are two containers in the controller, rookie mistake. It seems the other container is the culprit, though the path to fixing it I am not sure.

the kube-rbac-proxy container throws a cli error when this crash happens. It seems that whatever image is being used here is either not the correct one or the version is wrong.

flag provided but not defined: -secure-listen-address
Usage of /manager:
  -health-probe-bind-address string
        The address the probe endpoint binds to. (default ":9440")
  -kubeconfig string
        Paths to a kubeconfig. Only required if out-of-cluster.
  -leader-elect
        Enable leader election for controller manager. Enabling this will ensure there is only one active controller manager.
  -metrics-bind-address string
        The address the metric endpoint binds to. (default ":8080")
  -zap-devel
        Development Mode defaults(encoder=consoleEncoder,logLevel=Debug,stackTraceLevel=Warn). Production Mode defaults(encoder=jsonEncoder,logLevel=Info,stackTraceLevel=Error) (default true)
  -zap-encoder value
        Zap log encoding (one of 'json' or 'console')
  -zap-log-level value
        Zap Level to configure the verbosity of logging. Can be one of 'debug', 'info', 'error', or any integer value > 0 which corresponds to custom debug levels of increasing verbosity
  -zap-stacktrace-level value
        Zap Level at and above which stacktraces are captured (one of 'info', 'error', 'panic').
  -zap-time-encoding value
        Zap time encoding (one of 'epoch', 'millis', 'nano', 'iso8601', 'rfc3339' or 'rfc3339nano'). Defaults to 'epoch'.

Here are the args in the deployment:

- args:
    - --secure-listen-address=0.0.0.0:8443
    - --upstream=http://127.0.0.1:8080/
    - --logtostderr=true
    - --v=0

I tried editing the args in place, but all of them cause an error of some sort and removing them altogether causes a conflict with the other container sharing the 8080 port:

2024-04-26T18:51:58Z    INFO    controller-runtime.metrics  Metrics server is starting to listen    {"addr": "127.0.0.1:8080"}
2024-04-26T18:51:58Z    ERROR   controller-runtime.metrics  metrics server failed to listen. You may want to disable the metrics server or use another port if it is due to conflicts   {"error": "error listening on 127.0.0.1:8080: listen tcp 127.0.0.1:8080: bind: address already in use"}
sigs.k8s.io/controller-runtime/pkg/metrics.NewListener
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/metrics/listener.go:48
sigs.k8s.io/controller-runtime/pkg/manager.New
    /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.15.1/pkg/manager/manager.go:464
main.main
    /workspace/main.go:83
runtime.main
    /usr/local/go/src/runtime/proc.go:250
2024-04-26T18:51:58Z    ERROR   setup   unable to start manager {"error": "error listening on 127.0.0.1:8080: listen tcp 127.0.0.1:8080: bind: address already in use"}
main.main
    /workspace/main.go:92
runtime.main
    /usr/local/go/src/runtime/proc.go:250

belgaied2 commented 5 months ago

@bcdurden The file components.yaml used by clusterctl init to install the provider had 2 issues:

the kube-auth-proxy was referencing the wrong image, which explains why the flag not defined error.
The Liveness probe was configured incorrectly (command argument was using port 8081 whereas Healtcheck was using 9440 . Both issues should be fixed now (modified components.yaml file in the release, no change to the code).

belgaied2 commented 5 months ago

Fixed

rancher-sandbox / cluster-api-provider-harvester

clusterctl init with harvester InfrastructureProvider causes crashloop in caphv-controller pod #23