rancher-sandbox / cluster-api-provider-harvester

A Cluster API Infrastructure Provider for Harvester
Apache License 2.0

Bug: Tigera-Operator Pod stuck in CrashLoopBackOff #36

Closed PatrickLaabs closed 2 months ago

PatrickLaabs commented 3 months ago

What happened: When I try to provision a Kubernetes Cluster - as described in the README.md - my ControlPlane becomes ready, and I am able to connect to the new cluster with the kubeconfig.

Some pods start, but the important one, 'tigera-operator', keeps restarting:

2024/06/06 10:11:56 [INFO] Version: v1.29.0
2024/06/06 10:11:56 [INFO] Go Version: go1.18.9b7
2024/06/06 10:11:56 [INFO] Go OS/Arch: linux/amd64
2024/06/06 10:11:56 [ERROR] Get "https://10.43.0.1:443/api?timeout=32s": dial tcp 10.43.0.1:443: connect: network is unreachable

clusterctl describe cluster test-rk -n example-rk
NAME                                                                     READY  SEVERITY  REASON  SINCE  MESSAGE
Cluster/test-rk                                                          True                     81m
├─ClusterInfrastructure - HarvesterCluster/test-rk-hv
└─ControlPlane - RKE2ControlPlane/test-rk-control-plane                  True                     81m
  └─Machine/test-rk-control-plane-wvg4g                                  True                     81m
    └─MachineInfrastructure - HarvesterMachine/test-rk-cp-machine-sjtck

What did you expect to happen:

How to reproduce it:

Anything else you would like to add: I am currently trying to deploy a cluster with a single control plane node on Harvester, but the error also occurs when I try to deploy 1-2 control plane nodes with 1 worker node.

Environment:

PatrickLaabs commented 3 months ago

I also tried version 0.1.0 of the provider with version 0.2.0 of the RKE2 bootstrap and control plane providers, but with these versions I don't even get as far as a VM being created in my Harvester installation 😢

PatrickLaabs commented 3 months ago

Ok, after some investigation, it seems to work. I had to do some tweaking on my Harvester Instance (Since I am running a Single-Node-Cluster).

It seems it just took a while for Calico to recognize the IP of the Service.
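For anyone hitting the same delay, here is a rough sketch of how one could poll the Service address until it accepts TCP connections instead of guessing how long to wait. It only assumes the address from the error log above (10.43.0.1:443) and has to run from somewhere that shares the cluster network, such as a node or a debug pod. The function names `probe` and `wait_until_reachable` are made up for this example.

```python
import errno
import socket
import time


def probe(host, port, timeout=3.0):
    """Try a single TCP connection and return a short status string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "reachable"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # ENETUNREACH matches the "network is unreachable" message in the log.
        if exc.errno == errno.ENETUNREACH:
            return "network is unreachable"
        if exc.errno == errno.ECONNREFUSED:
            return "connection refused"
        return f"error: {exc}"


def wait_until_reachable(host, port, attempts=30, delay=10.0):
    """Poll until the endpoint accepts a connection or attempts run out."""
    for attempt in range(1, attempts + 1):
        status = probe(host, port)
        print(f"attempt {attempt}: {host}:{port} -> {status}")
        if status == "reachable":
            return True
        if attempt < attempts:
            time.sleep(delay)
    return False
```

Calling `wait_until_reachable("10.43.0.1", 443)` would then print one line per attempt. This only checks TCP reachability, not whether the API server answers requests correctly.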

PatrickLaabs commented 3 months ago

Another Update on this one:

I figured out that using the latest versions of the RKE2 control plane and bootstrap providers (version 0.3.0) causes some issues with setting the ProviderIDs on the created nodes. With version 0.2.2 the ProviderIDs are set.
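A quick way to check this is to look at `spec.providerID` on each Node object. Below is a minimal sketch that takes data shaped like the output of `kubectl get nodes -o json` and lists the nodes where the field is missing. The sample node names and the providerID value are placeholders for illustration, not real output from this cluster or the provider's actual ID format.

```python
import json


def nodes_missing_provider_id(node_list):
    """Return the names of nodes whose spec.providerID is empty or absent."""
    missing = []
    for node in node_list.get("items", []):
        if not node.get("spec", {}).get("providerID"):
            missing.append(node["metadata"]["name"])
    return missing


# Hypothetical sample shaped like `kubectl get nodes -o json` output.
# The providerID string is a placeholder, not the provider's real format.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "example-node-a"},
     "spec": {"providerID": "example://placeholder-id"}},
    {"metadata": {"name": "example-node-b"},
     "spec": {}}
  ]
}
""")

print(nodes_missing_provider_id(sample))
```

To run it against a real cluster, the sample could be replaced with `json.load(sys.stdin)` and the script fed from `kubectl get nodes -o json`.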

I created a new cluster with these versions. The Tigera-Operator may still fail at the beginning, but I just let it do its thing for a while and then restarted the pod. Now it's working.

PatrickLaabs commented 3 months ago

Hell, I love this :D Maybe someone can help me with this one..

Now that the Tigera Operator seems to work, the calico-node-x7p65 pod in calico-system (0/1, Init:CrashLoopBackOff) keeps crash-looping on my test-rk-cp-machine-ljlrl node.

On the other node, test-rk-workers-cjtjl-mlwgr, it is up and running:

calico-system calico-node-kxv5l ● 1/1 Running

install-cni 2024-06-07 13:04:23.740 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
install-cni W0607 13:04:23.740567       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.

install-cni 2024-06-07 13:04:23.749 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.43.0.1:443/api/v1/namespaces/calico-system/serviceaccounts/calico-node/token": dial tcp 10.43.0.1:443: connect: network is unreachable

PatrickLaabs commented 2 months ago

After some investigation, I can say that it was a Layer 8 problem: I had simply undersized my control plane deployment.