Open dhaugli opened 4 months ago
So after looking around and thinking a bit, I see that our CAPHV controller is waiting for the providerID to be set:
```
2024-06-07T06:13:18Z INFO Waiting for ProviderID to be set on Node resource in Workload Cluster ... {"controller": "harvestermachine", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "HarvesterMachine", "HarvesterMachine": {"name":"capi-mgmt-p-01-7d7pr","namespace":"cluster-capi-mgmt-p-01"}, "namespace": "cluster-capi-mgmt-p-01", "name": "capi-mgmt-p-01-7d7pr", "reconcileID": "d258bfe4-ba85-4d61-92e0-d6ee8aced78d", "machine": "cluster-capi-mgmt-p-01/capi-mgmt-p-01-n48cx", "cluster": "cluster-capi-mgmt-p-01/capi-mgmt-p-01"}
```
I see that your examples include the CPI as a DaemonSet. Doesn't that mean it will never set the providerID on Talos, since the node needs to be bootstrapped before the DaemonSet can even start, yet the DaemonSet is what would set the providerID? https://github.com/rancher-sandbox/cluster-api-provider-harvester/blob/main/templates/cluster-template-kubeadm.yaml#L190
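For what it's worth, external cloud providers usually handle this chicken-and-egg by tainting new nodes with `node.cloudprovider.kubernetes.io/uninitialized` and having the CPI DaemonSet tolerate that taint, so the pod can schedule before the node is initialized. A minimal sketch of what such a toleration looks like (names and image here are illustrative, not taken from the CAPHV templates):

```yaml
# Illustrative DaemonSet fragment only: the tolerations let the CPI pod
# schedule on nodes whose providerID has not been set yet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: harvester-cloud-provider   # example name, not the real manifest
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: harvester-cloud-provider
  template:
    metadata:
      labels:
        app: harvester-cloud-provider
    spec:
      tolerations:
      # Allow scheduling before the cloud provider initializes the node
      - key: node.cloudprovider.kubernetes.io/uninitialized
        value: "true"
        effect: NoSchedule
      - key: node-role.kubernetes.io/control-plane
        effect: NoSchedule
      containers:
      - name: cloud-provider
        image: example/harvester-cloud-provider:dev  # placeholder image
```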
IMHO the controller should be able to get the HarvesterMachine into a state where the Machine object phases into Provisioned, so that other controllers, whether Talos' or anyone else's, can work with it. Isn't that the normal flow?
Hi @dhaugli, which version of the rke2 controlplane and bootstrap provider are you using?
We are using Talos bootstrap and Talos controlplane provider in this case.
I have now followed the example from the templates, but it still doesn't work, and I think I know why: the CAPH controller doesn't propagate the IP addresses of the machines into the Machine object, like:
```yaml
status:
  addresses:
  - address: <IP>
    type: ExternalIP
  - address: <IP>
    type: ExternalIP
  - address: <DNS NAME OF MACHINE>
    type: InternalDNS
```
For reference, the vSphere CAPI controller does this. Without it, the Talos bootstrap controller can't see the IP and can't continue the bootstrap process. My machines do get IPs on my network, and the QEMU agent reports them through Harvester.
I found the issue with the CAPH controller, based on the Cluster API principles for how bootstrapping should work:
The CAPH controller does not mark the machine as ready in the infrastructure provider (even though it is running just fine as a VM in Harvester), because it is waiting for the provider ID. As a result the LB is never created, and with Talos the nodes just end up waiting forever in the bootstrap process and never progress.
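For context, the Cluster API infrastructure provider contract says a Machine only moves to Provisioned once the infrastructure machine reports `spec.providerID` and `status.ready: true`. A sketch of what an unblocked HarvesterMachine would roughly need to look like (the apiVersion and the providerID format shown are my assumptions, not confirmed from the CAPHV code):

```yaml
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1  # assumed version
kind: HarvesterMachine
metadata:
  name: capi-mgmt-p-01-7d7pr
  namespace: cluster-capi-mgmt-p-01
spec:
  # Per the CAPI contract the infra provider must set spec.providerID;
  # the exact scheme below is a guess, not the real CAPHV format.
  providerID: harvester://<vm-namespace>/<vm-name>
status:
  # ...and flip status.ready to true so the owning Machine can phase
  # into Provisioned and the rest of the flow (LB, bootstrap) can continue.
  ready: true
  addresses:
  - address: <IP>
    type: ExternalIP
```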
My friend Endre just made a fix in our own image. It still doesn't work, but we are working on it as well.
What happened:
The cluster never comes up: the Harvester LoadBalancer is not created, and the machines never leave the Provisioning state. The machines are provisioned in Harvester and get IPs from my network, and I can attach a console to them, though since it's Talos there isn't much to see.
Screenshot of the console of one of the Talos control-plane VMs:
caph-provider logs:
1)
2) These two log entries keep repeating.
capt-controller-manager logs:
cabpt-talos-bootstrap logs (I don't know if this is relevant):
What did you expect to happen: I expected the CAPH provider to create the LB and proceed with creating the cluster.
How to reproduce it:
I added the providers for Talos (bootstrap and control plane) and of course the Harvester provider.
Added 4 files plus the Harvester secret with the following configuration:
cluster.yaml:
harvester-cluster.yaml:
harvester-machinetemplate.yaml:
controlplane.yaml:
Anything else you would like to add:
I have tried switching the LoadBalancer config from dhcp to ipPoolRef with a pre-configured IP pool, but that did not work either. I think it's related to the LB never being provisioned in the first place.
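For reference, the two variants I tried look roughly like this (field names follow my reading of the HarvesterCluster spec and may not match the CRD exactly):

```yaml
# Variant 1: load balancer IP obtained via DHCP
loadBalancerConfig:
  ipamType: dhcp

# Variant 2: pre-configured IP pool (also did not work for me)
loadBalancerConfig:
  ipamType: pool
  ipPoolRef: my-ip-pool   # name of an existing Harvester IPPool; illustrative
```

In both cases the LB object itself never appears, so I doubt the IPAM mode is the root cause.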
Environment:
- OS (e.g. from /etc/os-release):