ionos-cloud / cluster-api-provider-proxmox

Cluster API Provider for Proxmox VE (CAPMOX)
Apache License 2.0
200 stars 25 forks source link

Machine gets stuck in provisioning state #291

Open khushalmer03 opened 2 months ago

khushalmer03 commented 2 months ago

What steps did you take and what happened:


- I'm trying to deploy single controlPlane cluster using proxmox vm template with talos initialized in it. Below is my cluster manifest that I'm using to provision talos cluster in proxmox vm

apiVersion: cluster.x-k8s.io/v1beta1 kind: Cluster metadata: name: talos-test spec: clusterNetwork: pods: cidrBlocks:


apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1 kind: ProxmoxCluster metadata: name: pride spec: controlPlaneEndpoint: host: "10.0.1.164" port: 6443 ipv4Config: addresses: [10.0.1.174-10.0.1.175] prefix: 20 gateway: 10.0.1.1 dnsServers: [10.0.1.1] allowedNodes: [px1] credentialsRef: name: "pride-proxmox-credentials"


apiVersion: controlplane.cluster.x-k8s.io/v1alpha3 kind: TalosControlPlane metadata: name: talos-test spec: version: v1.30.1 replicas: 1 infrastructureTemplate: kind: ProxmoxMachineTemplate apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1 name: talos-cp controlPlaneConfig: init: generateType: init controlplane: generateType: controlplane talosVersion: v1.7.4


apiVersion: infrastructure.cluster.x-k8s.io/v1alpha1 kind: ProxmoxMachineTemplate metadata: name: "talos-cp" spec: template: spec: sourceNode: "px1" templateID: 110 format: "qcow2" full: true numSockets: 1 numCores: 2 memoryMiB: 2048 disks: bootVolume: disk: scsi0 sizeGb: 8 network: default: bridge: vmbr0 model: virtio


Now the machine is created successfully in proxmox with IP assigned to it as well but machine phase in the management cluster is stuck in provisioning state as a result no further action of bootstrapping by talos takes place as it keeps waiting for infrastructure to be ready. Upon cheking the logs of capmox-controller-manager, this is what I found: 

E0923 08:06:57.505071 1 controller.go:329] "Reconciler error" err="failed to reconcile VM: error waiting for agent: the operation has timed out" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="talos-cp2-n9h6b" name="talos-cp2-n9h6b" reconcileID="f55e43c1-ba7d-43f5-bbee-8c1fd90b3202"


Note that I have already enables qemu-agent in the VM template as well. 

Status of machine in management cluster: 

![image](https://github.com/user-attachments/assets/c1a81206-b838-4115-8dad-abbdf2c1589f)

**What did you expect to happen:**
Machine is provisioned successfully and control plane is initialized.

**Environment:**

- Cluster-api-provider-proxmox version:v0.5.1 
- Kubernetes version: (use `kubectl version`):v1.30.1
- OS (e.g. from `/etc/os-release`): talos:v1.7.4
mcbenjemaa commented 4 weeks ago

Please share the cluster manifest, you're using and let us debug

rouke-broersma commented 1 week ago

I have the same issue I think. Relevant capmox logs:

E1122 21:34:46.708828       1 proxmoxmachine_controller.go:209] "error reconciling VM" err="unable to get cloud-init status: no pid returned from agent exec command" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="embla-cluster/embla-proxmox-control-plane-machine-template-8tt4r" namespace="embla-cluster" name="embla-proxmox-control-plane-machine-template-8tt4r" reconcileID="b447c5f3-11b6-41bc-a800-b7a75182246e" machine="embla-cluster/embla-talos-control-plane-qfhts" cluster="embla-cluster/embla-cluster"
E1122 21:34:46.709446       1 controller.go:329] "Reconciler error" err="failed to reconcile VM: unable to get cloud-init status: no pid returned from agent exec command" controller="proxmoxmachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="ProxmoxMachine" ProxmoxMachine="embla-cluster/embla-proxmox-control-plane-machine-template-8tt4r" namespace="embla-cluster" name="embla-proxmox-control-plane-machine-template-8tt4r" reconcileID="b447c5f3-11b6-41bc-a800-b7a75182246e"

Templates: https://github.com/broersma-forslund/homelab/tree/90723ca8735a855e537ac792ce102a8ca5260f8f/apps/infrastructure/embla-cluster/templates

Possibly related to #290? This specifically mentions talos, but seems like this crd version is not yet released. Any chance we could soon get a release with these fixes so we can deploy this with talos?

Edit:

I switched to using a cloud-init compatible talos image, however it seems like the cloud-init config is crashing talos:

image

Seems to be an issue in talos: https://github.com/siderolabs/talos/pull/9352

rouke-broersma commented 6 days ago

Can confirm that the issue I mentioned has been solved in talos 1.9 alpha.3. The only remaining issue I see is that capmox is not updating the node IPs in the machine CR which cause talos to wait with bootstrapping. This can be solved with the skipQemuCheck in the proxmoxmachine CR but this has not yet been released, it's on main only.

khushalmer03 commented 20 hours ago

@rouke-broersma Can you share working manifests ?

rouke-broersma commented 20 hours ago

@rouke-broersma Can you share working manifests ?

https://github.com/broersma-forslund/homelab/tree/main/apps%2Finfrastructure