jansoukup opened 8 months ago
I would guess that maybe a controller is stuck. This could be confirmed via metrics (active workers or something) and via a goroutine dump of the controller (via kill -ABRT).
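A sketch of both checks, assuming a default CAPV install (the `capv-system` namespace, the `capv-controller-manager` deployment name, and the metrics port are assumptions; adjust to your deployment, and note that a distroless image may not ship a `kill` binary, in which case you'd send the signal from the node instead):

```shell
# Check controller-runtime's work-queue metrics for stuck workers
# (assumes the manager serves metrics on :8080):
kubectl -n capv-system port-forward deploy/capv-controller-manager 8080:8080 &
curl -s localhost:8080/metrics | grep controller_runtime_active_workers

# Trigger a goroutine dump: on SIGABRT the Go runtime prints every
# goroutine's stack to stderr before exiting; read it afterwards with
# the logs of the previous container instance:
kubectl -n capv-system exec deploy/capv-controller-manager -- kill -ABRT 1
kubectl -n capv-system logs deploy/capv-controller-manager --previous
```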
We feel we have the same problem, especially when deleting clusters nothing really happens until we restart CAPV. Will try to dump the controller next time
Could you please note which version of CAPV you had been using when this issue occurred?
For this env the combo is:

NAME                     NAMESPACE       TYPE                    CURRENT VERSION   NEXT VERSION
addon-helm               caaph-system    AddonProvider           v0.2.3            v0.2.4
bootstrap-talos          cabpt-system    BootstrapProvider       v0.6.5            Already up to date
control-plane-talos      cacppt-system   ControlPlaneProvider    v0.5.6            Already up to date
cluster-api              capi-system     CoreProvider            v1.7.2            v1.7.3
infrastructure-vsphere   capv-system     InfrastructureProvider  v1.10.0           v1.10.1
I'll try updating all of them and see.
Please check if you have --enable-keep-alive set: https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/2896. It should not be set; it can lead to deadlocks, and we've dropped it on main already.
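One way to verify, assuming the default namespace and deployment names (adjust if your install is customized):

```shell
# Print the manager container's args and look for the flag:
kubectl -n capv-system get deployment capv-controller-manager \
  -o jsonpath='{.spec.template.spec.containers[0].args}' | tr ',' '\n' \
  | grep keep-alive || echo "--enable-keep-alive not set"
```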
Not set!
I would guess that maybe a controller is stuck. This could be confirmed via metrics (active workers or something) and via a goroutine dump of the controller (via kill -ABRT).
^^ Should help to figure out where the controller is stuck
Same bug after updating the CAPV controller from v1.8.4 to v1.10.0 and CAPI from v1.5.3 to v1.7.2.
There's no way for anyone to debug this without a goroutine dump / stack traces. Until then, we can only recommend that anyone using older versions ensure --enable-keep-alive is set to false.
/kind bug
We have 1 management cluster with 7 workload clusters. Each workload cluster has ~25 worker nodes. Sometimes during the reconciliation of all workload clusters, CAPV stops reconciling without any significant information in the logs (nor in CAPI logs). No new VMs are visible in vCenter, nothing is deleted, and new Machines remain in the "Provisioning" state indefinitely. The quickest fix is to restart the CAPV deployment, after which everything runs smoothly again.
CAPV controller manager:
CAPI controller manager:
The logs omit the period where CAPV runs in this strange state; nothing relevant is written.
Our workaround is a scheduled Job that restarts CAPV twice per day.
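The restart the Job performs amounts to the following (deployment and namespace names are the defaults and may differ in your install); a CronJob running a kubectl image with RBAC to patch the deployment can schedule it:

```shell
# Roll the CAPV manager pods and wait for the new replica to be ready:
kubectl -n capv-system rollout restart deployment capv-controller-manager
kubectl -n capv-system rollout status deployment capv-controller-manager
```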
Environment:
- Kubernetes version (use kubectl version): 1.24.17
- OS (e.g. from /etc/os-release): Ubuntu 22.04