/kind bug

**What steps did you take and what happened:**

The following code determines the ready state of an AzureMachinePool: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/90797931a191d5baf48bd6fa70c78f2207ad117f/azure/scope/machinepool.go#L571-L603

The following CAPI code is not run if the AzureMachinePool is not ready: https://github.com/kubernetes-sigs/cluster-api/blob/8d639f1fad564eecf5bda0a2ee03c8a38896a184/exp/internal/controllers/machinepool_controller_phases.go#L290-L319

If I'm right, these two pieces of logic together have the following effect:

- if the AzureMachinePool is scaling up or down, or has an issue (e.g. one VM failed during bootstrapping, resulting in `provisioningState: Failed`), the MachinePool is no longer reconciled until the ready status changes back
- this means, for example, that the `providerIDList` of the MachinePool is no longer updated
- cleanup or addition of new Machines is no longer processed

This is a bug that can lead to issues with the set of machines known in a cluster: e.g. cluster-autoscaler with the clusterapi provider doesn't know about certain machines.

I'm not sure whether the bug is in CAPZ or in CAPI:

- the VMSS can still scale up or down and keep functioning even while the AzureMachinePool is marked not ready; I feel this is a bug in CAPZ
- the providerID list should still be updated by CAPI, even if the AzureMachinePool is not marked ready; I feel this is a bug in CAPI, but as discussed in cluster-api#9858 that depends on the contract

**What did you expect to happen:**

Scaling up/down works without issues, and a single failed VM doesn't impact the functioning of the full VMSS.

**Anything else you would like to add:**

I guess this is initially more of a discussion point, because there could be multiple facets to this issue.
**Environment:**

- Kubernetes version (use `kubectl version`): 1.28.5
- OS (e.g. from `/etc/os-release`): linux/windows