FYI currently testing this change: https://github.com/helio/cluster-api-provider-azure/commit/73cdc0db8f2aaba2e9ab0d4b20939d3569d0173c
The change I did seems to somewhat work, although at high scale it reaches its limits because it only works if there are no failed/evicted VMSS VMs.
I ran a quick experiment related to the VMSS PUT API.
```sh
az rest --method put --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' --body @vmss.json --verbose -o json
```
Looking at the VMSS right afterwards, I see that the VMSS VMs now report "Latest Model: No", even though no changes have been made at all.
Is this an issue with the VMSS API, or does CAPZ need to verify that no changes have been made before executing a CreateOrUpdate on the VMSS?
Looking further into this: when doing an instance scale-in on the VMSS using the Azure Portal, it uses the VMSS PATCH API instead of PUT. This API behaves differently in that the "Latest model" property is not set to "No" on existing instances. Why this is the case is beyond my knowledge, but it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a patch call instead.
/priority backlog
> it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a patch call instead.
Apologies for the delay on this!! This honestly seems more like an Azure bug than anything else, as it doesn't make sense why the behavior would be different. If you have a fix for working around this in CAPZ, I think it would be appropriate to incorporate it. @jackfrancis @nojnhuh can you shed some light on this as well in case I'm missing something?
Investigating this again a bit more in detail. What I found out so far is that my initial conclusion was only halfway right: PATCH also creates a new model when we supply the CustomData (which was not included when I manually copied the VMSS JSON and made the request earlier).
It seems the diff is applied at a slightly more granular level. In my fork I added a change to completely remove `vmss.Properties`:
```go
func (s *ScaleSetSpec) existingParameters(ctx context.Context, existing interface{}) (parameters interface{}, err error) {
	// [..snip..]
	// If there are no model changes and no increase in the replica count, do not update the VMSS.
	// Decreases in replica count are handled by deleting AzureMachinePoolMachine instances in the MachinePoolScope
	if *vmss.SKU.Capacity <= existingInfraVMSS.Capacity && !hasModelChanges && !s.ShouldPatchCustomData {
		// up to date, nothing to do
		return nil, nil
	}
	// if there are no model changes and no change in custom data, get rid of all properties to avoid unnecessary VMSS model
	// updates.
	if !hasModelChanges && !s.ShouldPatchCustomData { // <-- these lines are new
		vmss.Properties = nil
	}
	return vmss, nil
}
```
This seems to work so far. It's not yet ready for review because `hasModelChanges` and `ShouldPatchCustomData` don't really test all possible differences, so we might need a more elaborate diff test.
This is also visible in the change history for that particular VMSS. After applying that change, whenever a capacity update is done, the `properties.VirtualMachineProfile.timeCreated` property is no longer updated. That's probably where the root cause is.
/kind bug
**What steps did you take and what happened:** When scaling up and down a MachinePool, it eventually reaches the point where Azure sends the mentioned error:
At this point, the VMSS can't be scaled anymore unless we manually press the update button within the portal (or do so via az CLI).
I believe most of the changes just come from bootstrap token TTL updates, but I'm not sure, since I haven't yet figured out how to compare/diff the model versions.
**What did you expect to happen:** VMSS can continue to scale without issues.
**Anything else you would like to add:** This issue may be tangentially related to #2975, since we might need to reflect the image model status based on what the Azure API says, and not our own logic.
A few questions for those who are more versed in Azure and CAPZ in general:
**Environment:**

- Kubernetes version (use `kubectl version`): 1.28.5
- OS (e.g. from `/etc/os-release`): Linux/Windows