kubernetes-sigs / cluster-api-provider-azure

Cluster API implementation for Microsoft Azure
https://capz.sigs.k8s.io/
Apache License 2.0

VMSS reaches limit number of models (10) and can't be scaled anymore #4958

Closed mweibel closed 1 week ago

mweibel commented 4 months ago

/kind bug

What steps did you take and what happened: When scaling a MachinePool up and down, it eventually reaches the point where Azure returns the following error:

Virtual Machine Scale Set '{name}' has reached its limit of 10 models that may be referenced by one or more VMs belonging to the Virtual Machine Scale Set. Upgrade the VMs to the latest model of the Virtual Machine Scale Set before trying again.

At this point, the VMSS can't be scaled anymore unless we manually press the update button within the portal (or do so via az CLI).
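For reference, the manual workaround via the az CLI looks roughly like this (resource group and scale set names are placeholders):

```shell
# Upgrade all instances to the latest VMSS model (same effect as the
# portal's "Upgrade" button); '*' targets every instance in the set.
az vmss update-instances \
  --resource-group my-rg \
  --name my-vmss \
  --instance-ids '*'
```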

I believe most of the changes just come from bootstrap token TTL updates but I'm not sure since I haven't yet figured out how to compare/diff the model versions.

What did you expect to happen: VMSS can continue to scale without issues.

Anything else you would like to add: This issue may be tangentially related to #2975 since we might need to reflect the image model status based on what Azure API says, and not our own logic.

A few questions for those who are more versed in Azure and CAPZ in general:

  1. The rolling update strategy deletes VMs not running the latest model. However, there's also the possibility to update a VM in place, which may or may not trigger a reboot. Shouldn't we do this instead of deleting?
  2. Could VMSS Flex help with this issue? Haven't yet tried that out but I plan to.

Environment:

mweibel commented 3 months ago

FYI currently testing this change: https://github.com/helio/cluster-api-provider-azure/commit/73cdc0db8f2aaba2e9ab0d4b20939d3569d0173c

mweibel commented 2 months ago

The change I made seems to work somewhat, although at high scale it reaches its limits because it only works if there are no failed/evicted VMSS VMs.

I ran a quick experiment related to the VMSS PUT API.

  1. copied the JSON data from the VMSS and added it to a file
  2. removed immutable properties (name, id, ...)
  3. executed a PUT az rest --method put --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' --body @vmss.json --verbose -o json
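Step 2 can be scripted, e.g. with jq (the property list below is illustrative, not exhaustive; the exact set of read-only fields may vary by API version):

```shell
# Drop immutable/read-only properties from the exported VMSS JSON
# before issuing the PUT. Field names here are examples only.
jq 'del(.id, .name, .etag, .properties.uniqueId, .properties.timeCreated)' \
  vmss-raw.json > vmss.json
```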

Looking at the VMSS right afterwards, I see that the VMSS VMs now report "Latest Model: No", even though no changes were made at all.

Is this an issue with the VMSS API or does CAPZ need to verify no changes have been made before executing a CreateOrUpdate on the VMSS?

mweibel commented 2 months ago

Looking further into this: when scaling in an instance via the Azure Portal, the portal uses the VMSS PATCH API instead of PUT. This API behaves differently in that the "Latest model" property is not set to "No" on existing instances. Why this is the case is beyond my knowledge, but it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a PATCH call instead.
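For comparison with the PUT above, a minimal PATCH that only bumps capacity can be issued the same way (subscription, resource group, and VMSS names are placeholders):

```shell
# PATCH merges the body into the existing resource, so sending only the
# sku should leave the rest of the model untouched.
az rest --method patch \
  --url '/subscriptions/{id}/resourcegroups/{rg}/providers/Microsoft.Compute/virtualMachineScaleSets/{vmssName}?api-version=2024-07-01' \
  --body '{"sku": {"capacity": 5}}'
```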

willie-yao commented 2 months ago

/priority backlog

willie-yao commented 2 months ago

it means that CAPZ could generate a patch from the existing VMSS parameters to the new ones and execute a patch call instead.

Apologies for the delay on this!! This honestly seems more like an Azure bug than anything else, as it doesn't make sense why the behavior would be different. If you have a fix for working around this in CAPZ, I think it would be appropriate to incorporate it. @jackfrancis @nojnhuh can you shed some light on this as well in case I'm missing something?

mweibel commented 1 month ago

Investigating this again in a bit more detail. What I found out so far is that my initial conclusion was only halfway right: PATCH also creates a new model when we supply the CustomData (which is not included when manually copying the VMSS JSON and making a request as I did earlier).

It seems the diff is applied at a slightly more granular level. In my fork I added a change to completely remove vmss.Properties:

func (s *ScaleSetSpec) existingParameters(ctx context.Context, existing interface{}) (parameters interface{}, err error) {
    // [..snip..]

    // If there are no model changes and no increase in the replica count, do not update the VMSS.
    // Decreases in the replica count are handled by deleting AzureMachinePoolMachine instances in the MachinePoolScope.
    if *vmss.SKU.Capacity <= existingInfraVMSS.Capacity && !hasModelChanges && !s.ShouldPatchCustomData {
        // up to date, nothing to do
        return nil, nil
    }

    // If there are no model changes and no change in custom data, drop all properties to avoid
    // unnecessary VMSS model updates.
    if !hasModelChanges && !s.ShouldPatchCustomData { // <-- these lines are new
        vmss.Properties = nil
    }

    return vmss, nil
}

This seems to work so far. It's not ready for review yet because hasModelChanges and ShouldPatchCustomData don't cover all possible differences, so we may need a more elaborate diff check.

This is also visible in the change history for that particular VMSS: after applying the change, the properties.VirtualMachineProfile.timeCreated property is no longer updated whenever a capacity update happens. That's probably where the root cause lies.

[screenshot: VMSS change history]