kubernetes / autoscaler

Autoscaling components for Kubernetes
Apache License 2.0

CAS appears to lose track of instances in VMSS and makes incorrect scaling decisions #7432

Closed: evanfoster closed this issue 1 day ago

evanfoster commented 2 weeks ago

Which component are you using?: cluster-autoscaler

What version of the component are you using?:

Component version: v1.28.6

What k8s version are you using (kubectl version)?:

kubectl version output may not be accurate, since our data is from a month ago. Instead, I've pulled the cluster version from the logs: v1.29.6

What environment is this in?: Azure, manually running a k8s cluster on top of VMSSes (not AKS).

What did you expect to happen?: We expected the cluster autoscaler to scale a VMSS up to a size larger than it currently was.

What happened instead?: The cluster autoscaler mistakenly resized a VMSS from 86 nodes down to 53, based on an erroneous in-memory value showing that the VMSS had 48 nodes.

How to reproduce it (as minimally and precisely as possible): Reproing this has been really difficult. We're not sure exactly what causes this to occur. Each time we've seen it happen, it has been preceded by errors from the Azure API. Specifically, the error rate for vmss_wait_for_delete_instances_result calls increases to 100%, with the CAS reporting node deletion timeouts along with other errors.

Anything else we need to know?:

We saw some errors like this prior to the scale-down event:

E1003 22:15:28.247174       1 azure_scale_set.go:200] Failed to update the capacity for vmss <elided> with error Code="CannotAllocateRemainingVMsInAvailabilitySet" Message="Delete/Deallocate operation on VM 'vmss-<elided>_679' failed because the remaining VMs in the Availability Set '_vmss-<elided>_<elided>b3' cannot be allocated together. Changes in Availability Set allocation need to be executed atomically. Please deallocate or delete some or all of these VMs before retrying the current operation. \r\nDetails: Allocation failed. If you are trying to add a new VM to a Virtual Machine Scale Set with a single placement group or update/resize an existing VM in a Virtual Machine Scale Set with a single placement group, please note that such allocation is scoped to a single cluster, and it is possible that the cluster is out of capacity. Please read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance." Target="679", invalidate the cache so as to get the real size from API

The CAS really wanted to delete this instance (along with 3-4 others) and kept retrying the deletion over and over.

We also saw the in-memory node count for the VMSS steadily decrease, and this decrease appears to be correlated with the erroneous resize:

I1003 22:31:58.325188       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 93
I1003 22:32:51.759222       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 88
I1003 22:33:53.728006       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 82
I1003 22:34:50.949732       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 78
I1003 22:35:58.258156       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 72
I1003 22:36:57.201547       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 67
I1003 22:37:54.935086       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 61
I1003 22:38:57.801178       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 57
I1003 22:39:43.711224       1 azure_scale_set.go:149] VMSS: vmss-<elided> returning in-memory size: 53
Here are the relevant logs bracketing that last line.

```
I1003 22:39:43.260997 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 53
I1003 22:39:43.261054 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 53
I1003 22:39:43.710852 1 clusterstate.go:1094] Failed adding 1 nodes (1 unseen previously) to group vmss- due to OutOfResource.provisioning-state-failed; errorMessages=[]string{"Azure failed to provision a node for this node group"}
I1003 22:39:43.711211 1 static_autoscaler.go:854] Deleting 1 from vmss- node group because of create errors
# original log line from above
I1003 22:39:43.711224 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 53
I1003 22:39:43.711258 1 azure_scale_set.go:376] Deleting vmss instances [azure:///subscriptions//resourceGroups//providers/Microsoft.Compute/virtualMachineScaleSets/vmss-/virtualMachines/679]
I1003 22:39:43.711315 1 azure_scale_set.go:426] Calling virtualMachineScaleSetsClient.DeleteInstancesAsync(&[679])
I1003 22:39:43.894553 1 static_autoscaler.go:439] Some nodes that failed to create were removed, skipping iteration
I1003 22:39:43.894587 1 azure_scale_set.go:184] Calling virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(&[679]) for vmss-
I1003 22:39:54.302812 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 52
I1003 22:39:55.477245 1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
I1003 22:39:55.477336 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 52
I1003 22:39:55.873546 1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 396.27342ms
I1003 22:39:57.169224 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 52
I1003 22:39:57.173325 1 clusterstate.go:1094] Failed adding 1 nodes (1 unseen previously) to group vmss- due to OutOfResource.provisioning-state-failed; errorMessages=[]string{"Azure failed to provision a node for this node group"}
I1003 22:39:57.173740 1 static_autoscaler.go:854] Deleting 1 from vmss- node group because of create errors
I1003 22:39:57.173754 1 azure_scale_set.go:149] VMSS: vmss-, returning in-memory size: 52
I1003 22:39:57.173794 1 azure_scale_set.go:376] Deleting vmss instances [azure:///subscriptions//resourceGroups//providers/Microsoft.Compute/virtualMachineScaleSets/vmss-/virtualMachines/679]
I1003 22:39:57.173867 1 azure_scale_set.go:426] Calling virtualMachineScaleSetsClient.DeleteInstancesAsync(&[679])
I1003 22:39:57.330951 1 static_autoscaler.go:439] Some nodes that failed to create were removed, skipping iteration
I1003 22:39:57.331031 1 azure_scale_set.go:184] Calling virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(&[679]) for vmss-
```

At 22:40, the CAS logged this:

I1003 22:40:39.641141       1 azure_scale_set.go:149] VMSS: vmss-<elided>, returning in-memory size: 48
I1003 22:40:40.215536       1 orchestrator.go:295] Final scale-up plan: [{vmss-<elided> 48->53 (max: 1000)}]
I1003 22:40:40.215553       1 executor.go:147] Scale-up: setting group vmss-<elided> size to 53

At the time, the VMSS in question had 86 healthy instances, not 48. The graph below shows the CAS Ready/Unregistered stats, versus the number of nodes based on unique running kubelets. 5 seconds after the CAS scaled up the VMSS from its supposed size of 48 to 53, the VMSS began shrinking from 86 down to 53.

[screenshot_013: graph of the CAS Ready/Unregistered node counts vs. the node count derived from unique running kubelets]

After the scale-down finished, we started seeing occurrences of logs like this:

E1003 22:49:02.134021       1 azure_scale_set.go:191] virtualMachineScaleSetsClient.WaitForDeleteInstancesResult - DeleteInstances for instances &[679] for vmss-<elided> failed with error: Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded

It really seems like the CAS is losing track of the instance count it's getting from Azure.

Shubham82 commented 2 weeks ago

/area provider/azure
/area cluster-autoscaler

Shubham82 commented 2 weeks ago

cc @jackfrancis @comtalyst PTAL!

jackfrancis commented 1 week ago

@evanfoster what verbosity level are the cluster-autoscaler logs set to?

Is there a reason you're using the "azure" cluster-autoscaler provider and not the "clusterapi" provider? Cluster API does support the "cluster.x-k8s.io/replicas-managed-by" MachinePool annotation, but a cleaner solution would be to have cluster-autoscaler communicate directly with Cluster API to scale machines.

evanfoster commented 1 week ago

Hey @jackfrancis,

I'll get back to you w.r.t. the logging level.

We're not using CAPI (or AKS) with this cluster. The cluster is entirely manually managed using some bespoke software and is built on top of plain Azure VMSSes and VMs. It's very old-school, so the Cluster Autoscaler has to talk directly to Azure's VMSS scaling API.

jackfrancis commented 1 week ago

@evanfoster thanks for the clarification, that helps!

If we can get log output at a verbosity of at least 4, we'll have more information about what pattern is causing this pathology.

evanfoster commented 1 week ago

Shoot, we're on level 3. We don't have a good way to repro this and the consequences of it are fairly catastrophic, so I'm hoping I can grab some other data from outside the CAS to pin things down a bit more. I'll also raise the log level to 4 just in case.

EDIT: We're using Helm to deploy the CAS. Will this suffice? `extraArgs: {"logtostderr":true,"stderrthreshold":"info","v":4}`

EDIT: yep, v: "4" should do it. We're already explicitly setting v: "3".

evanfoster commented 1 week ago

I have an idea of what the issue might be.

The current VMSS size is proactively decremented here, regardless of whether or not the instance deletion was a success: https://github.com/kubernetes/autoscaler/blob/daeb681311f0d14688e499ba59e8c43ee089e7da/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L437-L442

On repeated failed deletion attempts, scaleSet.curSize is rapidly decremented until it reaches 0, or even negative numbers in some cases. I wasn't able to find anything else that decrements curSize, so this seems to be a smoking gun.
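
To make the failure mode concrete, here's a minimal, self-contained sketch (hypothetical names, not the actual provider code) of what repeated failed deletions do to a counter that is decremented when the delete is issued rather than when it succeeds:

```go
package main

import "fmt"

// fakeScaleSet models only the piece that matters here: an in-memory size,
// analogous to ScaleSet.curSize, that is trusted until the next cache refresh.
type fakeScaleSet struct {
	curSize int64
}

// tryDelete models the proactive decrement: the cached size is reduced as soon
// as the delete is issued, regardless of whether the deletion later fails.
func (s *fakeScaleSet) tryDelete(count int64, succeeded bool) {
	s.curSize -= count
	if !succeeded {
		// In the real provider the delete is retried later, but the
		// decrement is never rolled back.
		fmt.Printf("delete of %d instances failed; in-memory size is now %d\n", count, s.curSize)
	}
}

func main() {
	s := &fakeScaleSet{curSize: 93}
	// A few stuck instances that Azure refuses to delete, retried once per
	// loop, walk the counter down even though the VMSS never shrinks.
	for i := 0; i < 6; i++ {
		s.tryDelete(5, false)
	}
	fmt.Printf("final in-memory size: %d (actual VMSS size still 93)\n", s.curSize)
}
```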

I propose that instead of directly mutating curSize preemptively, we evaluate what the proposed state would look like and do the VMSS min size check before we schedule VMSS instance deletion futures. That's a terrible description, so I'll write what I think might solve the issue (with some major caveats that I've included as comments):

func (scaleSet *ScaleSet) waitForDeleteInstances(future *azure.Future, requiredIds *compute.VirtualMachineScaleSetVMInstanceRequiredIDs) {
    // Evan: ...snip...
    klog.V(3).Infof("Calling virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(%v) for %s", requiredIds.InstanceIds, scaleSet.Name)
    httpResponse, err := scaleSet.manager.azClient.virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(ctx, future, scaleSet.manager.config.ResourceGroup)
    isSuccess, err := isSuccessHTTPResponse(httpResponse, err)
    if isSuccess {
        scaleSet.sizeMutex.Lock()
        // Evan: Somehow get the instance count. I'm assuming the future doesn't contain it
        // since the synchronous method only returns an error. If that's the case, then we'll
        // have to pass the number of instances in from DeleteInstances. I'm not sure how this
        // would handle a partial failure, where only a subset of the instances were ultimately deletable.
        scaleSet.curSize -= instanceCount
        scaleSet.lastSizeRefresh = time.Now()
        scaleSet.sizeMutex.Unlock()
        klog.V(3).Infof("virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(%v) for %s success", requiredIds.InstanceIds, scaleSet.Name)
        return
    }
    // Evan: ...snip...

func (scaleSet *ScaleSet) DeleteInstances(instances []*azureRef, hasUnregisteredNodes bool) error {
    // Evan: ...snip...
    instanceIDs := []string{}
    for _, instance := range instancesToDelete {
        instanceID, err := getLastSegment(instance.Name)
        if err != nil {
            klog.Errorf("getLastSegment failed with error: %v", err)
            return err
        }
        instanceIDs = append(instanceIDs, instanceID)
    }
    // Proactively check our size so that we don't
    // go below minimum node count if cache data is stale
    // only do it for non-unregistered nodes
    if !hasUnregisteredNodes {
        scaleSet.sizeMutex.Lock()
        proposedSize := scaleSet.curSize - int64(len(instanceIDs))
        if proposedSize <= int64(scaleSet.MinSize()) {
                // Evan: Using some heuristic, pick the nodes that make the most sense to not delete.
                // I lack the context to do that, so I'm just going to pop the delta off of the slice.
                sizeDelta := int64(scaleSet.MinSize()) - proposedSize
                klog.V(4).Infof("Deleting %d nodes would put VMSS %q below min size. Removing %d nodes from those to be deleted.", len(instanceIDs), commonAsg, sizeDelta)
                instanceIDs = instanceIDs[:int64(len(instanceIDs))-sizeDelta] // This could have an off-by-one

        }
        scaleSet.sizeMutex.Unlock()
    }
    requiredIds := &compute.VirtualMachineScaleSetVMInstanceRequiredIDs{
        InstanceIds: &instanceIDs,
    }

    scaleSet.instanceMutex.Lock()
    klog.V(3).Infof("Calling virtualMachineScaleSetsClient.DeleteInstancesAsync(%v)", requiredIds.InstanceIds)
    future, rerr := scaleSet.manager.azClient.virtualMachineScaleSetsClient.DeleteInstancesAsync(ctx, resourceGroup, commonAsg.Id(), *requiredIds, false)
    scaleSet.instanceMutex.Unlock()
    if rerr != nil {
        klog.Errorf("virtualMachineScaleSetsClient.DeleteInstancesAsync for instances %v failed: %v", requiredIds.InstanceIds, rerr)
        return rerr.Error()
    }

The underlying cache doesn't seem to be the problem here. I really think the issue is just that we're decrementing curSize too opportunistically. Something else that makes me think this is the case is what this line logged: https://github.com/kubernetes/autoscaler/blob/daeb681311f0d14688e499ba59e8c43ee089e7da/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L168

I1101 14:35:40.211471       1 azure_scale_set.go:177] VMSS: vmss-<elided>, in-memory size: -4, new size: 1

At 14:35, there really was only one instance running, based on the number of running kubelets in the cluster. The in-memory size (based on curSize) was impossibly wrong. Unless it's possible for https://pkg.go.dev/github.com/Azure/azure-sdk-for-go@v68.0.0+incompatible/services/compute/mgmt/2022-08-01/compute#Sku to return a negative number, I think we can rule out an underlying cache problem.

(As an aside, I think I see why you wanted the log level set at 4. I'm guessing it was to see what azureCache.regenerate logs.)

jackfrancis commented 1 week ago

@evanfoster feel free to submit a PR! it is clear you have a better handle on this than anyone at this point 😄 💪

evanfoster commented 1 week ago

I may end up doing that, although I'll be on PTO next week and there are some areas where I'm missing context. Specifically, I have two concerns. The first is this:

// Evan: Using some heuristic, pick the nodes that make the most sense to not delete.
// I lack the context to do that, so I'm just going to pop the delta off of the slice.

I imagine there's a more optimal way of choosing which nodes to skip deleting. Chopping the tail off like that is easy, but there may be a better heuristic. I'm also not sure what effects the caller of ScaleSet.DeleteInstances will experience if we don't delete all of the instances it asked us to. This isn't a huge deal, but I do worry about it.

Next, there's this:

// Evan: Somehow get the instance count. I'm assuming the future doesn't contain it
// since the synchronous method only returns an error. If that's the case, then we'll
// have to pass the number of instances in from DeleteInstances. I'm not sure how this
// would handle a partial failure, where only a subset of the instances were ultimately deletable.

Passing the instance count into ScaleSet.waitForDeleteInstances doesn't bother me overly much. My main concern is about whether or not partial failures can happen, and what we do with ScaleSet.curSize in that case. If we can never have a partial failure then there should be no problem at all. If partial failures can occur, then we'd need to know how many succeeded and how many failed. If we can't accurately determine that, we might have to force a cache refresh to get the true number from Azure, which is less than ideal (having experienced the fallout from API ratelimiting before the caching mechanism was introduced). It's a bit hard for me to get the exact details on what the azure.Future really returns in this method. I'm decent enough with Go, but I wouldn't consider myself an expert by any means. Is there a way to get the details of what succeeded vs what failed from this method? Or is forcing a value re-read based off of a cache refresh the best way to go?
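
For what it's worth, here's a small self-contained toy of the ordering I have in mind (hypothetical names, not a patch): only decrement the cached size once the delete result is known, and fall back to re-reading the real size from Azure when the outcome, including a possible partial failure, can't be determined.

```go
package main

import "fmt"

// cachedScaleSet stands in for ScaleSet: a cached size plus a way to force a
// re-read from the API (i.e., a cache refresh). Both fields are hypothetical.
type cachedScaleSet struct {
	curSize     int64
	readFromAPI func() int64
}

// onDeleteResult runs once the delete future completes. succeededCount may be
// smaller than requestedCount if only some instances were deletable.
func (s *cachedScaleSet) onDeleteResult(requestedCount, succeededCount int64, resultKnown bool) {
	if resultKnown {
		if succeededCount < requestedCount {
			fmt.Printf("partial failure: %d of %d deletes succeeded\n", succeededCount, requestedCount)
		}
		// Decrement only by what we know actually went away.
		s.curSize -= succeededCount
		return
	}
	// Unknown outcome (timeout, cancelled context, etc.): trust Azure rather
	// than guessing, at the cost of an extra API read.
	s.curSize = s.readFromAPI()
}

func main() {
	s := &cachedScaleSet{curSize: 86, readFromAPI: func() int64 { return 86 }}
	s.onDeleteResult(5, 0, false) // delete timed out; outcome unknown
	fmt.Println(s.curSize)        // stays 86 instead of drifting toward 81
}
```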

jackfrancis commented 1 week ago

@willie-yao would you like to pick this up next week? We'll want to verify that this is OK for the AKS implementations as well.

cc @comtalyst @tallaxes

willie-yao commented 1 week ago

@jackfrancis Sure!

/assign

comtalyst commented 1 week ago

DeleteInstances() is only called by DeleteNodes(), shown below. DeleteNodes() also checks the group's min size and appears to reject the request immediately if the size would go too low.

https://github.com/kubernetes/autoscaler/blob/daeb681311f0d14688e499ba59e8c43ee089e7da/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L454C1-L464C3

// GetScaleSetSize gets Scale Set size.
func (scaleSet *ScaleSet) GetScaleSetSize() (int64, error) {
    return scaleSet.getCurSize()
}

// DeleteNodes deletes the nodes from the group.
func (scaleSet *ScaleSet) DeleteNodes(nodes []*apiv1.Node) error {
    klog.V(8).Infof("Delete nodes requested: %q\n", nodes)
    size, err := scaleSet.GetScaleSetSize()
    if err != nil {
        return err
    }

    if int(size) <= scaleSet.MinSize() {
        return fmt.Errorf("min size reached, nodes will not be deleted")
    }

    ...

    return scaleSet.DeleteInstances(refs, hasUnregisteredNodes)
}

For @evanfoster's concern about the heuristic: I don't think we need to check min size again, unless there are race condition concerns, in which case we could use the same rejection logic? I don't think provider-level optimization for this is needed or safe (e.g., if we cannot "delete all the instances they asked us to", we would rather return an error to make it clear), since the decision is made by the core, which has already taken min size into account.
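
If race conditions do turn out to be a concern, the same rejection style could be reused inside DeleteInstances itself. A rough fragment of that idea (hypothetical placement, mirroring the DeleteNodes check above; not existing code):

```go
// Hypothetical guard inside DeleteInstances, reusing DeleteNodes' rejection
// style: fail loudly instead of silently trimming the list of instances.
scaleSet.sizeMutex.Lock()
proposedSize := scaleSet.curSize - int64(len(instanceIDs))
scaleSet.sizeMutex.Unlock()
if proposedSize < int64(scaleSet.MinSize()) {
	return fmt.Errorf("deleting %d instances would put scale set %q below its min size %d",
		len(instanceIDs), scaleSet.Name, scaleSet.MinSize())
}
```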

comtalyst commented 1 week ago

Although, in the more recent version of the code, we now invalidate the instance cache whenever waitForDeleteInstances() ends with an error. If that cache invalidation eventually restores consistency, maybe this bug is already fixed?

https://github.com/kubernetes/autoscaler/blob/213a8595ea2bddf433dd56e50c31ca868ef1da80/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L565

Starblade42 commented 1 week ago

I recall seeing the instance cache invalidation being called, but the issue still occurred: most of the VMSS was deleted despite invalidateInstanceCache being called in a recent CSO.

This particular log line was encountered multiple times before the VMSS was ultimately asked to "scale up" to a number smaller than the current scale set size:

E1003 22:15:28.247174       1 azure_scale_set.go:200] Failed to update the capacity for vmss SCALESET_NAME_ELIDED with error Code="CannotAllocateRemainingVMsInAvailabilitySet" Message="Delete/Deallocate operation on VM 'SCALESET_NAME_ELIDED_660' failed because the remaining VMs in the Availability Set '_SCALESET_NAME_ELIDED_SUBSCRIPTION_ID_ELIDED' cannot be allocated together. Changes in Availability Set allocation need to be executed atomically. Please deallocate or delete some or all of these VMs before retrying the current operation. \r\nDetails: Allocation failed. If you are trying to add a new VM to a Virtual Machine Scale Set with a single placement group or update/resize an existing VM in a Virtual Machine Scale Set with a single placement group, please note that such allocation is scoped to a single cluster, and it is possible that the cluster is out of capacity. Please read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance." Target="660", invalidate the cache so as to get the real size from API

comtalyst commented 6 days ago

If you could reproduce it, ideally with the latest version of the code, could the logs be captured? One useful thing to know is the timing; for example, if cache invalidation is useful at all, waitForDeleteInstances() getting an error and triggering invalidation might come too late.

In addition, I suspect that cache invalidation might indeed have an issue, as it refreshes the scale set (instance) cache in azure_scale_set.go, but not the Azure cache in azure_cache.go. The former apparently relies on the latter.

azure_scale_set.go:

func (scaleSet *ScaleSet) getVMSSFromCache() (compute.VirtualMachineScaleSet, error) {
    allVMSS := scaleSet.manager.azureCache.getScaleSets()

    if _, exists := allVMSS[scaleSet.Name]; !exists {
        return compute.VirtualMachineScaleSet{}, fmt.Errorf("could not find vmss: %s", scaleSet.Name)
    }

    return allVMSS[scaleSet.Name], nil
}

func (scaleSet *ScaleSet) getCurSize() (int64, error) {
    scaleSet.sizeMutex.Lock()
    defer scaleSet.sizeMutex.Unlock()

    set, err := scaleSet.getVMSSFromCache()

    ...

    // after refresh
    vmssSizeMutex.Lock()
    curSize := *set.Sku.Capacity
    vmssSizeMutex.Unlock()

    ...
}

azure_cache.go:

func (m *azureCache) getScaleSets() map[string]compute.VirtualMachineScaleSet {
    m.mutex.Lock()
    defer m.mutex.Unlock()

    return m.scaleSets
}

// this one wouldn't get called by the earlier refresh
func (m *azureCache) fetchAzureResources() error {
    m.mutex.Lock()
    defer m.mutex.Unlock()

    vmssResult, err := m.fetchScaleSets()
    if err != nil {
        return err
    }
    m.scaleSets = vmssResult

    ...
}

If that's the case, I think it might be simpler to fix the incomplete cache refresh instead (e.g., invalidate the Azure cache in azure_cache.go in addition to the instance cache), at least for now. Something else might be relying on the behavior this proposal would change (e.g., some calculations might expect or benefit from the node count being decremented quickly). It will take some time to validate that.
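
A rough sketch of what that fuller invalidation could look like; the wrapper name is made up, while invalidateInstanceCache and fetchAzureResources are the pieces already referenced in this thread:

```go
// Hypothetical wrapper: drop both layers so that the next getCurSize() reads
// a fresh Sku.Capacity instead of a stale azureCache entry.
func (scaleSet *ScaleSet) invalidateAllCaches() error {
	// Per-scale-set instance view (azure_scale_set.go).
	scaleSet.invalidateInstanceCache()
	// Shared Azure cache (azure_cache.go), which getVMSSFromCache() reads from.
	return scaleSet.manager.azureCache.fetchAzureResources()
}
```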