/area provider/azure /area cluster-autoscaler
cc @jackfrancis @comtalyst PTAL!
@evanfoster what verbosity level are the cluster-autoscaler logs set to?
Is there a reason you're using the "azure" cluster-autoscaler provider and not the "clusterapi" provider? Cluster API does support the "cluster.x-k8s.io/replicas-managed-by" MachinePool annotation, but a cleaner solution would be to have cluster-autoscaler communicate directly with Cluster API to scale machines.
Hey @jackfrancis,
I'll get back to you w.r.t. the logging level.
We're not using CAPI (or AKS) with this cluster. The cluster is entirely manually managed using some bespoke software and is built on top of plain Azure VMSSes and VMs. It's very old-school, so the Cluster Autoscaler has to talk directly to Azure's VMSS scaling API.
@evanfoster thanks for the clarification, that helps!
If we can get log output at a verbosity of at least 4, we will have more information about what pattern is causing this pathology.
Shoot, we're on level 3. We don't have a good way to repro this and the consequences of it are fairly catastrophic, so I'm hoping I can grab some other data from outside the CAS to pin things down a bit more. I'll also raise the log level to 4 just in case.
EDIT: We're using Helm to deploy the CAS. Will this suffice? extraArgs: {"logtostderr":true,"stderrthreshold":"info","v":4}
EDIT: Yep, v: "4" should do it. We're already explicitly setting v: "3".
I have an idea of what the issue might be.
The current VMSS size is proactively decremented here, regardless of whether or not the instance deletion was a success: https://github.com/kubernetes/autoscaler/blob/daeb681311f0d14688e499ba59e8c43ee089e7da/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L437-L442
On repeated failed deletion attempts, scaleSet.curSize is rapidly decremented until it reaches 0, or even goes negative in some cases. I wasn't able to find anything else that decrements curSize, so this seems to be a smoking gun.
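For context, the decrement in the linked code looks roughly like this (a paraphrase of the linked lines, not a verbatim quote):

// Paraphrase of the linked snippet: the in-memory size is decremented as soon as
// the delete call is scheduled, before Azure has confirmed that the deletion succeeded.
if !hasUnregisteredNodes {
    scaleSet.sizeMutex.Lock()
    scaleSet.curSize -= int64(len(instanceIDs))
    scaleSet.lastSizeRefresh = time.Now()
    scaleSet.sizeMutex.Unlock()
}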
I propose that instead of preemptively mutating curSize, we evaluate what the proposed state would look like and do the VMSS min size check before we schedule the VMSS instance deletion futures. That's a terrible description, so I'll write out what I think might solve the issue (with some major caveats that I've included as comments):
func (scaleSet *ScaleSet) waitForDeleteInstances(future *azure.Future, requiredIds *compute.VirtualMachineScaleSetVMInstanceRequiredIDs) {
    // Evan: ...snip...
    klog.V(3).Infof("Calling virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(%v) for %s", requiredIds.InstanceIds, scaleSet.Name)
    httpResponse, err := scaleSet.manager.azClient.virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(ctx, future, scaleSet.manager.config.ResourceGroup)
    isSuccess, err := isSuccessHTTPResponse(httpResponse, err)
    if isSuccess {
        scaleSet.sizeMutex.Lock()
        // Evan: Somehow get the instance count. I'm assuming the future doesn't contain it
        // since the synchronous method only returns an error. If that's the case, then we'll
        // have to pass the number of instances in from DeleteInstances. I'm not sure how this
        // would handle a partial failure, where only a subset of the instances were ultimately deletable.
        scaleSet.curSize -= instanceCount
        scaleSet.lastSizeRefresh = time.Now()
        scaleSet.sizeMutex.Unlock()
        klog.V(3).Infof("virtualMachineScaleSetsClient.WaitForDeleteInstancesResult(%v) for %s success", requiredIds.InstanceIds, scaleSet.Name)
        return
    }
    // Evan: ...snip...
func (scaleSet *ScaleSet) DeleteInstances(instances []*azureRef, hasUnregisteredNodes bool) error {
    // Evan: ...snip...
    instanceIDs := []string{}
    for _, instance := range instancesToDelete {
        instanceID, err := getLastSegment(instance.Name)
        if err != nil {
            klog.Errorf("getLastSegment failed with error: %v", err)
            return err
        }
        instanceIDs = append(instanceIDs, instanceID)
    }

    // Proactively check our size so that we don't
    // go below minimum node count if cache data is stale
    // only do it for non-unregistered nodes
    if !hasUnregisteredNodes {
        scaleSet.sizeMutex.Lock()
        proposedSize := scaleSet.curSize - int64(len(instanceIDs))
        if proposedSize <= int64(scaleSet.MinSize()) {
            // Evan: Using some heuristic, pick the nodes that make the most sense to not delete.
            // I lack the context to do that, so I'm just going to pop the delta off of the slice.
            sizeDelta := int64(scaleSet.MinSize()) - proposedSize
            klog.V(4).Infof("Deleting %d nodes would put VMSS %q below min size. Removing %d nodes from those to be deleted.", len(instanceIDs), commonAsg, sizeDelta)
            // Keep only enough deletions to land exactly on the min size.
            instanceIDs = instanceIDs[:int64(len(instanceIDs))-sizeDelta]
        }
        scaleSet.sizeMutex.Unlock()
    }

    requiredIds := &compute.VirtualMachineScaleSetVMInstanceRequiredIDs{
        InstanceIds: &instanceIDs,
    }

    scaleSet.instanceMutex.Lock()
    klog.V(3).Infof("Calling virtualMachineScaleSetsClient.DeleteInstancesAsync(%v)", requiredIds.InstanceIds)
    future, rerr := scaleSet.manager.azClient.virtualMachineScaleSetsClient.DeleteInstancesAsync(ctx, resourceGroup, commonAsg.Id(), *requiredIds, false)
    scaleSet.instanceMutex.Unlock()
    if rerr != nil {
        klog.Errorf("virtualMachineScaleSetsClient.DeleteInstancesAsync for instances %v failed: %v", requiredIds.InstanceIds, rerr)
        return rerr.Error()
    }
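For completeness, the count could be threaded into waitForDeleteInstances roughly like this (just a sketch; the goroutine launch and exact call site are assumptions on my part, since that part of the code is snipped above):

// Hypothetical: extend the signature so the deferred decrement knows how many
// instances were scheduled for deletion, and only apply it on success.
func (scaleSet *ScaleSet) waitForDeleteInstances(future *azure.Future, requiredIds *compute.VirtualMachineScaleSetVMInstanceRequiredIDs, instanceCount int64) {
    // ...as sketched above, decrement curSize by instanceCount only when
    // WaitForDeleteInstancesResult reports success...
}

// ...and in DeleteInstances, after DeleteInstancesAsync returns without error:
go scaleSet.waitForDeleteInstances(future, requiredIds, int64(len(instanceIDs)))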
The underlying cache doesn't seem to be the problem here. I really think the issue is just that we're decrementing curSize too opportunistically. Something else that makes me think this is the case is what this line logged: https://github.com/kubernetes/autoscaler/blob/daeb681311f0d14688e499ba59e8c43ee089e7da/cluster-autoscaler/cloudprovider/azure/azure_scale_set.go#L168
I1101 14:35:40.211471 1 azure_scale_set.go:177] VMSS: vmss-<elided>, in-memory size: -4, new size: 1
At 14:35, there really was only one instance running, based on the number of running kubelets in the cluster. The in-memory size (based on curSize) was impossibly wrong. Unless it's possible for https://pkg.go.dev/github.com/Azure/azure-sdk-for-go@v68.0.0+incompatible/services/compute/mgmt/2022-08-01/compute#Sku to return a negative number, I think we can rule out an underlying cache problem.
(As an aside, I think I see why you wanted the log level set to 4. I'm guessing it was to see what azureCache.regenerate logs.)
@evanfoster feel free to submit a PR! it is clear you have a better handle on this than anyone at this point 😄 💪
I may end up doing that, although I'll be on PTO next week and there are some areas I'm missing context around. Specifically, I have two concerns. The first is this:
// Evan: Using some heuristic, pick the nodes that make the most sense to not delete.
// I lack the context to do that, so I'm just going to pop the delta off of the slice.
I imagine there's a more optimal way of choosing which nodes to avoid deleting. Chopping the tail off like that is easy, but I worry that there may be a better way to choose nodes. I'm also not sure what effects the caller of ScaleSet.DeleteInstances will experience if we don't delete all the instances they asked us to. This isn't a huge deal, but I do worry about it.
Next, there's this:
// Evan: Somehow get the instance count. I'm assuming the future doesn't contain it
// since the synchronous method only returns an error. If that's the case, then we'll
// have to pass the number of instances in from DeleteInstances. I'm not sure how this
// would handle a partial failure, where only a subset of the instances were ultimately deletable.
Passing the instance count into ScaleSet.waitForDeleteInstances doesn't bother me much. My main concern is whether partial failures can happen, and what we do with ScaleSet.curSize in that case. If a partial failure can never happen, there should be no problem at all. If partial failures can occur, then we'd need to know how many deletions succeeded and how many failed. If we can't accurately determine that, we might have to force a cache refresh to get the true number from Azure, which is less than ideal (having experienced the fallout from API rate limiting before the caching mechanism was introduced). It's hard for me to work out exactly what the azure.Future really returns in this method. I'm decent enough with Go, but I wouldn't consider myself an expert by any means. Is there a way to get the details of what succeeded versus what failed from this method? Or is forcing a value re-read via a cache refresh the best way to go?
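If the per-instance outcome really can't be recovered from the azure.Future, one conservative fallback (just a sketch, assuming some way to invalidate the cached size such as the invalidateInstanceCache helper discussed below) would be to skip the decrement entirely and force a refresh on anything other than a clean success:

    // In waitForDeleteInstances, after isSuccessHTTPResponse(...):
    if isSuccess {
        scaleSet.sizeMutex.Lock()
        scaleSet.curSize -= instanceCount
        scaleSet.lastSizeRefresh = time.Now()
        scaleSet.sizeMutex.Unlock()
        return
    }
    // On any failure (including a possible partial failure), don't guess:
    // invalidate the cached view so the next size read goes back through the cache/API path.
    klog.Errorf("WaitForDeleteInstancesResult(%v) for %s failed: %v", requiredIds.InstanceIds, scaleSet.Name, err)
    scaleSet.invalidateInstanceCache()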
@willie-yao would you like to pick this up next week? We'll want to verify that this is OK for the AKS implementations as well.
cc @comtalyst @tallaxes
@jackfrancis Sure!
/assign
DeleteInstances() is only called by DeleteNodes(), shown below. DeleteNodes() also checks the min size for the group and seems to reject immediately if the group would go too low.
// GetScaleSetSize gets Scale Set size.
func (scaleSet *ScaleSet) GetScaleSetSize() (int64, error) {
    return scaleSet.getCurSize()
}

// DeleteNodes deletes the nodes from the group.
func (scaleSet *ScaleSet) DeleteNodes(nodes []*apiv1.Node) error {
    klog.V(8).Infof("Delete nodes requested: %q\n", nodes)

    size, err := scaleSet.GetScaleSetSize()
    if err != nil {
        return err
    }

    if int(size) <= scaleSet.MinSize() {
        return fmt.Errorf("min size reached, nodes will not be deleted")
    }

    ...

    return scaleSet.DeleteInstances(refs, hasUnregisteredNodes)
}
For @evanfoster's concern about the heuristic, I don't think we need to check min size again---unless there are race condition concerns, in which case we could use the same rejection logic? I don't think a provider-level optimization for this is needed or safe (e.g., if we cannot "delete all the instances they asked us to", we would rather return an error to make that clear), since the decision is made by the core, which has already taken min size into account.
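For example, if a stale-cache race is the concern, the provider could reuse the same rejection style as DeleteNodes() rather than silently trimming the list (a sketch only, reusing names from the snippets above):

    // Hypothetical alternative: bail out instead of trimming instanceIDs, so the
    // core scale-down logic sees an explicit error rather than a partial deletion.
    if !hasUnregisteredNodes {
        scaleSet.sizeMutex.Lock()
        proposedSize := scaleSet.curSize - int64(len(instanceIDs))
        scaleSet.sizeMutex.Unlock()
        if proposedSize < int64(scaleSet.MinSize()) {
            return fmt.Errorf("deleting %d instances would put vmss %q below min size %d", len(instanceIDs), scaleSet.Name, scaleSet.MinSize())
        }
    }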
Although, in the more recent version of the code, we now invalidate the instance cache whenever waitForDeleteInstances() ends with an error. If that cache invalidation eventually restores consistency, maybe this bug is already fixed?
I recall seeing invalidateInstanceCache being called in a recent CSO, but the issue still occurred and most of the VMSS was deleted despite the cache invalidation.
This particular log line was encountered multiple times before the VMSS was ultimately asked to "scale up" to a number smaller than the current scale set size:
E1003 22:15:28.247174 1 azure_scale_set.go:200] Failed to update the capacity for vmss SCALESET_NAME_ELIDED with error Code="CannotAllocateRemainingVMsInAvailabilitySet" Message="Delete/Deallocate operation on VM 'SCALESET_NAME_ELIDED_660' failed because the remaining VMs in the Availability Set '_SCALESET_NAME_ELIDED_SUBSCRIPTION_ID_ELIDED' cannot be allocated together. Changes in Availability Set allocation need to be executed atomically. Please deallocate or delete some or all of these VMs before retrying the current operation. \r\nDetails: Allocation failed. If you are trying to add a new VM to a Virtual Machine Scale Set with a single placement group or update/resize an existing VM in a Virtual Machine Scale Set with a single placement group, please note that such allocation is scoped to a single cluster, and it is possible that the cluster is out of capacity. Please read more about improving likelihood of allocation success at http://aka.ms/allocation-guidance." Target="660", invalidate the cache so as to get the real size from API
If you could reproduce it, ideally with the latest version of the code, could the logs be captured? One useful thing to know is the timing; for example, if the cache invalidation helps at all, waitForDeleteInstances() getting an error and triggering the invalidation might simply come too late.
In addition, I suspect that the cache invalidation might indeed have an issue, as it refreshes the scale set (instance) cache in azure_scale_set.go, but not the Azure cache in azure_cache.go. The former apparently relies on the latter.
func (scaleSet *ScaleSet) getVMSSFromCache() (compute.VirtualMachineScaleSet, error) {
    allVMSS := scaleSet.manager.azureCache.getScaleSets()

    if _, exists := allVMSS[scaleSet.Name]; !exists {
        return compute.VirtualMachineScaleSet{}, fmt.Errorf("could not find vmss: %s", scaleSet.Name)
    }

    return allVMSS[scaleSet.Name], nil
}

func (scaleSet *ScaleSet) getCurSize() (int64, error) {
    scaleSet.sizeMutex.Lock()
    defer scaleSet.sizeMutex.Unlock()

    set, err := scaleSet.getVMSSFromCache()
    ...
    // after refresh
    vmssSizeMutex.Lock()
    curSize := *set.Sku.Capacity
    vmssSizeMutex.Unlock()
    ...
}

func (m *azureCache) getScaleSets() map[string]compute.VirtualMachineScaleSet {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    return m.scaleSets
}

// this one wouldn't get called by the earlier refresh
func (m *azureCache) fetchAzureResources() error {
    m.mutex.Lock()
    defer m.mutex.Unlock()

    vmssResult, err := m.fetchScaleSets()
    if err != nil {
        return err
    }
    m.scaleSets = vmssResult
    ...
}
If that's the case, I think it might be simpler, at least for now, to fix the cache not being refreshed completely (e.g., invalidate the Azure cache in addition to the instance cache). Something else might be relying on the behavior that this proposal would change (e.g., some calculations might expect or benefit from the node count being deducted quickly). It will take some time to validate that.
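As a rough illustration of that direction (a sketch only; invalidateInstanceCache and azureCache.regenerate are taken from the earlier mentions, and their exact signatures are assumed):

// Hypothetical: on a failed delete, refresh both layers so that getCurSize()
// stops serving a stale Sku.Capacity out of azure_cache.go.
func (scaleSet *ScaleSet) invalidateAllCaches() {
    // existing behavior: mark the per-scale-set instance view as stale
    scaleSet.invalidateInstanceCache()
    // additionally regenerate the shared Azure cache that getVMSSFromCache() reads from
    if err := scaleSet.manager.azureCache.regenerate(); err != nil {
        klog.Errorf("failed to regenerate Azure cache: %v", err)
    }
}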
Which component are you using?: cluster-autoscaler
What version of the component are you using?:
Component version: v1.28.6
What k8s version are you using (kubectl version)?: The kubectl version output may not be accurate, since our data is from a month ago. Instead, I've pulled the cluster version from the logs: v1.29.6
What environment is this in?: Azure, manually running a k8s cluster on top of VMSSes (not AKS).
What did you expect to happen?: We expected the cluster autoscaler to scale a VMSS up to a size larger than it currently was.
What happened instead?: The cluster autoscaler mistakenly resized a VMSS from 86 nodes down to 53, based on an erroneous in-memory value showing that the VMSS had 48 nodes.
How to reproduce it (as minimally and precisely as possible): Reproing this has been really difficult. We're not sure exactly what causes this to occur. Each time we've seen it happen, it has been preceded by errors from the Azure API. Specifically, the error rate for vmss_wait_for_delete_instances_result calls increases to 100%, with the CAS reporting node deletion timeouts along with other errors.
Anything else we need to know?:
We saw some errors like this prior to the scale-down event:
The CAS really wanted to delete this instance (along with 3-4 others). It kept trying to delete this instance over and over and over again.
We also saw the in-memory node count for the VMSS steadily decrease, and this decrease seems to be correlated with the reduction:
Here are the relevant logs bracketing that last line:
I1003 22:39:43.260997 1 azure_scale_set.go:149] VMSS: vmss-
At 22:40, the CAS logged this:
At the time, the VMSS in question had 86 healthy instances, not 48. The graph below shows the CAS Ready/Unregistered stats versus the number of nodes based on unique running kubelets. Five seconds after the CAS scaled up the VMSS from its supposed size of 48 to 53, the VMSS began shrinking from 86 down to 53.
After the scale-down finished, we started seeing occurrences of logs like this:
It really seems like the CAS is losing track of the instance count it's getting from Azure.