microsoft / onefuzz

A self-hosted Fuzzing-As-A-Service platform
MIT License

Virtual Machine Scale Set has reached its limit of 10 models that may be referenced by one or more VMs belonging to the Virtual Machine Scale Set. #2318

Open nharper285 opened 2 years ago

nharper285 commented 2 years ago

We have an issue that has come up several times with large scalesets (>200 nodes) that have been running for a long time.

Exception while executing function: Functions.timer_workers Result: Failure
Exception: HttpResponseError: (BadRequest) Virtual Machine Scale Set '90d75d56-754b-4336-86e0-fa911580a524' has reached its limit of 10 models that may be referenced by one or more VMs belonging to the Virtual Machine Scale Set. Upgrade the VMs to the latest model of the Virtual Machine Scale Set before trying again.
Code: BadRequest
Message: Virtual Machine Scale Set '90d75d56-754b-4336-86e0-fa911580a524' has reached its limit of 10 models that may be referenced by one or more VMs belonging to the Virtual Machine Scale Set. Upgrade the VMs to the latest model of the Virtual Machine Scale Set before trying again.
Stack:   File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 406, in _handle__invocation_request
    call_result = await self._loop.run_in_executor(
  File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/dispatcher.py", line 648, in _run_sync_func
    return ExtensionManager.get_sync_invocation_wrapper(context,
  File "/azure-functions-host/workers/python/3.8/LINUX/X64/azure_functions_worker/extension.py", line 215, in _raw_invocation_wrapper
    result = function(**args)
  File "/home/site/wwwroot/timer_workers/__init__.py", line 63, in main
    process_scaleset(scaleset)
  File "/home/site/wwwroot/timer_workers/__init__.py", line 21, in process_scaleset
    scaleset.update_configs()
  File "/home/site/wwwroot/onefuzzlib/workers/scalesets.py", line 901, in update_configs
    update_extensions(self.scaleset_id, extensions)
  File "/home/site/wwwroot/onefuzzlib/azure/creds.py", line 245, in decorated
    return func(*args, **kwargs)
  File "/home/site/wwwroot/onefuzzlib/azure/vmss.py", line 305, in update_extensions
    compute_client.virtual_machine_scale_sets.begin_update(
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machine_scale_sets_operations.py", line 1238, in begin_update
    raw_result = self._update_initial(
  File "/home/site/wwwroot/.python_packages/lib/site-packages/azure/mgmt/compute/v2021_07_01/operations/_virtual_machine_scale_sets_operations.py", line 1187, in _update_initial
    raise HttpResponseError(response=response, error_format=ARMErrorFormat)

In short, there are several models in use across the scaleset. When timer_workers tries to sync/upgrade/update the scaleset, it hits this exception and fails. In the past (before autoscaling), this would happen because VMSS nodes updated asynchronously over time after finishing work. Now nodes are spun up when new work is taken on and destroyed afterward, but this doesn't solve the issue either: if ten different jobs start over the course of a week, and we've had ten separate version updates to a single VMSS, then the set of nodes spun up for each job will reference a different model.

Ultimately, the main problem with this exception is that it prevents timer_workers from running to completion, and thus blocks other scaleset node updates/upgrades/etc.

When we've encountered this in the past, we've had to kill the scaleset entirely. We thought autoscaling would avoid the issue (nodes would spin up and down often enough that we'd never have more than 10 models in use at once), but it obviously hasn't.
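
The error message's suggested remediation ("Upgrade the VMs to the latest model") can also be applied per scaleset instead of deleting it. A minimal sketch, assuming the C# service's Azure.ResourceManager.Compute client and the _context helpers used in the snippets below (UpgradeStaleInstances is a hypothetical name):

    // Sketch only: force instances that are behind onto the scaleset's latest
    // model via the "manual upgrade" ARM operation, which is what the
    // BadRequest message asks for.
    private async Async.Task UpgradeStaleInstances(Guid scalesetId) {
        var client = _context.Creds.ArmClient;
        var id = VirtualMachineScaleSetResource.CreateResourceIdentifier(
            _context.Creds.GetSubscription(),
            _context.Creds.GetBaseResourceGroup(),
            scalesetId.ToString());
        var vmss = client.GetVirtualMachineScaleSetResource(id);

        // Collect the instance IDs of VMs not on the latest model.
        var staleIds = new List<string>();
        await foreach (var vm in vmss.GetVirtualMachineScaleSetVms().GetAllAsync()) {
            if (vm.Data.LatestModelApplied == false) {
                staleIds.Add(vm.Data.InstanceId);
            }
        }

        if (staleIds.Count > 0) {
            await vmss.UpdateInstancesAsync(
                WaitUntil.Completed,
                new VirtualMachineScaleSetVmInstanceRequiredIds(staleIds));
        }
    }

Note that upgrading a running instance may restart it, so this would need the same coordination as a reimage (only upgrade nodes that are not mid-task).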

Marc and I have tried looking into how we could best track the number of models, but there isn't a config value that tracks this information for VMs or VMSS.

Solutions:

Porges commented 2 years ago

The VMSS VM Resource in ARM does have a LatestModelApplied value. We could treat this the same as our internal ReimageRequested property and perform a reimage at check time in CanProcessNewWork, which would bring the node in line with the latest model.

e.g. the check would become:

        if (node.ReimageRequested || !await IsOnLatestModel(node)) {
            // Stopping with done: true causes the node to be reimaged, which
            // also brings it onto the latest scaleset model.
            _logTracer.Info($"can_process_new_work: reimage requested or node not on latest model. machine_id:{node.MachineId}");
            await Stop(node, done: true);
            return false;
        }

Where IsOnLatestModel is something like:

    private async Async.Task<bool> IsOnLatestModel(Node node) {
        // Non-scaleset nodes have no VMSS model to be behind on.
        if (node.ScalesetId is null) {
            return true;
        }

        var client = _context.Creds.ArmClient;
        var id = VirtualMachineScaleSetVmResource.CreateResourceIdentifier(
            _context.Creds.GetSubscription(),
            _context.Creds.GetBaseResourceGroup(),
            node.ScalesetId.ToString(),
            node.MachineId.ToString() /* TODO: this is wrong, machine ID not Instance ID */);

        var response = await client.GetVirtualMachineScaleSetVmResource(id).GetAsync();
        // LatestModelApplied is nullable; treat "unknown" as being up to date.
        return response.Value.Data.LatestModelApplied ?? true;
    }
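
For the TODO above: onefuzz machine IDs are (I believe) the VM's vmId GUID rather than the VMSS instance ID, so the lookup would first have to map one to the other by listing the scaleset's instances. A rough sketch, with GetInstanceId as a hypothetical helper name:

    // Sketch only: map an onefuzz machine ID (assumed to be the VM's vmId
    // GUID) to the VMSS instance ID that the resource path above expects.
    private async Async.Task<string?> GetInstanceId(Guid scalesetId, Guid machineId) {
        var client = _context.Creds.ArmClient;
        var id = VirtualMachineScaleSetResource.CreateResourceIdentifier(
            _context.Creds.GetSubscription(),
            _context.Creds.GetBaseResourceGroup(),
            scalesetId.ToString());

        var vms = client.GetVirtualMachineScaleSetResource(id).GetVirtualMachineScaleSetVms();
        await foreach (var vm in vms.GetAllAsync()) {
            // vm.Data.VmId is the unique VM GUID; vm.Data.InstanceId is the
            // small index used in VMSS VM resource paths.
            if (Guid.TryParse(vm.Data.VmId, out var vmId) && vmId == machineId) {
                return vm.Data.InstanceId;
            }
        }
        return null;
    }

Listing every instance on each check is O(n) per node, so in practice the mapping would want to be cached.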

It would also be good to identify what we are actually changing in the model that is causing new models to be created; there is more documentation here: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set
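
One way to spot that would be to log the extension profile on the current scaleset model each time timer_workers updates it, and diff successive runs. A diagnostic sketch (untested; exact Azure.ResourceManager.Compute property names vary a bit across SDK versions):

    // Sketch only: dump the model's extension profile so consecutive
    // timer_workers runs can be diffed to see what keeps changing.
    private async Async.Task LogModelExtensions(Guid scalesetId) {
        var client = _context.Creds.ArmClient;
        var id = VirtualMachineScaleSetResource.CreateResourceIdentifier(
            _context.Creds.GetSubscription(),
            _context.Creds.GetBaseResourceGroup(),
            scalesetId.ToString());

        var vmss = await client.GetVirtualMachineScaleSetResource(id).GetAsync();
        var extensions = vmss.Value.Data.VirtualMachineProfile?.ExtensionProfile?.Extensions;
        if (extensions is null) {
            return;
        }

        foreach (var ext in extensions) {
            // Settings is the part most likely to differ between updates.
            _logTracer.Info($"scaleset {scalesetId} extension {ext.Name}: {ext.Settings}");
        }
    }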