nharper285 opened this issue 2 years ago
The VMSS VM resource in ARM does have a `LatestModelApplied` value. We could treat this the same as our internal `ReimageRequested` property and perform a reimage at check time in `CanProcessNewWork`, which would bring the node in line with the latest model.

e.g. the check would become:
```csharp
if (node.ReimageRequested || !await IsOnLatestModel(node)) {
    _logTracer.Info($"can_process_new_work is set to be reimaged. machine_id:{node.MachineId}");
    await Stop(node, done: true);
    return false;
}
```
Where `IsOnLatestModel` is something like:
```csharp
private async Async.Task<bool> IsOnLatestModel(Node node) {
    if (node.ScalesetId is null) {
        return true;
    }

    var client = _context.Creds.ArmClient;
    var id = VirtualMachineScaleSetVmResource.CreateResourceIdentifier(
        _context.Creds.GetSubscription(),
        _context.Creds.GetBaseResourceGroup(),
        node.ScalesetId.ToString(),
        node.MachineId.ToString() /* TODO: this is wrong, machine ID not Instance ID */);

    var response = await client.GetVirtualMachineScaleSetVmResource(id).GetAsync();
    return response.Value.Data.LatestModelApplied ?? true;
}
```
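On the TODO above: one way to bridge the machine-ID/instance-ID gap might be to enumerate the scaleset's instances and match on the VM's unique ID. This is only a sketch under assumptions, not existing OneFuzz code; the helper name, the `VmId`-equals-machine-ID assumption, and the reuse of the `_context.Creds` helpers are all hypothetical.

```csharp
// Hypothetical sketch: resolve a node's VMSS instance ID by matching the
// node's machine ID against each instance's unique VM ID (a GUID string).
// Assumes machine ID == VmId, which should be verified against how OneFuzz
// records machine IDs.
private async Async.Task<string?> TryGetInstanceId(Node node) {
    if (node.ScalesetId is null) {
        return null;
    }

    var scalesetId = VirtualMachineScaleSetResource.CreateResourceIdentifier(
        _context.Creds.GetSubscription(),
        _context.Creds.GetBaseResourceGroup(),
        node.ScalesetId.ToString());

    var scaleset = _context.Creds.ArmClient.GetVirtualMachineScaleSetResource(scalesetId);

    // Azure.ResourceManager collections are async-enumerable, so this pages
    // through all instances in the scaleset.
    await foreach (var vm in scaleset.GetVirtualMachineScaleSetVms()) {
        if (string.Equals(vm.Data.VmId, node.MachineId.ToString(),
                          StringComparison.OrdinalIgnoreCase)) {
            return vm.Data.InstanceId;
        }
    }

    return null;
}
```

Listing every instance per check is O(n) in scaleset size, so the result would probably want caching for large scalesets.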
It would also be good to identify what we are actually changing in the model that is causing new model versions to be created. More documentation here: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set
We have an issue that has come up several times with large scalesets (>200 nodes) that have been running for a long time.
In short, there are several models running across the scaleset. When `timer_workers` tries to sync/upgrade/update scalesets, it hits this exception and fails. In the past (before autoscaling), this would happen as VMSS nodes updated asynchronously over time after finishing work. Now nodes are spun up when new work is taken on and destroyed afterward, but that doesn't solve the issue either: if ten different jobs start over the course of a week, and we've had ten separate version updates to a single VMSS, then the set of nodes spun up for each job will be on a different model version.
Ultimately, the main issue with this exception is that it prevents `timer_workers` from running at all, and thus prevents other scaleset node updates/upgrades/etc.
When we've encountered this in the past, we've had to kill the scaleset entirely. We thought autoscaling would properly avoid the issue (nodes would spin up and down enough and never have more than 10 models running), but it obviously hasn't.
Marc and I have tried looking into how we could best track the number of models, but there isn't a config value that tracks this information for VMs or VMSS.
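Since ARM doesn't expose a count of distinct models, the closest observable signal may be how many instances report `LatestModelApplied == false`. A hypothetical diagnostic sketch (not existing OneFuzz code; the method name and `Creds` helpers are assumptions carried over from the snippet above):

```csharp
// Hypothetical sketch: count instances in a scaleset that are not on the
// latest model. Note this measures staleness, not the number of distinct
// models, which ARM does not expose.
private async Async.Task<int> CountStaleInstances(Guid scalesetId) {
    var id = VirtualMachineScaleSetResource.CreateResourceIdentifier(
        _context.Creds.GetSubscription(),
        _context.Creds.GetBaseResourceGroup(),
        scalesetId.ToString());

    var scaleset = _context.Creds.ArmClient.GetVirtualMachineScaleSetResource(id);

    var stale = 0;
    await foreach (var vm in scaleset.GetVirtualMachineScaleSetVms()) {
        if (vm.Data.LatestModelApplied == false) {
            stale++;
        }
    }
    return stale;
}
```

Logging this from `timer_workers` might at least make the drift visible before the sync call starts failing.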
Solutions: