nharper285 opened this issue 2 years ago
The VMSS VM resource in ARM does have a `LatestModelApplied` value. We could treat this the same as our internal `ReimageRequested` property and perform a reimage at check time in `CanProcessNewWork`, which would bring the node in line with the latest model.

e.g. the check would become:
```csharp
if (node.ReimageRequested || !await IsOnLatestModel(node)) {
    _logTracer.Info($"can_process_new_work is set to be reimaged. machine_id:{node.MachineId}");
    await Stop(node, done: true);
    return false;
}
```
Where `IsOnLatestModel` is something like:
```csharp
private async Async.Task<bool> IsOnLatestModel(Node node) {
    if (node.ScalesetId is null) {
        return true;
    }

    var client = _context.Creds.ArmClient;
    var id = VirtualMachineScaleSetVmResource.CreateResourceIdentifier(
        _context.Creds.GetSubscription(),
        _context.Creds.GetBaseResourceGroup(),
        node.ScalesetId.ToString(),
        node.MachineId.ToString() /* TODO: this is wrong, machine ID not Instance ID */);

    var response = await client.GetVirtualMachineScaleSetVmResource(id).GetAsync();
    return response.Value.Data.LatestModelApplied ?? true;
}
```
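On the TODO above: one way to bridge the machine-ID/instance-ID gap might be to enumerate the scaleset's instances and match on the VM's unique ID. This is only a sketch under assumptions, not existing OneFuzz code; the helper name, the `VmId`-equals-machine-ID assumption, and the reuse of the `_context.Creds` helpers are all hypothetical.

```csharp
// Hypothetical sketch: resolve a node's VMSS instance ID by matching the
// node's machine ID against each instance's unique VM ID (a GUID string).
// Assumes machine ID == VmId, which should be verified against how OneFuzz
// records machine IDs.
private async Async.Task<string?> TryGetInstanceId(Node node) {
    if (node.ScalesetId is null) {
        return null;
    }

    var scalesetId = VirtualMachineScaleSetResource.CreateResourceIdentifier(
        _context.Creds.GetSubscription(),
        _context.Creds.GetBaseResourceGroup(),
        node.ScalesetId.ToString());

    var scaleset = _context.Creds.ArmClient.GetVirtualMachineScaleSetResource(scalesetId);

    // Azure.ResourceManager collections are async-enumerable, so this pages
    // through all instances in the scaleset.
    await foreach (var vm in scaleset.GetVirtualMachineScaleSetVms()) {
        if (string.Equals(vm.Data.VmId, node.MachineId.ToString(),
                          StringComparison.OrdinalIgnoreCase)) {
            return vm.Data.InstanceId;
        }
    }

    return null;
}
```

Listing every instance per check is O(n) in scaleset size, so the result would probably want caching for large scalesets.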
It would also be good to identify what we are actually changing in the model that is causing new model versions to be created. More documentation here: https://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set
We have an issue that has come up several times with large scalesets (>200 nodes) that have been running for a long time.
In short, there are several models running across the scaleset. When `timer_workers` tries to sync/upgrade/update scalesets, it hits this exception and fails. In the past (before autoscaling), this would happen as VMSS nodes updated asynchronously over time after finishing work. Now nodes are spun up when new work is taken on and destroyed afterward, but that doesn't solve the issue either: if ten different jobs start over the course of a week, and we've had ten separate version updates to a single VMSS, then the set of nodes spun up for each job will be on a different model version.
Ultimately, the main issue with this exception is that it prevents `timer_workers` from running at all, and thus prevents other scaleset node updates/upgrades/etc.
When we've encountered this in the past, we've had to kill the scaleset entirely. We thought autoscaling would properly avoid the issue (nodes would spin up and down enough and never have more than 10 models running), but it obviously hasn't.
Marc and I have tried looking into how we could best track the number of models, but there isn't a config value that tracks this information for VMs or VMSS.
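Since ARM doesn't expose a count of distinct models, the closest observable signal may be how many instances report `LatestModelApplied == false`. A hypothetical diagnostic sketch (not existing OneFuzz code; the method name and `Creds` helpers are assumptions carried over from the snippet above):

```csharp
// Hypothetical sketch: count instances in a scaleset that are not on the
// latest model. Note this measures staleness, not the number of distinct
// models, which ARM does not expose.
private async Async.Task<int> CountStaleInstances(Guid scalesetId) {
    var id = VirtualMachineScaleSetResource.CreateResourceIdentifier(
        _context.Creds.GetSubscription(),
        _context.Creds.GetBaseResourceGroup(),
        scalesetId.ToString());

    var scaleset = _context.Creds.ArmClient.GetVirtualMachineScaleSetResource(id);

    var stale = 0;
    await foreach (var vm in scaleset.GetVirtualMachineScaleSetVms()) {
        if (vm.Data.LatestModelApplied == false) {
            stale++;
        }
    }
    return stale;
}
```

Logging this from `timer_workers` might at least make the drift visible before the sync call starts failing.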
Solutions: