Inconsistent VM state between TRE and Azure portal leaves running VMs that can't be managed via the TRE

Describe the bug

We updated our TRE to give users access to additional instance types. A user attempted to deploy an instance (still using the standard Ubuntu 18.04 image), and Azure reported a timeout for the deployment, which is also shown in the TRE portal. The error message is:

Code="OSProvisioningTimedOut" Message="OS Provisioning for VM 'linuxvm420a' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle."
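To illustrate why this looks like a soft failure rather than a hard one, here is a minimal sketch of how an error string of this shape could be parsed and classified. The helper name `parse_deployment_error` and the `SOFT_FAILURE_CODES` set are hypothetical and not part of Azure TRE; treating `OSProvisioningTimedOut` as "soft" is an assumption based on the message's own wording that the VM may still finish provisioning.

```python
import re

# Hypothetical helper, not part of Azure TRE or any Azure SDK.
# Matches the Code="..." Message="..." shape of the error quoted above.
ERROR_RE = re.compile(r'Code="(?P<code>[^"]+)"\s+Message="(?P<message>.*)"', re.DOTALL)
VM_RE = re.compile(r"VM '(?P<vm>[^']+)'")

# Assumption: codes after which the VM may still come up on its own.
SOFT_FAILURE_CODES = {"OSProvisioningTimedOut"}


def parse_deployment_error(text):
    """Extract the error code and VM name, and flag soft failures."""
    match = ERROR_RE.search(text)
    if match is None:
        return None
    vm = VM_RE.search(match.group("message"))
    code = match.group("code")
    return {
        "code": code,
        "vm_name": vm.group("vm") if vm else None,
        "soft_failure": code in SOFT_FAILURE_CODES,
    }


sample = (
    'Code="OSProvisioningTimedOut" '
    "Message=\"OS Provisioning for VM 'linuxvm420a' did not finish in the allotted time. "
    'The VM may still finish provisioning successfully."'
)
print(parse_deployment_error(sample))
```

A result flagged as a soft failure would be a signal to re-check the VM's state later instead of marking the resource as permanently failed.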

After that, every action attempted on this VM via the TRE portal fails: upgrade, delete, start, stop, and so on. The machine is no longer controllable via the TRE portal.

Despite that, the machine did eventually boot. Since it's a GPU-based instance and this happened on a Friday afternoon, it cost us over £200 running as an idle ghost. We only noticed because we have daily cost report emails on our subscription, so we saw it this morning.

Note that we have succeeded in booting these GPU instances with this Linux image, so the error is not a hard fail. It's a soft timeout, which leaves the TRE in an inconsistent state.
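The inconsistency described here could be detected by comparing what the TRE has recorded with what Azure reports. The following is only a sketch under stated assumptions: the `VmRecord` type, the `find_ghost_vms` function, and the `deployment_failed` status string are hypothetical, and the power state values are assumed to follow the display strings Azure uses, such as "VM running". How the two states are actually fetched is left out.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    name: str
    tre_status: str         # status recorded by the TRE (hypothetical values)
    azure_power_state: str  # power state reported by Azure, e.g. "VM running"


def find_ghost_vms(records):
    """Return names of VMs the TRE considers failed but Azure reports as running."""
    return [
        r.name
        for r in records
        if r.tre_status == "deployment_failed" and r.azure_power_state == "VM running"
    ]


# Illustrative data only; the second VM name is made up for the example.
records = [
    VmRecord("linuxvm420a", "deployment_failed", "VM running"),
    VmRecord("linuxvm0001", "deployed", "VM running"),
    VmRecord("linuxvm0002", "deployment_failed", "VM deallocated"),
]
print(find_ghost_vms(records))
```

Run periodically, a check like this would have surfaced the idle GPU instance without relying on the cost report.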

Steps to reproduce

Add GPU-based instances to your TRE
Boot one up
Wait for the deployment to either succeed or fail with the timeout.
Rinse and repeat until one fails.

Azure TRE release version (e.g. v0.14.0 or main):

Deployed Azure TRE components - click the (i) in the UI:
UI Version: 0.5.21
API Version: 0.18.5

microsoft/AzureTRE issue #4020