microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Inconsistent VM state between TRE and Azure portal leaves running VMs that can't be managed via the TRE #4020

Open TonyWildish-BH opened 4 days ago

TonyWildish-BH commented 4 days ago

Describe the bug

We updated our TRE to give users access to additional instance types. A user attempted to deploy an instance (still using the standard Ubuntu 18.04 image), and Azure reported a provisioning timeout, which is also reflected in the TRE portal. The error message is:

Code="OSProvisioningTimedOut" Message="OS Provisioning for VM 'linuxvm420a' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. For details on how to check current provisioning state of Windows VMs, refer to https://aka.ms/WindowsVMLifecycle and Linux VMs, refer to https://aka.ms/LinuxVMLifecycle."
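As an illustration of how this kind of error could be told apart from a hard failure, here is a minimal Python sketch that pulls the `Code` and `Message` fields out of an error string shaped like the one above. The function names and the set of "soft" codes are hypothetical and not part of the TRE; only `OSProvisioningTimedOut` comes from this report.

```python
import re

# Assumption for illustration: codes that mean "the VM may still come up".
# Only OSProvisioningTimedOut is taken from the error reported above.
SOFT_FAILURE_CODES = {"OSProvisioningTimedOut"}

# Matches key="value" pairs such as Code="..." and Message="...".
_FIELD = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_azure_error(text: str) -> dict:
    """Return the key="value" pairs found in an Azure-style error string."""
    return {key: value for key, value in _FIELD.findall(text)}


def is_soft_failure(text: str) -> bool:
    """True if the error code is one we treat as a soft (recoverable) failure."""
    return parse_azure_error(text).get("Code") in SOFT_FAILURE_CODES


if __name__ == "__main__":
    error = (
        'Code="OSProvisioningTimedOut" '
        "Message=\"OS Provisioning for VM 'linuxvm420a' did not finish "
        'in the allotted time."'
    )
    print(parse_azure_error(error)["Code"])
    print(is_soft_failure(error))
```

A check like this would only flag the case; whatever handles the failure would still need to look up the real VM state in Azure afterwards.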

After that, all attempted actions via the TRE portal fail for this VM - upgrade, delete, start, stop... The machine is no longer controllable via the TRE portal.

Despite that, the machine did finally boot, and since it's a GPU-based instance and this happened on a Friday afternoon, it cost us over £200 running as an idle ghost. We only noticed it because we have daily cost report emails on our subscription, so saw it this morning.

Note that we have succeeded in booting these GPU instances with this Linux image, so the error is not a hard fail. It's a soft timeout, which leaves the TRE in an inconsistent state.
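To make the inconsistency concrete, here is a rough Python sketch of the kind of reconciliation check that would have caught this VM: compare the status the TRE has recorded with the power state Azure reports, and flag anything the TRE considers failed but Azure is still running. The `VmRecord` type, the `deployment_failed` status string, and the function name are assumptions for illustration, not the TRE's actual data model. The records are assumed to have been collected already, for example from the TRE API and the Azure instance view.

```python
from dataclasses import dataclass

# Power state code as reported in an Azure VM instance view.
RUNNING_STATE = "PowerState/running"


@dataclass(frozen=True)
class VmRecord:
    name: str
    tre_status: str         # hypothetical status string recorded by the TRE
    azure_power_state: str  # e.g. "PowerState/running" or "PowerState/deallocated"


def find_ghost_vms(records, failed_statuses=frozenset({"deployment_failed"})):
    """Return names of VMs the TRE marks as failed but Azure reports as running."""
    return [
        r.name
        for r in records
        if r.tre_status in failed_statuses
        and r.azure_power_state == RUNNING_STATE
    ]


if __name__ == "__main__":
    records = [
        VmRecord("linuxvm420a", "deployment_failed", "PowerState/running"),
        VmRecord("linuxvm1111", "deployed", "PowerState/running"),
        VmRecord("linuxvm2222", "deployment_failed", "PowerState/deallocated"),
    ]
    print(find_ghost_vms(records))
```

Run on a schedule, a check like this could raise an alert or deallocate the VM rather than leaving it to show up in a cost report.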

Steps to reproduce

  1. Add GPU-based instances to your TRE.
  2. Boot one up.
  3. Wait to see whether the deployment fails with the timeout.
  4. If it succeeds, repeat until one fails.

Azure TRE release version (e.g. v0.14.0 or main):

Deployed Azure TRE components - click the (i) in the UI:

UI Version: 0.5.21
API Version: 0.18.5

marrobi commented 4 days ago

@TonyWildish-BH what image were you deploying onto the VM, and were there any bootstrap scripts? Thanks.