microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Unable to deploy a 'Compute Instance' User Resource to a Workspace AML Service #4151

Open dram1964 opened 2 days ago

dram1964 commented 2 days ago

Deployment of AML Compute Instance fails

When adding a Compute Instance to a TRE Workspace AML Service, the deployment fails with the following error: "desired number of dedicated nodes could not be allocated". This error has occurred consistently for the past two days. I had not tried this before then with this version of the TRE.

This error occurs when deploying via:

  1. the TRE UI, using the 'aml_compute' user-resource template, and
  2. the AML Workspace itself, by logging in from a workspace VM and trying to create a new compute instance

Steps to reproduce

  1. Create New Workspace and User
  2. Add User to 'Workspace Owners'
  3. Add User to 'Workspace Researchers'
  4. Log in to the TRE UI with the User account
  5. Add Virtual Desktops (Guacamole) Service to Workspace
  6. Add a User Resource (VM) to Virtual Desktops Service
  7. Add Azure ML Service to Workspace ('expose externally' = False)
  8. Add a Compute instance User Resource to AML Service

Additional Steps taken

  1. Grant User 'Network Contributor' on the TRE Workspace VNet
  2. Grant User 'AzureML Compute Operator' on the Workspace AML Workspace
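The two role grants above could be scripted along the following lines. This is a minimal sketch, not the reporter's actual procedure: it only builds the `az role assignment create` command lines so they can be reviewed before anything is run. The subscription ID, resource group, VNet name, AML workspace name, and assignee are hypothetical placeholders; the real values depend on the TRE deployment.

```python
from typing import List


def vnet_scope(subscription_id: str, resource_group: str, vnet_name: str) -> str:
    """Build the ARM resource ID of a virtual network."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.Network/virtualNetworks/{vnet_name}"
    )


def aml_workspace_scope(subscription_id: str, resource_group: str, workspace_name: str) -> str:
    """Build the ARM resource ID of an Azure Machine Learning workspace."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace_name}"
    )


def role_assignment_command(assignee: str, role: str, scope: str) -> List[str]:
    """Return the Azure CLI argument list for one role assignment (not executed here)."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", assignee,
        "--role", role,
        "--scope", scope,
    ]


if __name__ == "__main__":
    # Hypothetical placeholder values; substitute those of the actual deployment.
    sub = "00000000-0000-0000-0000-000000000000"
    rg = "example-workspace-rg"
    user = "researcher@example.com"

    commands = [
        role_assignment_command(user, "Network Contributor", vnet_scope(sub, rg, "example-vnet")),
        role_assignment_command(user, "AzureML Compute Operator", aml_workspace_scope(sub, rg, "example-aml")),
    ]
    for command in commands:
        # Print for review; pass the list to subprocess.run to actually apply it.
        print(" ".join(f'"{part}"' if " " in part else part for part in command))
```

Keeping each command as an argument list rather than a single string avoids quoting problems with role names that contain spaces.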

Additional Info

  Azure TRE release version: v0.19.1
  tre-workspace-base: 1.5.7
  tre-service-azureml: 0.8.11
  tre-user-resource-aml-compute-instance: 0.5.7
  deployment location: UKSouth

tim-allen-ck commented 2 days ago

Hi @dram1964, can you create an AML in the portal manually?

dram1964 commented 2 days ago

Hi @tim-allen-ck - logged in as the Global Admin for the tenant, I created an AML workspace with basic settings (public access) in the UK South region and added a compute, which completed in 5 minutes or so. My attempts via the TRE usually run for around 30 minutes before they report a failure.

I could try to repeat the exercise using an adjusted version of the terraform code from the AML workspace service if that would be useful. Should I use the same credentials as I have in the TRE code?

dram1964 commented 2 days ago

Interesting development - I decided to re-deploy the AML Service into a workspace, this time with 'expose externally' set to True. When I then tried to add a compute instance from the user resource template, it succeeded, and I can connect and run code on it.

tim-allen-ck commented 1 day ago

Could this be something to do with private endpoints within the VNet?
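One way to probe the private endpoint theory, as a rough sketch rather than a confirmed diagnosis: from a workspace VM, resolve the hostnames the AML workspace uses and check whether they map to private addresses inside the VNet. The script below assumes the hostnames are passed on the command line (none are given in this thread), and the `resolve_ipv4` and `is_private` helpers are illustrative names, not part of the TRE.

```python
import ipaddress
import socket
import sys
from typing import List


def resolve_ipv4(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IPv4 addresses a hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, 443, family=socket.AF_INET, proto=socket.IPPROTO_TCP
    )
    return sorted({info[4][0] for info in infos})


def is_private(address: str) -> bool:
    """True if the address lies in a private range such as 10.0.0.0/8."""
    return ipaddress.ip_address(address).is_private


if __name__ == "__main__":
    # Hostnames are supplied by the caller; none are hard-coded here.
    for hostname in sys.argv[1:]:
        try:
            addresses = resolve_ipv4(hostname)
        except socket.gaierror as error:
            print(f"{hostname}: could not resolve ({error})")
            continue
        for address in addresses:
            kind = "private" if is_private(address) else "public"
            print(f"{hostname}: {address} ({kind})")
```

If a hostname resolves to a public address, or fails to resolve, from inside the workspace VNet while 'expose externally' is False, that would be consistent with a private endpoint or private DNS problem, though it would not prove it.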