microsoft / AzureTRE

An accelerator to help organizations build Trusted Research Environments on Azure.
https://microsoft.github.io/AzureTRE
MIT License

Workspace Service Install Fails with `PutSubnetOperation` or `CanceledAndSupersededDueToAnotherOperation` #3177

Open marrobi opened 1 year ago

marrobi commented 1 year ago

When deploying the Databricks workspace service, I get:

2) Main step for ff51fffc-c2c1-4dfe-a88e-e70766f5bc3c
ff51fffc-c2c1-4dfe-a88e-e70766f5bc3c: Error message:
╷
│ Error: waiting for creation of Subnet: (Name "adb-host-subnet-mrtredemo24-ws-5740-svc-bc3c" / Virtual Network Name "vnet-mrtredemo24-ws-5740" / Resource Group "rg-mrtredemo24-ws-5740"): Code="Canceled" Message="Operation was canceled." Details=[{"code":"CanceledAndSupersededDueToAnotherOperation","message":"Operation PutSubnetOperation (a3d81bf2-68dd-4a93-93bb-3f3ad92059d9) was canceled and superseded by operation PutVirtualNetworkOperation (cc8eb20b-2242-4ae7-a1c2-1e74bbda5bfd)."}]
│
│   with azurerm_subnet.host,
│   on network.tf line 90, in resource "azurerm_subnet" "host":
│   90: resource "azurerm_subnet" "host" {
│
╵
error running command /cnab/app/terraform
marrobi commented 1 year ago

@guybartal seen this before?

guybartal commented 1 year ago

No, I haven't. It looks like it fails when creating the public (host) subnet. Maybe a transient error? Did you try to redeploy?

marrobi commented 8 months ago

Got this again here:

1f350b8a-736f-4ff8-9a5c-ca3bbc8c459a: Error message:
╷
│ Error: waiting for creation of Subnet: (Name "adb-host-subnet-mrtredemo28-ws-8044-svc-459a" / Virtual Network Name "vnet-mrtredemo28-ws-8044" / Resource Group "rg-mrtredemo28-ws-8044"): Code="Canceled" Message="Operation was canceled." Details=[{"code":"CanceledAndSupersededDueToAnotherOperation","message":"Operation PutSubnetOperation (f0dd77c7-05fd-4208-aa55-f62650568667) was canceled and superseded by operation PutVirtualNetworkOperation (b5f36438-7876-4e51-8e3a-36fc10f79daf)."}]
│
│   with azurerm_subnet.host,
│   on network.tf line 90, in resource "azurerm_subnet" "host":
│   90: resource "azurerm_subnet" "host" {
│
╵
╷
│ Error: Subnet: (Name "adb-container-subnet-mrtredemo28-ws-8044-svc-459a" / Virtual Network Name "vnet-mrtredemo28-ws-8044" / Resource Group "rg-mrtredemo28-ws-8044") was not found
│
│   with azurerm_subnet_network_security_group_association.container,
│   on network.tf line 147, in resource "azurerm_subnet_network_security_group_association" "container":
│   147: resource "azurerm_subnet_network_security_group_association" "container" {
│
╵
╷
│ Error: Subnet "adb-container-subnet-mrtredemo28-ws-8044-svc-459a" (Virtual Network "vnet-mrtredemo28-ws-8044" / Resource Group "rg-mrtredemo28-ws-8044") was not found!
│
│   with azurerm_subnet_route_table_association.rt_container,
│   on network.tf line 157, in resource "azurerm_subnet_route_table_association" "rt_container":
│   157: resource "azurerm_subnet_route_table_association" "rt_container" {
│
╵
error running command /cnab/app/terraform /usr/bin/terraform apply -auto-approve -input=false -var address_space=10.1.8.0/24 -var arm_environment=public -var is_exposed_externally=false -var tre_id=mrtredemo28 -var tre_resource_id=1f350b8a-736f-4ff8-9a5c-ca3bbc8c459a -var workspace_id=14d01527-62d1-4bad-99ad-37d602c08044: exit status 1
Error: error running command /cnab/app/terraform /usr/bin/terraform apply -auto-approve -input=false -var address_space=10.1.8.0/24 -var arm_environment=public -var is_exposed_externally=false -var tre_id=mrtredemo28 -var tre_resource_id=1f350b8a-736f-4ff8-9a5c-ca3bbc8c459a -var workspace_id=14d01527-62d1-4bad-99ad-37d602c08044: exit status 1
1 error occurred:
* mixin execution failed: package command failed

The issue seems to be related to multiple workspace services being deployed or updated in parallel, and/or multiple private endpoint or other network operations running in parallel within a single bundle.

marrobi commented 8 months ago

Another

Error: waiting for creation of Private Endpoint "pe-mlflow-mrtredemo28-ws-8044-svc-89f1" (Resource Group "rg-mrtredemo28-ws-8044"): Code="RetryableError" Message="A retryable error occurred." Details=[{"code":"ReferencedResourceNotProvisioned","message":"Cannot proceed with operation because resource /subscriptions/7f1036b4-4d01-43a0-9f4d-602f5151dc0f/resourceGroups/rg-mrtredemo28-ws-8044/providers/Microsoft.Network/virtualNetworks/vnet-mrtredemo28-ws-8044/subnets/ServicesSubnet used by resource /subscriptions/7f1036b4-4d01-43a0-9f4d-602f5151dc0f/resourceGroups/rg-mrtredemo28-ws-8044/providers/Microsoft.Network/networkInterfaces/pe-mlflow-mrtredemo28-ws-8044-svc-89f1.nic.b228d946-de36-46c2-81ee-1e6b06155123 is not in Succeeded state. Resource is in Updating state and the last operation that updated/is updating the resource is PutSubnetOperation."}]
marrobi commented 7 months ago

Ok, this is down to having two operations in progress on the virtual network at the same time. This can happen if one operation is adding an address space to a workspace while another is adding a subnet to the same virtual network.

We need to limit workspace and workspace service operations to one at a time for each workspace.
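
To illustrate the idea of one operation at a time per workspace, here is a minimal Python sketch. It is not taken from the AzureTRE codebase: `workspace_operation` and `run_serialized` are hypothetical names, and the lock is in-process only, so a real deployment with several worker processes would need a distributed equivalent (for example, a lease or a message-queue session keyed on the workspace ID).

```python
import threading
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-process registry: one lock per workspace ID.
_registry_guard = threading.Lock()
_workspace_locks = defaultdict(threading.Lock)


@contextmanager
def workspace_operation(workspace_id):
    """Hold the lock for a workspace while one operation runs against it."""
    # Guard the registry so two threads cannot create two locks for one ID.
    with _registry_guard:
        lock = _workspace_locks[workspace_id]
    with lock:
        yield


def run_serialized(workspace_id, operation):
    """Run `operation` (any zero-argument callable) once the workspace is free."""
    with workspace_operation(workspace_id):
        return operation()
```

Operations on different workspaces still run in parallel, since each workspace ID gets its own lock; only operations that share a workspace (and therefore a virtual network) queue behind each other.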

As user resources do not typically modify the network, I do not believe they are an issue.

Or should the TF provider wait if an operation is in progress?
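
As an alternative on the caller side, the deployment step could be retried when the failure looks like one of the conflicts above. The sketch below is only an assumption about how that might look, not how the Terraform provider or AzureTRE behaves: `apply_with_retry` and `is_retryable` are hypothetical helpers, `apply` stands for any callable that raises `RuntimeError` with the error text, and the retryable codes are the ones that appear in the logs in this thread.

```python
import time

# Error codes seen in the logs above that indicate a conflicting network operation.
RETRYABLE_CODES = (
    "CanceledAndSupersededDueToAnotherOperation",
    "ReferencedResourceNotProvisioned",
    "RetryableError",
)


def is_retryable(error_text):
    """Return True if the error text contains one of the known conflict codes."""
    return any(code in error_text for code in RETRYABLE_CODES)


def apply_with_retry(apply, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `apply`, retrying with exponential backoff on retryable errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            return apply()
        except RuntimeError as exc:
            # Give up on the last attempt or on errors we do not recognise.
            if attempt == max_attempts or not is_retryable(str(exc)):
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Retrying hides the conflict rather than preventing it, so it would likely complement, not replace, limiting operations to one per workspace.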