hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Virtual Hub Connection fails when created along with VNET and Subnets #19894

Open aurel333 opened 1 year ago

aurel333 commented 1 year ago

Terraform Version

1.1.9

AzureRM Provider Version

3.5.0

Affected Resource(s)/Data Source(s)

azurerm_virtual_hub_connection

Terraform Configuration Files

variable "vnet_name" {
  description = "Azure virtual network name"
  type        = string
}

variable "current_resource_group" {
  description = "Azure resource group name"
  type        = string
}

variable "vnet_cidrs" {
  description = "Azure virtual network cidrs"
  type        = list(string)
}

variable "subnet1_cidr" {
  description = "Control plane subnet cidr"
  type        = string
}

variable "gbl_infra_sub" {
  description = "Global infrastructure subscription for VWAN"
  type = object({
    subscribtion_id = string
  })
}

data "azurerm_resource_group" "current-rg" {
  name = var.current_resource_group
}

data "azurerm_virtual_hub" "virtual-hub-weu" {
  name                = "azrweuvwanhub0001"
  resource_group_name = "vwan"
  provider            = azurerm.vwan_global
}

resource "azurerm_virtual_network" "cluster-vnet" {
  name                = var.vnet_name
  resource_group_name = data.azurerm_resource_group.current-rg.name
  location            = data.azurerm_resource_group.current-rg.location
  address_space       = var.vnet_cidrs
}

resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = data.azurerm_resource_group.current-rg.name
  virtual_network_name = azurerm_virtual_network.cluster-vnet.name
  address_prefixes     = [var.subnet1_cidr]
}

resource "azurerm_virtual_hub_connection" "vhub-network-connection" {
  name                      = "testrg-to-vhub"
  virtual_hub_id            = var.vwan_hub_id
  remote_virtual_network_id = azurerm_virtual_network.cluster-vnet.id
}

Debug Output/Panic Output

https://gist.github.com/aurel333/11078594212827cf9047828eda22ef2d

The gist is very long; search for "PutHubVnetConnectionFailedInPutVnetPeering" to find the error, which occurs around line 3014.

Expected Behaviour

The VNET is created first, then the subnets and the Virtual Hub Connection are created in any order but not at the same time, and all of them are added correctly to the state.

Actual Behaviour

The VNET and the subnets are created correctly, but the Virtual Hub Connection ends up in a failed state in Azure and is NOT added to the Terraform state. The error output looks like a C# stack trace and is not easy to understand.

Steps to Reproduce

I managed to reproduce the issue almost reliably by deleting the resources and immediately recreating them. The commands to run are:

  1. terraform apply [-var-file ]
  2. terraform destroy [-var-file ]
  3. terraform apply [-var-file ]

Please note that the problem does not always appear, so it may depend on how quickly the Azure backend completes the operations.

Important Factoids

A ticket was first opened with Azure Support. Adding "depends_on" to make the Virtual Hub Connection resource depend on the subnet and the VNET is a reliable workaround. However, this is not a custom dependency, so it should not be required.

References

No response

neil-yechenwei commented 1 year ago

Thanks for raising this issue. I assume the error is expected. You only set the dependency between the vnet and the vhub connection, but you didn't explicitly set a dependency between the subnet and the vhub connection. The vhub connection therefore failed to create because the subnet was still being created at the same time. Hence, I suggest adding "depends_on = [azurerm_subnet.subnet1]" to the vhub connection.
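
Applied to the configuration from the report, the suggestion could look roughly like the sketch below. The resource and variable names are the ones from the report; only the depends_on line is new, and the rest of the configuration is assumed unchanged.

```hcl
resource "azurerm_virtual_hub_connection" "vhub-network-connection" {
  name                      = "testrg-to-vhub"
  virtual_hub_id            = var.vwan_hub_id
  remote_virtual_network_id = azurerm_virtual_network.cluster-vnet.id

  # Explicit dependency: Terraform waits for the subnet to be created
  # before it starts creating the hub connection.
  depends_on = [azurerm_subnet.subnet1]
}
```

With several subnets on the same VNET, each of them would presumably need to be listed in depends_on.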

aurel333 commented 1 year ago

Hello, thank you for your quick answer. I am raising this issue mainly because I think adding depends_on = [<subnets>] should not be required: it is not a custom dependency, nor does it seem hidden, since it should be possible to create an azurerm_virtual_hub_connection at the same time as a subnet. If I am wrong, can you please tell me which types of dependency Terraform typically can or cannot handle? That will help me avoid problems in the future.

aurel333 commented 1 year ago

Hello, I have dug a bit more into how dependencies are handled and you are right: it would not be good to make an azurerm_virtual_hub_connection automatically dependent on the azurerm_subnet, so a depends_on is necessary for that.

However, there is still a problem, because creating two subnets or attaching two vnet peerings to the same vnet works fine. I saw in the code that these situations are handled by locking the vnet for the duration of the operation, and I also see that the same locking mechanism has been implemented for the azurerm_virtual_hub_connection resource (here).

There is also issue #12998 (which I had not seen before) about the same problem, so the error I am seeing should have been fixed in a previous provider version. Do you have any idea what could prevent the lock from working as expected?

aurel333 commented 1 year ago

I have done several more tests with a modified version of the provider based on ee4d44ac0133709272b268337c0f673d999c46b5 to have more targeted logs.

The logs confirmed that the locking process is what is causing the issue: two different mutexes lock the same thing, and the second one is acquired before the first one is released:

...
2023-01-23T16:22:44.024+0100 [DEBUG]: Locked "azurerm_virtual_network.vnet" with mutex 0xc00024cfa0: timestamp=2023-01-23T16:22:44.024+0100
...
2023-01-23T16:22:44.031+0100 [DEBUG]: Locking "azurerm_virtual_network.vnet" with mutex 0xc000496000: timestamp=2023-01-23T16:22:44.031+0100
2023-01-23T16:22:44.031+0100 [DEBUG]: Locked "azurerm_virtual_network.vnet" with mutex 0xc000496000: timestamp=2023-01-23T16:22:44.031+0100
...
2023-01-23T16:22:48.148+0100 [DEBUG]: Unlocking "azurerm_virtual_network.vnet" with mutex 0xc00024cfa0: timestamp=2023-01-23T16:22:48.148+0100
...

At first I thought it was because I put the virtual hub connection inside a module, but the issue still arises with the resources placed directly in the root files.

As I am not used to the inner workings of the provider, I do not think I will be able to understand how the mutex system works on my own. Can you help me figure out why two mutexes are created?

aurel333 commented 1 year ago

Continuing with the testing, I ran a test with the option -parallelism=1 and again got two different mutexes for the subnet creation and the virtual hub connection.

This does not confirm the hypothesis that there is one mutex per provider, but it points strongly in that direction. If it is true, it is a troublesome issue: the provider instantiation seems to be done by Terraform core once per provider (which makes sense, by the way), and using two or more providers is required to work on multiple subscriptions.

This means that any cross-subscription peering or virtual hub connection is at risk of ending in error if another operation that locks the virtual network runs at the same time (typically a subnet creation or a peering to a different subscription).
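
To illustrate the kind of setup this concerns, here is a minimal sketch. The provider blocks are not shown in the original report, so they are an assumption, as is putting the hub connection on the aliased provider. Only the alias name azurerm.vwan_global and the variable gbl_infra_sub come from the configuration above.

```hcl
# Default provider: the subscription holding the VNET and its subnets.
provider "azurerm" {
  features {}
}

# Aliased provider (assumed): the subscription holding the Virtual WAN hub.
provider "azurerm" {
  alias           = "vwan_global"
  subscription_id = var.gbl_infra_sub.subscribtion_id
  features {}
}

# Created through the default provider; this operation locks the VNET.
resource "azurerm_subnet" "subnet1" {
  name                 = "subnet1"
  resource_group_name  = data.azurerm_resource_group.current-rg.name
  virtual_network_name = azurerm_virtual_network.cluster-vnet.name
  address_prefixes     = [var.subnet1_cidr]
}

# Created through the aliased provider (assumed); it locks the same VNET,
# but apparently with a different mutex than the subnet above.
resource "azurerm_virtual_hub_connection" "vhub-network-connection" {
  provider                  = azurerm.vwan_global
  name                      = "testrg-to-vhub"
  virtual_hub_id            = var.vwan_hub_id
  remote_virtual_network_id = azurerm_virtual_network.cluster-vnet.id
}
```

If the hypothesis holds, an explicit depends_on between resources that touch the same VNET through different providers would be the only way to serialize them.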

Such a risk is either a bug or should be documented somewhere. I can write the documentation, but I first need confirmation that this is not considered a bug.