hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

[BUG] AzureRM - Azure Postgres Flexible Server - Virtual Endpoint Attempts to re-create after Failover #27796

Open leonrob opened 2 weeks ago

leonrob commented 2 weeks ago

Terraform Version

0.13

AzureRM Provider Version

4.7.0

Affected Resource(s)/Data Source(s)

azurerm_postgresql_flexible_server_virtual_endpoint

Terraform Configuration Files

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
  name              = "testendpoint1"
  source_server_id  = data.azurerm_postgresql_flexible_server.centralpg.id
  replica_server_id = data.azurerm_postgresql_flexible_server.eastpgreplica.id
  type              = "ReadWrite"

  depends_on = [
    data.azurerm_postgresql_flexible_server.centralpg,
    data.azurerm_postgresql_flexible_server.eastpgreplica
  ]

}
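
For completeness, the two data sources referenced above aren't included in the report; reconstructed from the plan output below, they would look roughly like this (server and resource group names are taken from that plan output, so treat them as illustrative):

data "azurerm_postgresql_flexible_server" "centralpg" {
  name                = "centralus-test-demo-dev-fpg"
  resource_group_name = "centralus-development-dev-rg"
}

data "azurerm_postgresql_flexible_server" "eastpgreplica" {
  name                = "eastus2-replica-test-demo-dev-fpg"
  resource_group_name = "eastus2-cloudpipelines-dev-rg"
}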

Debug Output/Panic Output

A whitespace-only change before the failover shows 0 changes.

After manually promoting the replica server to primary in the Azure portal, the same whitespace-only change produces the following plan:

Terraform will perform the following actions:

  # azurerm_postgresql_flexible_server_virtual_endpoint.testendpoint will be created
  + resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
      + id                = (known after apply)
      + name              = "testendpoint1"
      + replica_server_id = "/subscriptions/XX/resourceGroups/eastus2-cloudpipelines-dev-rg/providers/Microsoft.DBforPostgreSQL/flexibleServers/eastus2-replica-test-demo-dev-fpg"
      + source_server_id  = "/subscriptions/XX/resourceGroups/centralus-development-dev-rg/providers/Microsoft.DBforPostgreSQL/flexibleServers/centralus-test-demo-dev-fpg"
      + type              = "ReadWrite"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Expected Behaviour

Terraform should detect that a functional virtual endpoint is already assigned to both servers and report no changes.

Actual Behaviour

No response

Steps to Reproduce

Promote the replica to primary in the Azure portal, then make a whitespace-only change to the configuration and run a plan.

Important Factoids

No response

References

No response

leonrob commented 2 weeks ago

Also, for what it's worth: I've attempted to add a lifecycle prevent_destroy block and it does not work. The only workaround I found is:

Create a var:

variable "create_virtual_endpoint" { type = bool default = false # Change this based on your workspace context }

Use the var as a bool to decide whether the endpoint gets created. On initial creation it would need to be set to true, then changed to false in a separate PR afterwards. I'm trying to reduce the number of steps required.

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" { count = var.create_virtual_endpoint ? 1 : 0 name = "testendpoint1" source_server_id = data.azurerm_postgresql_flexible_server.centralpg.id replica_server_id = data.azurerm_postgresql_flexible_server.eastpgreplica.id type = "ReadWrite"

depends_on = [ data.azurerm_postgresql_flexible_server.centralpg, data.azurerm_postgresql_flexible_server.eastpgreplica ]

lifecycle { ignore_changes = ["*"] } }
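
If it helps, the same toggle can also be flipped from the command line instead of editing the file between PRs (standard Terraform CLI usage, assuming the variable name from the sketch above):

terraform apply -var="create_virtual_endpoint=true"   # first apply: creates the endpoint
terraform apply -var="create_virtual_endpoint=false"  # later applies: per the workaround above, the endpoint is left alone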

neil-yechenwei commented 2 weeks ago

Thanks for raising this issue. Note that prevent_destroy needs to be in the configuration from the beginning for it to take effect. That said, I can't seem to reproduce this issue. Could you double-check whether the reproduction steps below match what you expect?

Reproduce steps:

  1. tf apply with below tf config
  2. Exchange the values for zone and standby_availability_zone
  3. tf apply again

tf config:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "test" {
  name     = "acctestRG-postgresql-test01"
  location = "eastus"
}

resource "azurerm_postgresql_flexible_server" "test" {
  name                          = "acctest-fs-test01"
  resource_group_name           = azurerm_resource_group.test.name
  location                      = azurerm_resource_group.test.location
  version                       = "16"
  public_network_access_enabled = false
  administrator_login           = "adminTerraform"
  administrator_password        = "QAZwsx123"
  zone                          = "1"
  storage_mb                    = 32768
  storage_tier                  = "P30"
  sku_name                      = "GP_Standard_D2ads_v5"

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"
  }
}

resource "azurerm_postgresql_flexible_server" "test_replica" {
  name                          = "acctest-ve-replica-test01"
  resource_group_name           = azurerm_postgresql_flexible_server.test.resource_group_name
  location                      = azurerm_postgresql_flexible_server.test.location
  create_mode                   = "Replica"
  source_server_id              = azurerm_postgresql_flexible_server.test.id
  version                       = "16"
  public_network_access_enabled = false
  zone                          = "1"
  storage_mb                    = 32768
  storage_tier                  = "P30"
}

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "test" {
  name              = "acctest-ve-test01"
  source_server_id  = azurerm_postgresql_flexible_server.test.id
  replica_server_id = azurerm_postgresql_flexible_server.test_replica.id
  type              = "ReadWrite"
}

leonrob commented 2 weeks ago

Apologies, your repro is exercising a "ZoneRedundant" HA failover.

In my case the failover is actually a replica promotion: the second server is created with create_mode = "Replica" and then promoted to primary in the portal.
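
For reference, I did the promotion through the Azure portal; the CLI equivalent should be roughly the following (using the replica name from your repro config; exact flag names can vary between az CLI versions, so treat this as a sketch):

# Planned switchover: swaps the primary and replica roles without breaking replication
az postgres flexible-server replica promote \
  --resource-group acctestRG-postgresql-test01 \
  --name acctest-ve-replica-test01 \
  --promote-mode switchover \
  --promote-option planned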

CorrenSoft commented 1 week ago

According to the plan you shared, Terraform is only creating a virtual endpoint (there is no destroy step), which may suggest that the virtual endpoint was already destroyed during the failover. Could that be the case?

It is not uncommon in failover scenarios for the Terraform code to become outdated because of the changes made during the process. In that situation you need to decide between restoring the original configuration once the situation that triggered the failover has passed, or updating the code to properly describe the new state.
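
To make that second option concrete, "updating the code to describe the new state" would presumably mean swapping the two server references after the promotion, along these lines (just a sketch against the resource from the original report; whether the provider actually reads the post-failover endpoint this way is exactly what this issue is about):

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
  name = "testendpoint1"

  # After the promotion the former replica is the new primary, so the roles swap.
  source_server_id  = data.azurerm_postgresql_flexible_server.eastpgreplica.id
  replica_server_id = data.azurerm_postgresql_flexible_server.centralpg.id
  type              = "ReadWrite"
}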

leonrob commented 1 week ago

Hey CorrenSoft, thanks for the reply. It actually does NOT destroy the endpoint. I have done some extensive testing on this and I can replicate it very easily.

If possible, would you be willing to hop on a call with me? No pressure or anything, that way I can show you this. My company is a Fortune 500 but we aren't a Terraform Enterprise customer. (Although we spend a large amount with Hashi :-D )

Thanks in advance

CorrenSoft commented 1 week ago

Not sure if it would be appropriate since I don't work for HashiCorp :p Besides, I am not familiar enough (yet) with this resource; I just provided my input based on my experience with failover on other resources.

Just to add some context: did you say that the failover did not destroy the endpoint? If so, does the apply step actually create a new one?

leonrob commented 1 week ago

Oh I apologize, I thought you did! lol.

Yes, the failover did NOT destroy the endpoint, which is expected.

The database servers should be able to fail over between each other without the endpoint being destroyed.

My concern is that Terraform doesn't see the virtual endpoint when it refreshes state, even though it already exists.

It's 100% a bug on HashiCorp's end. There was another bug related to this that I was able to get someone to fix, but that person no longer works at HashiCorp.
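
For anyone digging into this, the state side can be inspected with standard Terraform CLI commands along these lines (the resource address is from my config above; the virtual endpoint ID is left as a placeholder):

# What the state currently records for the endpoint
terraform state show azurerm_postgresql_flexible_server_virtual_endpoint.testendpoint

# What a refresh alone thinks changed after the failover
terraform plan -refresh-only

# If the endpoint has genuinely dropped out of state, re-importing it is a stop-gap
terraform import azurerm_postgresql_flexible_server_virtual_endpoint.testendpoint <virtual-endpoint-resource-id>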

leonrob commented 2 days ago

Anyone from hashi take a peek at this yet?