hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

PostgreSQL Flexible Server error #16622

Open pauldotyu opened 2 years ago

pauldotyu commented 2 years ago

Terraform Version

1.1.9

AzureRM Provider Version

3.4.0

Affected Resource(s)/Data Source(s)

azurerm_postgresql_flexible_server

Terraform Configuration Files

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "example-resources"
  location = "West Europe"
}

resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "example-psqlflexibleserver"
  resource_group_name    = azurerm_resource_group.example.name
  location               = azurerm_resource_group.example.location
  version                = "12"
  administrator_login    = "psqladmin"
  administrator_password = "H@Sh1CoR3!"

  storage_mb = 32768

  sku_name = "GP_Standard_D4s_v3"
}

Debug Output/Panic Output

│ Error: waiting for creation of the Postgresql Flexible Server "example-psqlflexibleserver" (Resource Group "example-resources"): Code="ServerGroupDropping" Message="Operations on a server group in dropping state are not allowed."
│ 
│   with azurerm_postgresql_flexible_server.example,
│   on main.tf line 10, in resource "azurerm_postgresql_flexible_server" "example":
│   10: resource "azurerm_postgresql_flexible_server" "example" {

References

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/postgresql_flexible_server

neil-yechenwei commented 2 years ago

Thanks for raising this issue. After testing with the latest azurerm provider and the tf config below, which is similar to yours, I cannot reproduce this issue. Could you try the tf config below to see if the issue still exists?

tf config:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "acctest-postgreslqfs-test01"
  location = "West Europe"
}

resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "acctest-psqlflexibleserver-test01"
  resource_group_name    = azurerm_resource_group.example.name
  location               = azurerm_resource_group.example.location
  version                = "12"
  administrator_login    = "psqladmin"
  administrator_password = "B@Dh1CgR3!"

  storage_mb = 32768

  sku_name = "GP_Standard_D4s_v3"
}

doug-fitzmaurice-rowden commented 2 years ago

I can replicate this issue using the latest (3.5.0) provider, but only when private network links are enabled. Creating a server via the portal with equivalent settings is successful.

Example config:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "acctest-postgreslqfs-test01"
  location = "West Europe"
}

resource "azurerm_virtual_network" "example" {
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name

  address_space = ["10.100.1.0/24"]
  name          = "core"
}

resource "azurerm_subnet" "database" {
  name                 = "database"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.example.name
  address_prefixes     = ["10.100.1.0/26"]

  delegation {
    name = "postgres-flexible"
    service_delegation {
      name    = "Microsoft.DBforPostgreSQL/flexibleServers"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
      ]
    }
  }
}

resource "azurerm_private_dns_zone" "postgres" {
  name                = "core.postgres.database.azure.com"
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_private_dns_zone_virtual_network_link" "postgres" {
  name                  = "core.postgres.database.azure.com"
  private_dns_zone_name = azurerm_private_dns_zone.postgres.name
  virtual_network_id    = azurerm_virtual_network.example.id
  resource_group_name   = azurerm_resource_group.example.name
}

resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "example-psqlflexibleserver"
  resource_group_name    = azurerm_resource_group.example.name
  location               = azurerm_resource_group.example.location
  version                = "12"
  delegated_subnet_id    = azurerm_subnet.database.id
  private_dns_zone_id    = azurerm_private_dns_zone.postgres.id
  administrator_login    = "psqladmin"
  administrator_password = "H@Sh1CoR3!"

  storage_mb = 32768

  sku_name   = "GP_Standard_D4s_v3"
  depends_on = [azurerm_private_dns_zone_virtual_network_link.postgres]

}

Result:

Error: waiting for creation of the Postgresql Flexible Server "example-psqlflexibleserver" (Resource Group "acctest-postgreslqfs-test01"): Code="ServerGroupDropping" Message="Operations on a server group in dropping state are not allowed."

pauldotyu commented 2 years ago

Looks like this issue has been fixed in the latest version (3.5) of the azurerm provider, even with private endpoint enabled.

doug-fitzmaurice-rowden commented 2 years ago

I can still re-create this with my test script above. Does the intermittence of this issue suggest it's internal to Azure rather than an issue with this provider?

doug-fitzmaurice-rowden commented 2 years ago

Aha scrap that - this is to do with the name of the server being in use or not:

resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "example-psqlflexibleserver"
  ..
}
> Operations on a server group in dropping state are not allowed.
resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "example-fgdkutitczvgvxr"
  ..
}
> azurerm_postgresql_flexible_server.example: Creation complete after 4m3s

prabhakarreddy1234 commented 2 years ago

As @doug-fitzmaurice-rowden rightly pointed out, it's due to the name not being available. The error message could have told us that. Throwing a generic error message for such validation errors is misleading and wastes time.
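Since a fresh name succeeds where a recently used one fails, one possible workaround for environments that are torn down and rebuilt is to generate a new server name for each deployment. The following is only a sketch: the helper name `unique_server_name` and the Terraform variable `server_name` are hypothetical and would need to match your own configuration.

```shell
# Hypothetical helper: append a random 8-character suffix to a name prefix
# so that each re-creation uses a server name that was not used before.
unique_server_name() {
  local prefix="$1"
  local suffix
  # Read a fixed number of random bytes, keep only lowercase letters and
  # digits, and take the first 8 of them. LC_ALL=C keeps tr byte-oriented.
  suffix="$(head -c 512 /dev/urandom | LC_ALL=C tr -dc 'a-z0-9' | cut -c1-8)"
  printf '%s-%s\n' "$prefix" "$suffix"
}

# Example usage (assumes the configuration declares a variable "server_name"
# and uses it for azurerm_postgresql_flexible_server.name):
#   export TF_VAR_server_name="$(unique_server_name example-psql)"
#   terraform apply -auto-approve
```

The trade-off is that the server's hostname changes on every rebuild, so anything that connects to it has to read the name from Terraform outputs rather than hard-coding it.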

aendreas commented 1 year ago

Note that a lingering storage container, created and attached to the terminal in the Azure web interface in order to access the database, can also be the culprit for this error message.

prashantguleria commented 1 year ago

This happens even after the resource is deleted. Re-creating it from the portal works, but it fails from the Terraform script.

Kemyke commented 1 year ago

This is still an existing issue. Will it ever be addressed?

green-munkey commented 1 year ago

The issue still exists on azurerm 3.65.0; I have struggled with it for the past week. I would delete the flex server with Terraform by removing it from our script, then use the Azure CLI afterwards to make sure it was deleted, but I would still hit the above error when I tried to recreate the server.

DanPaseltiner commented 1 year ago

Ditto @green-munkey I am experiencing the same issue.

Griffa87 commented 1 year ago

Same issue for me @green-munkey. Very frustrating. We spin up and tear down envs quite regularly, so this is a problem when the DB requires the same name every time.

edwardsitj-stratascale commented 1 year ago

Experiencing this with 3.63.

Griffa87 commented 1 year ago

EDIT: Scrap that. It's so inconsistent in its application. Sometimes it allows the same name, other times it does not...

Having played around with versioning on my test repo, I have found that using azurerm 3.70 / PostgreSQL 15 seems to resolve the issue somewhat. About 10 minutes after destroying the resource group, my first Terraform apply run failed with this error; however, my second run worked and allowed me to use the same DB name. Not sure if this will help anyone, but these are my findings.

ValentinPettmann commented 1 year ago

AzureRM 3.64 here: I still randomly get the "Operations on a server group in dropping state are not allowed" error message, even after deleting the resource group containing the Postgres flexible server.

ajung-on commented 1 year ago

I just experienced this issue with AzureRM provider version 3.74.

madKrypton commented 1 year ago

Using azurerm 3.72 in region West US 2, I am still randomly getting "message":"Operations on a server group in dropping state are not allowed." I resolved the issue by changing the location from West US 2 to East US: https://azure.microsoft.com/en-us/explore/global-infrastructure/geographies/#geographies

The-Judge commented 11 months ago

Hitting this with 3.73 in westeurope.

plalnol commented 11 months ago

3.75, swedencentral: the error still sometimes occurs. This is a database that is occasionally removed, after which a new one with the same name is created.

DzianisMatveyeu commented 11 months ago

same problem, az 3.75 centralus

madKrypton commented 11 months ago

Similar problems appeared in West US2, but I was able to fix them by recreating the server deployment a few times.

plalnol commented 11 months ago

As I wrote before, this problem occurs when I try to recreate a server with the same name. I finally checked whether I could deploy it manually in the portal and hit the same problem: Deploy Failed - "message": "Operations on a server group in dropping state are not allowed." It's an Azure problem, not an azurerm one; Azure probably delays the removal process. To the end user the server looks removed, but inside the Azure cloud it still exists.

AurimasNav commented 11 months ago

Same issue, changed resource group and flexible server fails to recreate with the mentioned error:

"error":{"code":"ServerGroupDropping","message":"Operations on a server group in dropping state are not allowed."}}

Edit: retrying after 10 minutes or so seems to fix the issue.

terezbw commented 11 months ago

Does TF check the server name the same way the az CLI does? For example, the command

az postgres flexible-server create --name ${server_name} --resource-group ${resource_group} --subscription ${subscription} --yes --debug

shows that it makes the following call:

cli.azure.cli.core.sdk.policies: Request body:
cli.azure.cli.core.sdk.policies: {"name": "test-uniq-name-31415926", "type": "Microsoft.DBforPostgreSQL/flexibleServers"}
urllib3.connectionpool: Starting new HTTPS connection (1): management.azure.com:443
urllib3.connectionpool: https://management.azure.com:443 "POST /subscriptions/XXXXXXX-XXX-XXX/providers/Microsoft.DBforPostgreSQL/locations/westeurope/checkNameAvailability?api-version=2022-12-01 HTTP/1.1" 200 None
cli.azure.cli.core.sdk.policies: Response content:
cli.azure.cli.core.sdk.policies: {"name":"test-uniq-name-31415926","type":"Microsoft.DBforPostgreSQL/flexibleServers","nameAvailable":false,"reason":"AlreadyExists","message":"Specified server name is already used."}

But the command

az postgres flexible-server show --name test-uniq-name-31415926 --resource-group ${rg} --subscription ${sname} --debug

reports 404:

DEBUG: cli.azure.cli.core.sdk.policies: This request has no body
DEBUG: urllib3.connectionpool: Starting new HTTPS connection (1): management.azure.com:443
DEBUG: urllib3.connectionpool: https://management.azure.com:443 "GET /subscriptions/XXXXX-XXXX-XXXX/resourceGroups/connect-dev/providers/Microsoft.DBforPostgreSQL/flexibleServers/test-uniq-name-31415926?api-version=2022-12-01 HTTP/1.1" 404 256

I suppose that servers are not deleted immediately, but are first placed in a "queue for deletion". At that moment the name is removed from some internal "list of used servers". So the direct resource link returns 404, because it uses the "list of used servers". But the create procedure knows about the "queue for deletion" and uses a different API method to check availability: checkNameAvailability. Does TF internally call checkNameAvailability before attempting to create the server?
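The checkNameAvailability request seen in the debug output can also be issued directly with `az rest`, which makes it possible to test a name before running Terraform. This is a rough sketch, assuming the Azure CLI is installed and logged in; the function names and argument order are hypothetical, and the endpoint and api-version are simply copied from the log above.

```shell
# Print the raw "nameAvailable" value returned by checkNameAvailability.
# Arguments (hypothetical order): subscription id, location, server name.
check_pg_name_available() {
  local subscription="$1" location="$2" name="$3"
  az rest --method post \
    --url "https://management.azure.com/subscriptions/${subscription}/providers/Microsoft.DBforPostgreSQL/locations/${location}/checkNameAvailability?api-version=2022-12-01" \
    --body "{\"name\": \"${name}\", \"type\": \"Microsoft.DBforPostgreSQL/flexibleServers\"}" \
    --query nameAvailable --output tsv
}

# Succeed (exit status 0) only if Azure reports the name as available.
# The value is lowercased first so that the comparison does not depend on
# how the CLI happens to print booleans.
pg_name_is_available() {
  local result
  result="$(check_pg_name_available "$@" | tr '[:upper:]' '[:lower:]')"
  [ "$result" = "true" ]
}

# Example usage:
#   if pg_name_is_available "$SUBSCRIPTION_ID" westeurope test-uniq-name-31415926; then
#     terraform apply -auto-approve
#   fi
```

Whether a positive answer from this endpoint guarantees that a subsequent create succeeds is not established in this thread, so treat it as a pre-check rather than a fix.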

terezbw commented 11 months ago

No, azurerm (at least version 3.78) uses the direct resource link to check for the server:

2023-11-10T00:14:54.534+0100 [DEBUG] provider.terraform-provider-azurerm_v3.78.0_x5: AzureRM Request:
GET /subscriptions/XXXXXXXXXXXX/resourceGroups/XXXXXXXXX/providers/Microsoft.DBforPostgreSQL/flexibleServers/flex-test-3141592-v2?api-version=2023-03-01-preview HTTP/1.1
Host: management.azure.com
...
HTTP/2.0 404 Not Found

Therefore, this is the only way:

while true; do
  terraform apply -auto-approve
  if [ $? -eq 0 ]; then
    break
  fi
  sleep 10
done

:(
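The loop above never terminates if the name stays blocked, which is a problem in a pipeline. A bounded variant could look like the sketch below; the `retry` function, its argument order, and the chosen limits are illustrative assumptions, not something the provider or Terraform ships.

```shell
# Retry a command up to max_attempts times, sleeping delay seconds between
# attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local max_attempts="$1" delay="$2"
  shift 2
  local attempt=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after ${attempt} attempts" >&2
      return 1
    fi
    echo "attempt ${attempt} failed; retrying in ${delay}s" >&2
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Example: try for roughly an hour (60 attempts, 60 seconds apart).
#   retry 60 60 terraform apply -auto-approve
```

Note that this retries on any failure of the command, not only on ServerGroupDropping, so an unrelated error in the configuration is repeated until the attempt limit is reached.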

ltmleo commented 10 months ago

Same error on 3.81.0

Plan: 1 to add, 0 to change, 0 to destroy.
...
azurerm_postgresql_flexible_server.db: Still creating... [4m0s elapsed]
╷
│ Error: creating Flexible Server (Subscription: "redacted"
│ Resource Group Name: "pro-apps"
│ Flexible Server Name: "pro-db"): polling after Create: polling failed: the Azure API returned the following error:
│ 
│ Status: "ServerGroupDropping"
│ Code: ""
│ Message: "Operations on a server group in dropping state are not allowed."
│ Activity Id: ""
│ 
│ ---
│ 
│ API Response:
│ 
│ ----[start]----
│ {"name":"redacted","status":"Failed","startTime":"2023-11-17T12:16:34.697Z","error":{"code":"ServerGroupDropping","message":"Operations on a server group in dropping state are not allowed."}}
│ -----[end]-----
│ 

After 1 hour:

Plan: 1 to add, 0 to change, 0 to destroy.
...
azurerm_postgresql_flexible_server.db: Still creating... [3m0s elapsed]
╷
│ Error: creating Flexible Server (Subscription: "redacted"
│ Resource Group Name: "pro-apps"
│ Flexible Server Name: "pro-db"): polling after Create: polling failed: the Azure API returned the following error:
│ 
│ Status: "ServerGroupDropping"
│ Code: ""
│ Message: "Operations on a server group in dropping state are not allowed."
│ Activity Id: ""
│ 
│ ---
│ 
│ API Response:
│ 
│ ----[start]----
│ {"name":"redacted","status":"Failed","startTime":"2023-11-17T13:17:24.953Z","error":{"code":"ServerGroupDropping","message":"Operations on a server group in dropping state are not allowed."}}
│ -----[end]-----
│ 

Changing the name works!!

Plan: 3 to add, 0 to change, 2 to destroy.
Apply complete! Resources: 3 added, 0 changed, 2 destroyed.

davidkarlsen commented 10 months ago

Still a problem on 3.82.0 in norwayeast; waited over 12 hours.

twerthi commented 9 months ago

I'm encountering the same issue on 3.84.0, waited over 4 days. Could it be related to this? Is Azure keeping hold of the name for this feature until it expires?

davidkarlsen commented 9 months ago

It seems to be an error in the azure backend services - and it happens if there is a short time-frame between the deletion and the create. I've opened a support-request - but as usual these will just go silent after a while.

FrancescoCipolla-TomTom commented 9 months ago

Had this issue for the first time yesterday while automatically re-deploying a Postgres flexible server with a different network configuration (I had to switch to a private endpoint from the old configuration, which used VNET injection). The existing instance had to be replaced, so this error occurred. I am curious why for some people this seems to be a transient issue, while for others changing the name is the only option (other comments have already suggested an expiry time for the deleted resource). In any case, this is super worrying in the context of automatically provisioning infrastructure via a pipeline.

terezbw commented 9 months ago

@twerthi wrote: "I'm encountering the same issue on 3.84.0, waited over 4 days. Could it be related to this? Is Azure keeping hold of the name for this feature until it expires?"

Interesting. I have the same feeling.

Long read: A year ago we ran into an Azure error: Azure could not restore certain Postgres Single Server instances from backup. Some could be restored, but small servers with no changes could not. At first, technical support answered, "Why do you need to restore? You don't have any changes there. Just upload them again from your copy." But after some discussion they acknowledged the problem. It turned out that the backup was done through the WAL mechanism, and the backup does not start if there are no changes. Azure corrected something in the backup system and everything worked. To test this theory, we can either 1) right after creating the server, upload test data large enough to start the internal backup process, or 2) right after creating the server, start a restore procedure, which will trigger creation of a backup.

And then delete the server. If the theory is correct, then at the moment of deletion we will have a backup, and Azure will be able to delete the server without waiting until a backup is made!

P.S. Only one thing worries me about this: this is not an internal forum for Azure developers but a discussion of TF bugs. Why should we, instead of Azure, be guessing at the reasons for the problem? I don't know :(

Roboman341 commented 9 months ago

same problem, az 3.85 northeurope, eastus2

Jamesits commented 8 months ago

Same problem on 3.88.0, eastus2

sharapy commented 7 months ago

I ran into the same problem when I tried to recreate the resource.

The issue was resolved when I set lifecycle policy to false.

twerthi commented 7 months ago

@sharapy Where is that setting? I don't see it on the azurerm_postgresql_flexible_server Terraform resource.

poxy91 commented 7 months ago

This is an Azure problem, as mentioned by @davidkarlsen. In my case, I had a deployment that failed due to some networking issues. Two PostgreSQL flexible servers were being deployed at the time. After the first failure, I tried to redeploy with the networking issue fixed. During the re-deployment, one server was deployed successfully but the other failed with the error:

│ Status: "ServerGroupDropping"
│ Code: ""
│ Message: "Operations on a server group in dropping state are not allowed."

I created a support ticket and got the following response: “Azure PostgreSQL Flexible Server does not guarantee that the customer would be able to re-use the same name once a resource has been dropped.

When an Azure PostgreSQL Flexible Server resource is dropped it marks the instance as no longer in use (after which the customer is no longer billed for the resource) following which it proceeds to clean-up the internal resources associated with the instance. There are a number of internal systems that maintain a DNS cache, which Azure PostgreSQL does not control. Due to which any request made to create a server using a name that is already pre-existing in the cache would fail.”

I am not sure whether they manually cleared the DNS cache in the internal systems or whether waiting a week before re-deploying did the trick. Either way, I was able to deploy the server with the same name after waiting about a week and also asking them to manually clear the DNS cache.

dslatkin commented 6 months ago

This is good to know, thanks @poxy91.

A better error message or a clearer note in the docs would have helped here.

Jarmos-san commented 6 months ago

Replying to @doug-fitzmaurice-rowden's comment above ("this is to do with the name of the server being in use or not", where "example-psqlflexibleserver" failed and "example-fgdkutitczvgvxr" succeeded):

So, basically the DB name has to be globally unique? Not just unique within the same organisation? :eyes: If that's the case, the error message could definitely be improved on Azure's end.

AayushbajajCAW commented 3 months ago

I got this issue on azurerm provider 3.108.0. I deleted the env 3 days prior to recreating it and still got this issue.