Open pauldotyu opened 2 years ago
Thanks for raising this issue. After testing with the latest azurerm provider and the tf config below, which is similar to yours, I cannot reproduce this issue. Could you try the tf config below to see if the issue still exists?
tf config:
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "acctest-postgreslqfs-test01"
  location = "West Europe"
}

resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "acctest-psqlflexibleserver-test01"
  resource_group_name    = azurerm_resource_group.example.name
  location               = azurerm_resource_group.example.location
  version                = "12"
  administrator_login    = "psqladmin"
  administrator_password = "B@Dh1CgR3!"
  storage_mb             = 32768
  sku_name               = "GP_Standard_D4s_v3"
}
I can replicate this issue using the latest (3.5.0) provider, but only when private network links are enabled. Creating a server via the portal with equivalent settings is successful.
Example config:
provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "example" {
  name     = "acctest-postgreslqfs-test01"
  location = "West Europe"
}

resource "azurerm_virtual_network" "example" {
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  address_space       = ["10.100.1.0/24"]
  name                = "core"
}

resource "azurerm_subnet" "database" {
  name                 = "database"
  resource_group_name  = azurerm_resource_group.example.name
  virtual_network_name = azurerm_virtual_network.example.name
  address_prefixes     = ["10.100.1.0/26"]

  delegation {
    name = "postgres-flexible"

    service_delegation {
      name = "Microsoft.DBforPostgreSQL/flexibleServers"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
      ]
    }
  }
}

resource "azurerm_private_dns_zone" "postgres" {
  name                = "core.postgres.database.azure.com"
  resource_group_name = azurerm_resource_group.example.name
}

resource "azurerm_private_dns_zone_virtual_network_link" "postgres" {
  name                  = "core.postgres.database.azure.com"
  private_dns_zone_name = azurerm_private_dns_zone.postgres.name
  virtual_network_id    = azurerm_virtual_network.example.id
  resource_group_name   = azurerm_resource_group.example.name
}

resource "azurerm_postgresql_flexible_server" "example" {
  name                   = "example-psqlflexibleserver"
  resource_group_name    = azurerm_resource_group.example.name
  location               = azurerm_resource_group.example.location
  version                = "12"
  delegated_subnet_id    = azurerm_subnet.database.id
  private_dns_zone_id    = azurerm_private_dns_zone.postgres.id
  administrator_login    = "psqladmin"
  administrator_password = "H@Sh1CoR3!"
  storage_mb             = 32768
  sku_name               = "GP_Standard_D4s_v3"
  depends_on             = [azurerm_private_dns_zone_virtual_network_link.postgres]
}
Result:
Error: waiting for creation of the Postgresql Flexible Server "example-psqlflexibleserver" (Resource Group "acctest-postgreslqfs-test01"): Code="ServerGroupDropping" Message="Operations on a server group in dropping state are not allowed."
Looks like this issue has been fixed in the latest version (3.5) of the azurerm provider, even with private endpoint enabled.
I can still re-create this with my test script above. Does the intermittence of this issue suggest it's internal to Azure rather than an issue with this provider?
Aha scrap that - this is to do with the name of the server being in use or not:
resource "azurerm_postgresql_flexible_server" "example" {
  name = "example-psqlflexibleserver"
  ..
}
> Operations on a server group in dropping state are not allowed.
resource "azurerm_postgresql_flexible_server" "example" {
  name = "example-fgdkutitczvgvxr"
  ..
}
> azurerm_postgresql_flexible_server.example: Creation complete after 4m3s
As @doug-fitzmaurice-rowden rightly pointed out, it's due to the name not being available. The error message could have told us that. Throwing a generic error message for such validation errors is misleading and time-wasting.
Do note that a lingering storage container, created and attached to the terminal in the Azure web interface in order to access the database, can also be the culprit for this error message.
This is happening even after the resource is deleted. If I re-create it from the portal it works, but it fails from the Terraform script.
This is still an existing issue. Will it ever be addressed?
The issue still exists on azurerm 3.65.0; I have struggled with it for the past week. I would delete the flex server with Terraform by removing it from our script, then use the Azure CLI afterwards to make sure it was deleted, but I would still hit the above error when I tried to recreate the server.
Ditto @green-munkey I am experiencing the same issue.
Same issue for me @green-munkey. Very frustrating. We spin up and tear down envs quite regularly, so this is a problem when the DB requires the same name every time.
Experiencing with 3.63
EDIT: Scrap that. It's so inconsistent in its application. Sometimes it allows the same name, other times it does not ...
Having played around with versioning on my test repo, I have found that using azurerm 3.70 / PostgreSQL 15 seems to resolve the issue somewhat. About 10 minutes after destroying the resource group, my first Terraform apply run failed due to this error; however, my second run worked and allowed me to use the same DB name. Not sure if this will help anyone, but these are my findings.
AzureRM 3.64 here: I still randomly get the "Operations on a server group in dropping state are not allowed" error message, even after deleting the resource group containing the Postgres flexible server.
I just experienced this issue with AzureRM provider version 3.74.
Using azurerm 3.72 in region West US 2, I still randomly get "message":"Operations on a server group in dropping state are not allowed." I resolved the issue by changing the location from West US 2 to East US: https://azure.microsoft.com/en-us/explore/global-infrastructure/geographies/#geographies
Hitting this with 3.73 in westeurope.
3.75, swedencentral: the error still sometimes occurs. This is a database that is occasionally removed, after which a new one with the same name is created.
same problem, az 3.75 centralus
Similar problems appeared in West US2, but I was able to fix them by recreating the server deployment a few times.
As I wrote before, this problem occurs when I try to recreate a server with the same name. Finally I checked whether I could deploy it manually in the portal: same problem, Deploy Failed, "message": "Operations on a server group in dropping state are not allowed." It's an Azure problem, not an azurerm one; Azure probably delays the removal process. To the end user the server looks removed, but inside the Azure cloud it still exists.
Same issue, changed resource group and flexible server fails to recreate with the mentioned error:
"error":{"code":"ServerGroupDropping","message":"Operations on a server group in dropping state are not allowed."}}
edit: retrying after 10 minutes or so seems to fix the issue.
Does TF check the server name the same way the az CLI does? For example, the command
    az postgres flexible-server create --name ${server_name} --resource-group ${resource_group} --subscription ${subscription} --yes --debug
shows that it makes the following call:
    cli.azure.cli.core.sdk.policies: Request body:
    cli.azure.cli.core.sdk.policies: {"name": "test-uniq-name-31415926", "type": "Microsoft.DBforPostgreSQL/flexibleServers"}
    urllib3.connectionpool: Starting new HTTPS connection (1): management.azure.com:443
    urllib3.connectionpool: https://management.azure.com:443 "POST /subscriptions/XXXXXXX-XXX-XXX/providers/Microsoft.DBforPostgreSQL/locations/westeurope/checkNameAvailability?api-version=2022-12-01 HTTP/1.1" 200 None
    cli.azure.cli.core.sdk.policies: Response content:
    cli.azure.cli.core.sdk.policies: {"name":"test-uniq-name-31415926","type":"Microsoft.DBforPostgreSQL/flexibleServers","nameAvailable":false,"reason":"AlreadyExists","message":"Specified server name is already used."}
But the command
    az postgres flexible-server show --name test-uniq-name-31415926 --resource-group ${rg} --subscription ${sname} --debug
reports a 404:
    DEBUG: cli.azure.cli.core.sdk.policies: This request has no body
    DEBUG: urllib3.connectionpool: Starting new HTTPS connection (1): management.azure.com:443
    DEBUG: urllib3.connectionpool: https://management.azure.com:443 "GET /subscriptions/XXXXX-XXXX-XXXX/resourceGroups/connect-dev/providers/Microsoft.DBforPostgreSQL/flexibleServers/test-uniq-name-31415926?api-version=2022-12-01 HTTP/1.1" 404 256
I suppose that servers are not deleted directly but are first placed in a "queue for deletion". At that moment the name is removed from some internal "list of used servers". So the direct resource link returns 404 because it uses the "list of used servers", whereas the create procedure knows about the "queue for deletion" and uses another API method to check availability: checkNameAvailability. Does TF use checkNameAvailability internally before it attempts to create a server?
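For anyone who wants to run the same availability check by hand, here is a minimal sketch that calls the checkNameAvailability endpoint from the debug log above through `az rest`. The function names and argument order are my own (hypothetical), the API version is simply the one that appears in the log, and it assumes the az CLI is installed and you are already logged in.

```shell
#!/usr/bin/env bash
# Sketch only: ask Azure whether a flexible-server name is free, using the
# endpoint seen in the az CLI debug log. All function names are hypothetical.

# Build the checkNameAvailability URL for a subscription and location.
check_name_url() {
  local subscription="$1" location="$2"
  printf 'https://management.azure.com/subscriptions/%s/providers/Microsoft.DBforPostgreSQL/locations/%s/checkNameAvailability?api-version=2022-12-01' \
    "$subscription" "$location"
}

# Build the JSON request body for a given server name.
check_name_body() {
  local name="$1"
  printf '{"name": "%s", "type": "Microsoft.DBforPostgreSQL/flexibleServers"}' "$name"
}

# Send the request; needs the az CLI and an authenticated session.
check_name_available() {
  local subscription="$1" location="$2" name="$3"
  az rest --method post \
    --url "$(check_name_url "$subscription" "$location")" \
    --body "$(check_name_body "$name")"
}
```

Something like `check_name_available "$subscription" westeurope test-uniq-name-31415926` should print the JSON response; going by the log above, a name that is still held comes back with `"nameAvailable":false`.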
No, azurerm (at least in version 3.78) uses the direct resource link to check for the server:
    2023-11-10T00:14:54.534+0100 [DEBUG] provider.terraform-provider-azurerm_v3.78.0_x5: AzureRM Request:
    GET /subscriptions/XXXXXXXXXXXX/resourceGroups/XXXXXXXXX/providers/Microsoft.DBforPostgreSQL/flexibleServers/flex-test-3141592-v2?api-version=2023-03-01-preview HTTP/1.1
    Host: management.azure.com
    ...
    HTTP/2.0 404 Not Found
Therefore this is the only way:
    while true; do
      terraform apply -auto-approve
      if [ $? -eq 0 ]; then
        break
      fi
      sleep 10
    done
:(
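Since several people in this thread waited hours or days without the name being released, an unbounded loop can spin forever in a pipeline. Below is a minimal sketch of the same retry idea with an attempt limit. `retry_cmd` is a hypothetical helper name, not part of Terraform or the az CLI, and the attempt count and delay are whatever you choose to pass in.

```shell
#!/usr/bin/env bash
# Sketch only: retry a command a limited number of times with a fixed delay.
# retry_cmd is a hypothetical helper, not a Terraform or az CLI feature.

retry_cmd() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1
  while true; do
    # Run the command given as the remaining arguments; stop on success.
    if "$@"; then
      return 0
    fi
    # Give up once the attempt budget is used up.
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}
```

For example, `retry_cmd 30 60 terraform apply -auto-approve` would try up to 30 times, one minute apart, and return non-zero if every attempt fails, so the pipeline fails instead of hanging.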
Same error on 3.81.0
Plan: 1 to add, 0 to change, 0 to destroy.
...
azurerm_postgresql_flexible_server.db: Still creating... [4m0s elapsed]
╷
│ Error: creating Flexible Server (Subscription: "redacted"
│ Resource Group Name: "pro-apps"
│ Flexible Server Name: "pro-db"): polling after Create: polling failed: the Azure API returned the following error:
│
│ Status: "ServerGroupDropping"
│ Code: ""
│ Message: "Operations on a server group in dropping state are not allowed."
│ Activity Id: ""
│
│ ---
│
│ API Response:
│
│ ----[start]----
│ {"name":"redacted","status":"Failed","startTime":"2023-11-17T12:16:34.697Z","error":{"code":"ServerGroupDropping","message":"Operations on a server group in dropping state are not allowed."}}
│ -----[end]-----
│
After 1 hour:
Plan: 1 to add, 0 to change, 0 to destroy.
...
azurerm_postgresql_flexible_server.db: Still creating... [3m0s elapsed]
╷
│ Error: creating Flexible Server (Subscription: "redacted"
│ Resource Group Name: "pro-apps"
│ Flexible Server Name: "pro-db"): polling after Create: polling failed: the Azure API returned the following error:
│
│ Status: "ServerGroupDropping"
│ Code: ""
│ Message: "Operations on a server group in dropping state are not allowed."
│ Activity Id: ""
│
│ ---
│
│ API Response:
│
│ ----[start]----
│ {"name":"redacted","status":"Failed","startTime":"2023-11-17T13:17:24.953Z","error":{"code":"ServerGroupDropping","message":"Operations on a server group in dropping state are not allowed."}}
│ -----[end]-----
│
Changing the name works!!
Plan: 3 to add, 0 to change, 2 to destroy.
Apply complete! Resources: 3 added, 0 changed, 2 destroyed.
Still a problem on 3.82.0 in norwayeast; waited over 12 hrs.
I'm encountering the same issue on 3.84.0, waited over 4 days. Could it be related to this? Is Azure keeping hold of the name for this feature until it expires?
It seems to be an error in the azure backend services - and it happens if there is a short time-frame between the deletion and the create. I've opened a support-request - but as usual these will just go silent after a while.
Had this issue for the first time yesterday while automatically trying to re-deploy a Postgres flexible server with a different network configuration (I had to switch to private endpoint from the old configuration that was using VNET injection). The existing instance had to be replaced, so this error occurred. I am curious about the fact that for some people this seems to be a transient issue, while for others changing the name is the only option (other comments have already suggested an expiry time for the deleted resource). In any case, this is super worrying in the context of automatically provisioning infrastructure via a pipeline.
> I'm encountering the same issue on 3.84.0, waited over 4 days. Could it be related to this? Is Azure keeping hold of the name for this feature until it expires?
Interesting. I have the same feeling.
Long read: A year ago, we encountered an Azure error: Azure could not restore a certain Postgres Single Server from a backup. Some servers could be restored, but small servers with no changes could not. Technical support's first answer was “Why do you need to restore? You don't have any changes there. Just upload the data again from your copy.” After some discussion they acknowledged the problem. It turned out that the backup was done through the WAL mechanism, and the backup does not start if there are no changes. Azure corrected something in the backup system and everything worked. To test this theory here, we can either 1) right after creating the server, upload test data that is large enough to start the internal backup process, or 2) right after creating it, start a restore procedure, which will initiate backup creation.
And then delete the server. If the theory is correct, then at the moment of deletion we will have a backup, and Azure will be able to delete the server without waiting until the backup is made!
P.S. Only one thing worries me about this: this is not an internal forum for Azure developers but a discussion of TF bugs. Why should we, rather than Azure, be guessing at the reasons for the problem? I don't know :(
same problem, az 3.85 northeurope, eastus2
Same problem on 3.88.0, eastus2
I ran into the same problem when I tried to recreate the resource.
The issue was resolved when I set lifecycle policy to false.
@sharapy Where is that setting, I don't see it on the azurerm_postgresql_flexible_server Terraform module?
This is an Azure problem, as mentioned by @davidkarlsen. In my case, I had a deployment that failed due to some networking issues. Two PostgreSQL flexible servers were being deployed when it failed; after that first failure, I tried to redeploy with the networking issue fixed. During re-deployment, one server was deployed successfully but the other failed with the error:
│ Status: "ServerGroupDropping"
│ Code: ""
│ Message: "Operations on a server group in dropping state are not allowed."
I created a support ticket and got the following response: “Azure PostgreSQL Flexible Server does not guarantee that the customer would be able to re-use the same name once a resource has been dropped.
When an Azure PostgreSQL Flexible Server resource is dropped it marks the instance as no longer in use (after which the customer is no longer billed for the resource) following which it proceeds to clean-up the internal resources associated with the instance. There are a number of internal systems that maintain a DSN cache, which Azure PostgreSQL does not control. Due to which any request made to create a server using a name that is already pre-existing in the cache would fail.”
I am not sure whether they were able to manually remove the DNS cache from the internal systems or whether waiting a week before re-deploying did the trick. Anyhow, I was able to deploy the server with the same name after about a week, having also asked them to manually remove the DNS cache.
This is good to know, thanks @poxy91.
A better error message or a clearer indicator in the docs would have helped here.
> Aha scrap that - this is to do with the name of the server being in use or not:
> resource "azurerm_postgresql_flexible_server" "example" {
>   name = "example-psqlflexibleserver"
>   ..
> }
> > Operations on a server group in dropping state are not allowed.
> resource "azurerm_postgresql_flexible_server" "example" {
>   name = "example-fgdkutitczvgvxr"
>   ..
> }
> > azurerm_postgresql_flexible_server.example: Creation complete after 4m3s
So, basically the DB name has to be unique globally? Like not even within the same organisation? :eyes: If that's the case the error message could definitely be improved from Azure's end.
I got this issue on azurerm provider 3.108.0. I deleted the env 3 days prior to recreation and still got this issue.
Same issue for me here. Deleted my PostgreSQL Flexible Server and 4 days later the name is still not available. @poxy91 Do you know whether waiting longer or the DNS cache removal was the main reason that it was fixed for you?
One thing that worked for me: try using the Azure CLI command to delete the server again. After that I was able to create the server again with the same name. @rensoostenbachBL
@AayushbajajCAW What call did you make? Because az postgres flexible-server list returns an empty list for me, and az postgres flexible-server delete --name <name> --resource-group <rg-name> --yes tells me that it cannot find the resource group (rightfully so, because I deleted it via the portal myself).
So I'm now in a state where I can not find anything regarding my old deployment in the Portal, but I can not deploy a server with the same name.
Terraform Version
1.1.9
AzureRM Provider Version
3.4.0
Affected Resource(s)/Data Source(s)
azurerm_postgresql_flexible_server
Terraform Configuration Files
Debug Output/Panic Output
Expected Behaviour
No response
Actual Behaviour
No response
Steps to Reproduce
No response
Important Factoids
No response
References
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/postgresql_flexible_server