Open joekiller opened 3 months ago
@joekiller I can't find the original issue with details. But we decided to go with abandon state because removePeering has its own set of problems. When a peering is removed it is not deleted. It then sits in this in-between state. There is a limit on the number of stale peerings that can exist. Customers were getting API errors because too many peerings existed (everytime they terraform applied and terraform destroyed a new one was created but not deleted).
While abandon doesn't "fix" the problem, it gives the terraform pipeline an opportunity to succeed without leaving any resources in an in-between state. A true fix would probably have to come from the product API side in CloudSQL that prevents immediate deletion (which would create another set of problems lol)
100% reproducible using hashicorp/google
v5.42.0
and terraform
v1.9.2
on darwin_arm64
and linux_amd64
.
To reproduce:
terraform apply -var project=$GOOGLE_PROJECT_ID
terraform destroy
Error message:
╷
│ Error: Unable to remove Service Networking Connection, err: Error waiting for Delete Service Networking Connection: Error code 9, message: Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection.
│ Help Token: AQAb6By_iOkEaXMwmQNzXPulToF-cG4Rtk9gdS7ayT_D8ziGThjT9lT_ZCwcIVUH7Usrmj_Fb400_krKcE2eV4ROLJftZKd8RtniyGZm1yy745HD
│
│
╵
Terraform config:
provider "google" {
project = var.project
region = "europe-north1"
zone = "europe-north1-a"
}
variable "project" {
default = "veritru-dev-332314"
}
resource "google_compute_network" "vpc" {
name = "vpc"
auto_create_subnetworks = false
}
resource "google_compute_global_address" "cloud_sql_ip_range" {
project = var.project
provider = google
name = "cloud-sql-ip-range"
purpose = "VPC_PEERING"
address_type = "INTERNAL"
address = "172.20.0.0"
prefix_length = 16
network = google_compute_network.vpc.id
}
resource "google_service_networking_connection" "cloud_sql_vpc_connection" {
network = google_compute_network.vpc.id
service = "servicenetworking.googleapis.com"
reserved_peering_ranges = [google_compute_global_address.cloud_sql_ip_range.name]
}
resource "random_integer" "db_suffix" {
min = 1000
max = 1999
}
resource "google_sql_database_instance" "postgresql-db01" {
project = var.project
name = "postgresql-db-${random_integer.db_suffix.result}"
region = "europe-north1"
database_version = "POSTGRES_13"
deletion_protection = false
settings {
tier = "db-f1-micro"
location_preference {
zone = "europe-north1-a"
}
ip_configuration {
ipv4_enabled = false
private_network = google_compute_network.vpc.self_link
}
}
depends_on = [google_service_networking_connection.cloud_sql_vpc_connection]
}
@mike-callahan, I understand what you are saying and while I agree with the following vis a vie ensuring servicenetworking.delete works correctly I do not agree that choosing abandon and removing the networks.removePeering call as a better resolution.
A true fix would probably have to come from the product API side in CloudSQL that prevents immediate deletion (which would create another set of problems lol)
As I mentioned originally, Google utilizes the networks.removePeering call when you click the delete peering button in the actual console.
I had deployed and fully removed a cloudsql instance from the project and was only able to remove the peering via the console, which utilized the networks.removePeering function. There was no association left. I agree that Google should fix the apparently dangling association. However, I do not think if a choice is to be made regarding relying on servicenetworking.delete instead of networks.removePeering vs abandoning the connection when even Google doesn't use servicenetworking.delete that abandoning is the best point of action.
I advocate that network.removePeering be used. I imagine that the in between situation where the peering connection is in flux due to infrastructure still utilizing it only existed when the dependsOn attribute was not properly applied which is specified in the documentation regarding this relationship needing to be specially applied when utilizing a private IP. I'm happy to change my opinion if you can surface the situations outside of that case. My deployment was properly using the dependsOn and deleted the cloudsql instance prior to attempting to delete the peering. Regardless of if I did the servicenetworking.delete two hours later or manually, it always failed. As soon as I used network.removePeering it cleaned up and didn't exist in a transient state.
hi @joekiller!
I tried to replicate this issue with this example. But everything works fine without errors creating and destroying the resources. The unique difference is that I added this property: auto_create_subnetworks = false
This is the used terraform code, if you have something different share it with us to try to replicate this issue again:
resource "google_compute_network" "cn_18834" {
name = "cn-18834"
auto_create_subnetworks = false
}
resource "google_compute_global_address" "cga_18834" {
name = "cga-18834"
purpose = "VPC_PEERING"
address_type = "INTERNAL"
prefix_length = 16
network = google_compute_network.cn_18834.id
}
resource "google_service_networking_connection" "snc_18834" {
network = google_compute_network.cn_18834.id
service = "servicenetworking.googleapis.com"
reserved_peering_ranges = [google_compute_global_address.cga_18834.name]
}
@ggtisc, I agree that it works in the limited context that you have cited. The the removal fails if you try to reproduce utilizing the example Johnan provided earlier or the config I referenced, #16275 (where the decision to abandon vs utilize removePeering
).
My argument is that since Google console still uses networks.removePeering instead of servicenetworking.delete that it is reasonable for Terraform to also still use the removePeering
method vs delete
+ abandon.
Yes this is very much in confluence with google_sql_database_instance
, there may be other situations. If a user utilizes google_sql_database_instance
with the appropriate depends_on
notation it is reasonable for the user to expect to not to have to abandon the peering interface when it could be deleted.
referencing possible linked issues
https://github.com/hashicorp/terraform-provider-google/issues/16735 https://github.com/hashicorp/terraform-provider-google/issues/10830 https://github.com/hashicorp/terraform-provider-google/issues/8396 https://github.com/hashicorp/terraform-provider-google/issues/3979 https://github.com/hashicorp/terraform-provider-google/issues/16697 https://github.com/crossplane-contrib/provider-upjet-gcp/issues/474 https://github.com/hashicorp/terraform-provider-google/issues/18834 https://github.com/GoogleCloudPlatform/magic-modules/pull/11332 https://github.com/GoogleCloudPlatform/magic-modules/pull/11364 https://github.com/hashicorp/terraform-provider-google/issues/15260 https://github.com/hashicorp/terraform-provider-google/issues/12902 b/305256825 b/146351146
Thanks for your thoughts @joekiller
I think it would be good to solve this "correctly". At least have terraform mirror the functionality of the console. I will investigate further. It would likely be a breaking change.
Thanks for fishing up those edge cases Mike. 😅 Seems like patching was definitely an improvement for those update situations. Hopefully a combination of patch for update, abandon options, and delete otherwise could generate the most satisfying conclusion.
@rileykarson @zli82016 @c2thorn @roaks3
Service networking connection has been a persistent issue for a while now with various changes having unintended side-effects. We have used a lot of work arounds like abandon state and the latest update peering ranges fix. I propose we coordinate a comprehensive solution to this.
Firstly I want to document the behavior of a service networking connection flow with CloudSQL so we are all in agreement.
Private Service Access for CloudSQL and others works as follows (https://cloud.google.com/service-infrastructure/docs/enabling-private-services-access)
The consumer
From the consumer perspective step 3 does more than one thing including:
The producer
Because Google managed services are deployed as service projects to a peered host project, only one service networking connection is required. New CloudSQL instances and other PSA services will deploy as service projects. That causes the following:
Users should be able to:
I propose some changes/additions to address these items:
Example
# Keep the current google_service_networking_connection as the CRUD resource
resource "google_service_networking_connection" "default" {
network = google_compute_network.peering_network.id
service = "servicenetworking.googleapis.com"
reserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name]
}
# Have an accompanying data resource
data "google_service_networking_connection" "default" {
network = google_compute_network.peering_network.id
name = "servicenetworking-googleapis-com"
}
# Offer an additive config option
# Having multiple of these blocks will append IP ranges to service networking connection
resource "google_service_networking_connection_config" "default" {
network = google_compute_network.peering_network.id
name = data.google_service_networking_connection.default
reserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name]
}
As an aside, after digging more into this and looking back at the beginning of this issue I don't believe the console uses removePeering. It looks to use servicenetworking.delete:
(Assuming you delete it from the servicenetworking pane)
If you delete the VPC peering directly (equivalent of removePeering) (by using the "VPC NETWORK PEERING" pane) instead of using the service networking pane it will "delete" the service networking connection. The connection is still likely there though and this is a GUI bug (might be the cause of https://github.com/hashicorp/terraform-provider-google/issues/15260 and b/305256825?).
Directly deleting the VPC peering does seem to be an issue as it left CloudSQL in an error state
Interestingly I can add back the peering manually and it accepts
If I then try to re-create the service networking connection after the peering is added manually it claims "success" but nothing happens. Looks to be in a broken state
Is it good to forward this issue, or just leave it as it is for now?
Sure thanks
Yea, let's forward this to the service team so that they can weigh in on the proposed solution (FWIW the 3 changes do make sense to me, and don't look like breaking changes).
@mike-callahan I'm going to unassign you for now, and if you end up working on an implementation to solve this, we can add you back
Community Note
Terraform Version & Provider Version(s)
Terraform v1.9.0 on darwin_arm64
Affected Resource(s)
google_service_networking_connection
Terraform Configuration
The details in https://github.com/hashicorp/terraform-provider-google/issues/16275#issue-1947177558 are still applicable.
Debug Output
No response
Expected Behavior
No response
Actual Behavior
No response
Steps to reproduce
terraform apply
Important Factoids
No response
References
The issue "google_service_networking_connection destroy calls appear to always fail in 5.x despite guidance" (https://github.com/hashicorp/terraform-provider-google/issues/16275) was closed by https://github.com/GoogleCloudPlatform/magic-modules/pull/9765 despite that being a bandaid and not a real fix.
Researching this issue, it appears that the google console still utilizes the networks.removePeering as the struct indicates it is monitoring
gceRemoveNetworkPeering
@mike-callahan suggested that the situation dictates
and the former is not a good solution. I advocate that the latter is implemented being that even google's console utilizes the latter technique.
There is a certain irony that even Google found this API call to not work and just worked around it instead of getting it fixed: https://github.com/GoogleCloudPlatform/magic-modules/pull/8904
b/362749609