hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0
2.29k stars 1.72k forks source link

google_service_networking_connection uses servicenetworking.delete when google console utilizes networks.removePeering #18834

Open joekiller opened 1 month ago

joekiller commented 1 month ago

Community Note

Terraform Version & Provider Version(s)

Terraform v1.9.0 on darwin_arm64

Affected Resource(s)

google_service_networking_connection

Terraform Configuration

The details in https://github.com/hashicorp/terraform-provider-google/issues/16275#issue-1947177558 are still applicable.

Debug Output

No response

Expected Behavior

No response

Actual Behavior

No response

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

The issue "google_service_networking_connection destroy calls appear to always fail in 5.x despite guidance" (https://github.com/hashicorp/terraform-provider-google/issues/16275) was closed by https://github.com/GoogleCloudPlatform/magic-modules/pull/9765 despite that being a bandaid and not a real fix.

Researching this issue, it appears that the google console still utilizes the networks.removePeering as the struct indicates it is monitoring gceRemoveNetworkPeering

[
  {
    "results": [
      {
        "data": {
          "response": {
            "name": "operations/flow/abc123",
            "metadata": {
              "@type": "type.googleapis.com/com.google.cloud.clientapi.flows.api.FlowMetadataProto",
              "code": "PENDING",
              "phantomData": {
                "@type": "type.googleapis.com/google.protobuf.Struct",
                "value": {
                  "phantomRows": [
                    {
                      "networkUrl": "projects/redacted/global/networks/network-vpc",
                      "peeringName": "servicenetworking-googleapis-com",
                      "tooltipMessageType": "RESOURCE_BEING_DELETED"
                    }
                  ]
                }
              },
              "createTime": "2024-07-24T15:03:01.471Z",
              "description": {
                "descriptionKey": "gceRemoveNetworkPeering",
                "descriptionArgs": {
                  "resourceName": "servicenetworking-googleapis-com"
                }
              },
              "project": {
                "projectNumber": "redacted11111",
                "projectId": "redacted",
                "projectName": "redacted"
              },
              "flowType": "cloud-console.coliseum.poller"
            },
            "done": false,
            "error": null,
            "response": null,
            "result": "resultNotSet"
          }
        },
        "path": []
      }
    ],
    "responseContext": {
      "eti": ""
    }
  }
]'

@mike-callahan suggested that the situation dictates

either state is abandoned or add back removePeering functionality that uses the compute api

and the former is not a good solution. I advocate that the latter is implemented being that even google's console utilizes the latter technique.

There is a certain irony that even Google found this API call to not work and just worked around it instead of getting it fixed: https://github.com/GoogleCloudPlatform/magic-modules/pull/8904

b/362749609

mike-callahan commented 1 month ago

@joekiller I can't find the original issue with details. But we decided to go with abandon state because removePeering has its own set of problems. When a peering is removed it is not deleted. It then sits in this in-between state. There is a limit on the number of stale peerings that can exist. Customers were getting API errors because too many peerings existed (everytime they terraform applied and terraform destroyed a new one was created but not deleted).

While abandon doesn't "fix" the problem, it gives the terraform pipeline an opportunity to succeed without leaving any resources in an in-between state. A true fix would probably have to come from the product API side in CloudSQL that prevents immediate deletion (which would create another set of problems lol)

johanblumenberg commented 3 weeks ago

100% reproducible using hashicorp/google v5.42.0 and terraform v1.9.2 on darwin_arm64 and linux_amd64.

To reproduce:

terraform apply -var project=$GOOGLE_PROJECT_ID
terraform destroy

Error message:

╷
│ Error: Unable to remove Service Networking Connection, err: Error waiting for Delete Service Networking Connection: Error code 9, message: Failed to delete connection; Producer services (e.g. CloudSQL, Cloud Memstore, etc.) are still using this connection.
│ Help Token: AQAb6By_iOkEaXMwmQNzXPulToF-cG4Rtk9gdS7ayT_D8ziGThjT9lT_ZCwcIVUH7Usrmj_Fb400_krKcE2eV4ROLJftZKd8RtniyGZm1yy745HD
│ 
│ 
╵

Terraform config:

provider "google" {
  project = var.project
  region  = "europe-north1"
  zone    = "europe-north1-a"
}

variable "project" {
  default = "veritru-dev-332314"
}

resource "google_compute_network" "vpc" {
  name                    = "vpc"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "cloud_sql_ip_range" {
  project       = var.project
  provider      = google
  name          = "cloud-sql-ip-range"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  address       = "172.20.0.0"
  prefix_length = 16
  network       = google_compute_network.vpc.id
}

resource "google_service_networking_connection" "cloud_sql_vpc_connection" {
  network                 = google_compute_network.vpc.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.cloud_sql_ip_range.name]
}

resource "random_integer" "db_suffix" {
  min = 1000
  max = 1999
}
resource "google_sql_database_instance" "postgresql-db01" {
  project             = var.project
  name                = "postgresql-db-${random_integer.db_suffix.result}"
  region              = "europe-north1"
  database_version    = "POSTGRES_13"
  deletion_protection = false

  settings {
    tier = "db-f1-micro"

    location_preference {
      zone = "europe-north1-a"
    }

    ip_configuration {
      ipv4_enabled    = false
      private_network = google_compute_network.vpc.self_link
    }
  }
  depends_on = [google_service_networking_connection.cloud_sql_vpc_connection]
}
joekiller commented 3 weeks ago

@mike-callahan, I understand what you are saying and while I agree with the following vis a vie ensuring servicenetworking.delete works correctly I do not agree that choosing abandon and removing the networks.removePeering call as a better resolution.

A true fix would probably have to come from the product API side in CloudSQL that prevents immediate deletion (which would create another set of problems lol)

As I mentioned originally, Google utilizes the networks.removePeering call when you click the delete peering button in the actual console.

I had deployed and fully removed a cloudsql instance from the project and was only able to remove the peering via the console, which utilized the networks.removePeering function. There was no association left. I agree that Google should fix the apparently dangling association. However, I do not think if a choice is to be made regarding relying on servicenetworking.delete instead of networks.removePeering vs abandoning the connection when even Google doesn't use servicenetworking.delete that abandoning is the best point of action.

I advocate that network.removePeering be used. I imagine that the in between situation where the peering connection is in flux due to infrastructure still utilizing it only existed when the dependsOn attribute was not properly applied which is specified in the documentation regarding this relationship needing to be specially applied when utilizing a private IP. I'm happy to change my opinion if you can surface the situations outside of that case. My deployment was properly using the dependsOn and deleted the cloudsql instance prior to attempting to delete the peering. Regardless of if I did the servicenetworking.delete two hours later or manually, it always failed. As soon as I used network.removePeering it cleaned up and didn't exist in a transient state.

ggtisc commented 2 weeks ago

hi @joekiller!

I tried to replicate this issue with this example. But everything works fine without errors creating and destroying the resources. The unique difference is that I added this property: auto_create_subnetworks = false

This is the used terraform code, if you have something different share it with us to try to replicate this issue again:

resource "google_compute_network" "cn_18834" {
  name = "cn-18834"
  auto_create_subnetworks = false
}

resource "google_compute_global_address" "cga_18834" {
  name          = "cga-18834"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = google_compute_network.cn_18834.id
}

resource "google_service_networking_connection" "snc_18834" {
  network                 = google_compute_network.cn_18834.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.cga_18834.name]
}
joekiller commented 2 weeks ago

@ggtisc, I agree that it works in the limited context that you have cited. The the removal fails if you try to reproduce utilizing the example Johnan provided earlier or the config I referenced, #16275 (where the decision to abandon vs utilize removePeering).

My argument is that since Google console still uses networks.removePeering instead of servicenetworking.delete that it is reasonable for Terraform to also still use the removePeering method vs delete + abandon.

Yes this is very much in confluence with google_sql_database_instance, there may be other situations. If a user utilizes google_sql_database_instance with the appropriate depends_on notation it is reasonable for the user to expect to not to have to abandon the peering interface when it could be deleted.

mike-callahan commented 2 weeks ago

referencing possible linked issues

https://github.com/hashicorp/terraform-provider-google/issues/16735 https://github.com/hashicorp/terraform-provider-google/issues/10830 https://github.com/hashicorp/terraform-provider-google/issues/8396 https://github.com/hashicorp/terraform-provider-google/issues/3979 https://github.com/hashicorp/terraform-provider-google/issues/16697 https://github.com/crossplane-contrib/provider-upjet-gcp/issues/474 https://github.com/hashicorp/terraform-provider-google/issues/18834 https://github.com/GoogleCloudPlatform/magic-modules/pull/11332 https://github.com/GoogleCloudPlatform/magic-modules/pull/11364 https://github.com/hashicorp/terraform-provider-google/issues/15260 https://github.com/hashicorp/terraform-provider-google/issues/12902 b/305256825 b/146351146

mike-callahan commented 2 weeks ago

Thanks for your thoughts @joekiller

I think it would be good to solve this "correctly". At least have terraform mirror the functionality of the console. I will investigate further. It would likely be a breaking change.

joekiller commented 2 weeks ago

Thanks for fishing up those edge cases Mike. 😅 Seems like patching was definitely an improvement for those update situations. Hopefully a combination of patch for update, abandon options, and delete otherwise could generate the most satisfying conclusion.

mike-callahan commented 2 weeks ago

@rileykarson @zli82016 @c2thorn @roaks3

Service networking connection has been a persistent issue for a while now with various changes having unintended side-effects. We have used a lot of work arounds like abandon state and the latest update peering ranges fix. I propose we coordinate a comprehensive solution to this.

Firstly I want to document the behavior of a service networking connection flow with CloudSQL so we are all in agreement.

Private Service Access for CloudSQL and others works as follows (https://cloud.google.com/service-infrastructure/docs/enabling-private-services-access)

The consumer

  1. Enables Service Networking API
  2. Allocates an IP range
  3. Creates a "private connection to services"

From the consumer perspective step 3 does more than one thing including:

  1. Peering the consumer VPC with a Google managed VPC
  2. Importing routes to the subnet containing the managed instance (automatic/non-configurable)

The producer

  1. Creates a project for service networking
  2. Creates a project for CloudSQL
  3. Makes the service networking project a Shared VPC host
  4. Attaches the CloudSQL project to the Shared VPC host project
  5. Deploys a SQL instance onto the Shared VPC subnet
  6. Peers with the consumer project
  7. Exports routes (automatic/non-configurable)

Because Google managed services are deployed as service projects to a peered host project, only one service networking connection is required. New CloudSQL instances and other PSA services will deploy as service projects. That causes the following:

  1. Creating and deleting a service networking connection is actually creating and deleting a Google Shared VPC. Because CloudSQL is sitting on the Shared VPC it must be fully deleted before the Shared VPC can be deleted. This causes the behavior in issue-1947177558
  2. removePeering deletes the peering between the consumer VPC and the Google Shared VPC but does not delete the Google Shared VPC. This causes the behavior in b/305256825
  3. Consumers using a their own Shared VPC service projects unintentionally overwrite each others service networking configs issue 16735
mike-callahan commented 2 weeks ago

Users should be able to:

  1. Create and destroy CloudSQL instances repeatedly
  2. Consume a service networking connection without having to create it or hard-code its self-link
  3. Update ip allocations without breaking other users terraform

I propose some changes/additions to address these items:

  1. Offer a service_networking_connection data block. Network admins can create the service networking connection along with their VPC since its a VPC-wide configuration. Users can reference the datablock in their CloudSQL configuration instead of trying to create a new one.
  2. Allow the optional breakout of the networking connection config to allow additive ranges (basically the ask in issue 16735. issue 10830 being fixed and available in 6.0 set us up well for this by fixing the default CRUD action.
  3. Update the CloudSQL terraform documentation. The documentation inadvertently suggests that a new service_networking_connection should exist for each CloudSQL instance. It should clearly state that service_networking_connection is a VPC wide resource.

Example

# Keep the current google_service_networking_connection as the CRUD resource
resource "google_service_networking_connection" "default" {
  network                = google_compute_network.peering_network.id
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name]
}

# Have an accompanying data resource 
data "google_service_networking_connection" "default" {
  network                = google_compute_network.peering_network.id
  name                 = "servicenetworking-googleapis-com"
}

# Offer an additive config option
# Having multiple of these blocks will append IP ranges to service networking connection
resource "google_service_networking_connection_config" "default" {
  network                = google_compute_network.peering_network.id
  name                 =  data.google_service_networking_connection.default
  reserved_peering_ranges = [google_compute_global_address.private_ip_alloc.name]
}
mike-callahan commented 2 weeks ago

As an aside, after digging more into this and looking back at the beginning of this issue I don't believe the console uses removePeering. It looks to use servicenetworking.delete:

Screenshot 2024-08-27 at 9 06 35 PM

(Assuming you delete it from the servicenetworking pane)

Screenshot 2024-08-27 at 9 11 29 PM
mike-callahan commented 2 weeks ago

If you delete the VPC peering directly (equivalent of removePeering) (by using the "VPC NETWORK PEERING" pane) instead of using the service networking pane it will "delete" the service networking connection. The connection is still likely there though and this is a GUI bug (might be the cause of https://github.com/hashicorp/terraform-provider-google/issues/15260 and b/305256825?).

Screenshot 2024-08-27 at 9 13 14 PM Screenshot 2024-08-27 at 9 13 35 PM
mike-callahan commented 2 weeks ago

Directly deleting the VPC peering does seem to be an issue as it left CloudSQL in an error state

Screenshot 2024-08-27 at 9 46 22 PM

Interestingly I can add back the peering manually and it accepts

Screenshot 2024-08-27 at 9 52 27 PM

If I then try to re-create the service networking connection after the peering is added manually it claims "success" but nothing happens. Looks to be in a broken state

Screenshot 2024-08-27 at 9 55 54 PM Screenshot 2024-08-27 at 9 55 58 PM
ggtisc commented 2 weeks ago

Is it good to forward this issue, or just leave it as it is for now?

mike-callahan commented 2 weeks ago

Sure thanks

roaks3 commented 2 weeks ago

Yea, let's forward this to the service team so that they can weigh in on the proposed solution (FWIW the 3 changes do make sense to me, and don't look like breaking changes).

@mike-callahan I'm going to unassign you for now, and if you end up working on an implementation to solve this, we can add you back