hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform
https://registry.terraform.io/providers/hashicorp/google/latest/docs
Mozilla Public License 2.0

google_sql_database_instance: Error creating resources using Private IPs in parallel. #3069

Open yuvaldrori opened 5 years ago

yuvaldrori commented 5 years ago

Community Note

Terraform Version

Terraform v0.11.11

Affected Resource(s)

Terraform Configuration Files

provider "google" {
  region = "${var.region}"
}

provider "google-beta" {
  region = "${var.region}"
}

variable "region" {
  default = "us-central1"
}

variable "org_id" {
  default = "*****"
}

variable "billing_account" {
  default = "*******"
}

variable "count" {
  default = 2
}

resource "random_id" "project" {
  byte_length = 4
  prefix      = "test-tf-project-"
}

resource "google_project" "project" {
  name                = "Test Terraform Project"
  project_id          = "${random_id.project.hex}"
  org_id              = "${var.org_id}"
  auto_create_network = false
  billing_account     = "${var.billing_account}"
}

resource "google_project_service" "networking" {
  project                    = "${google_project.project.project_id}"
  service                    = "servicenetworking.googleapis.com"
  disable_on_destroy         = false
  disable_dependent_services = true
}

resource "google_compute_network" "network" {
  description             = "Network"
  name                    = "test-network"
  auto_create_subnetworks = "false"
  project                 = "${google_project.project.project_id}"
}

resource "google_compute_global_address" "private_ip_alloc" {
  provider      = "google-beta"
  name          = "private-ip-alloc"
  purpose       = "VPC_PEERING"
  address_type  = "INTERNAL"
  prefix_length = 16
  network       = "${google_compute_network.network.self_link}"
  project       = "${google_project_service.networking.project}"
}

resource "google_service_networking_connection" "connection" {
  provider                = "google-beta"
  network                 = "${google_compute_network.network.self_link}"
  service                 = "servicenetworking.googleapis.com"
  reserved_peering_ranges = ["${google_compute_global_address.private_ip_alloc.name}"]
}

resource "random_id" "master" {
  byte_length = 4
  prefix      = "master-"
}

resource "google_sql_database_instance" "master" {
  count            = "${var.count}"
  name             = "${random_id.master.hex}-${count.index}"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  settings {
    tier      = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}

Debug Output

https://gist.github.com/yuvaldrori/034fd15acff47edf83af77dea885fa36

Panic Output

Expected Behavior

All resources should have been created successfully. If you change the variable count = 1 it will succeed.

Actual Behavior

Only one CloudSQL gets created successfully.

Steps to Reproduce

  1. terraform apply

Important Factoids

Tried similar script with one CloudSQL and one GKE cluster and many GKE clusters with private IPs - same results.

References

b/261385017

chrisst commented 5 years ago

I have tried a couple of different times and been unable to reproduce this failure. Also, based on the API responses in your debug log, I can't see what the exact failure was because the Operation polling just shows "code": "UNKNOWN". Are you still able to consistently reproduce this error? If so, can you look in the cloud console UI and see if there is a more detailed reason that the SQL instance failed to create?

yuvaldrori commented 5 years ago

@chrisst I just ran the tf script again: one machine succeeded and the other failed with the same unknown error. The UI says exactly the same: "An unknown error occurred". I did open a ticket with Google support and they asked me to see if I can test it with a gcloud command - I was not able to repro with gcloud. The bash script I used:

#!/bin/bash

for i in 774yhf5 59swec6; do
  gcloud beta sql instances create gcloud-test-$i --async --database-version MYSQL_5_7 --tier db-f1-micro --region us-central1 --network network --project=some-project-name &
done

chrisst commented 5 years ago

I'm still unable to reproduce your error with terraform, but using the UI I was unable to modify or delete multiple peering routes because of the error: "There is a peering operation in progress on the local or peer network. Try again later." It sounds like this could be what is happening with your config. Can you check http://console.cloud.google.com/networking/peering/list and http://console.cloud.google.com/networking/routes/list to see if there is a similar error on any of those automatically generated resources?

If there is, we should be able to tweak the lock on SQL operations to account for the peering operations as well. It won't solve cross-resource contention (SQL + GKE) but it should fix the count-based SQL instances.

yuvaldrori commented 5 years ago

Sorry for the late reply, I just set up notifications. Anyway, I ran the TF script again and saw the same errors, and when I checked the routes and peering lists everything looked OK:

gcloud alpha services vpc-peerings list --project test-tf-project-e93f7cc2 --network test-network
---
network: projects/221637507821/global/networks/test-network
peering: servicenetworking-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com
---
network: projects/221637507821/global/networks/test-network
peering: cloudsql-mysql-googleapis-com
reservedPeeringRanges:
- private-ip-alloc
service: services/servicenetworking.googleapis.com
gcloud compute routes list --project test-tf-project-e93f7cc2 
NAME                            NETWORK       DEST_RANGE       NEXT_HOP                       PRIORITY
default-route-28a3d65e45473cb3  test-network  172.20.181.0/24  test-network                   1000
default-route-b6db25e614d13793  test-network  0.0.0.0/0        default-internet-gateway       1000
peering-route-339f952832fab934  test-network  192.168.0.0/24   cloudsql-mysql-googleapis-com  1000

I don't get how come I can get this error every time and you cannot - what can be different in our setup?

chrisst commented 5 years ago

OK, I was able to reproduce this on a consistent basis by tearing down and spinning up the project and network connections at the same time. I was hoping this was something that could be controlled by locking SQL instance operations based on the project name, but I don't think that's the case. At this point, since it's only reproducible when other non-SQL resources are being created, it's not possible to identify this situation from within Terraform, so I'm not sure there's a good way to guard against it. I'll try to get a bug updated on the SQL API to see if they can provide retries or a better error for this.

chrisst commented 5 years ago

Update - I've been talking with the private networking team and they are working on a fix for this. They let me know that this is happening because there is an entry that gets set up the first time any private networking feature is turned on for a project/network. Creating the 2 instances at the same time causes a collision setting up this singleton, so if you are able to set up 1 resource that uses private networking before creating others in parallel you should be able to work around this issue.
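The workaround described above can be sketched against the configuration from the original report. This is a minimal, untested sketch in Terraform 0.11 syntax: the `seed` resource and its name suffix are hypothetical additions that are not part of the original config, and it assumes that creating one private-IP instance on its own is enough to complete the one-time setup before the count-based instances start.

```hcl
# Hypothetical "seed" instance (not in the original config). It is created
# alone, so the one-time private networking setup for this project/network
# is not raced by several instances at once.
resource "google_sql_database_instance" "seed" {
  name             = "${random_id.master.hex}-seed"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  settings {
    tier = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}

# The count-based instances from the report, now forced to wait for the seed.
resource "google_sql_database_instance" "master" {
  count            = "${var.count}"
  name             = "${random_id.master.hex}-${count.index}"
  database_version = "MYSQL_5_7"
  region           = "${var.region}"
  project          = "${google_project.project.project_id}"

  # Terraform 0.11 takes quoted addresses here; 0.12 and later take
  # unquoted references instead.
  depends_on = ["google_sql_database_instance.seed"]

  settings {
    tier = "db-f1-micro"

    ip_configuration {
      private_network = "${google_service_networking_connection.connection.network}"
    }
  }
}
```

The explicit `depends_on` only orders the SQL instances relative to each other. It does not cover other private-networking resources (for example GKE clusters) created in the same run, which would need their own dependency on the seed.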

bantalon commented 5 years ago

Suffering from this issue as well. I posted a question and workaround in Stack Overflow: https://stackoverflow.com/questions/55990713/how-to-fix-an-unknown-error-occurred-when-creating-multiple-google-cloud-sql-i/55991852#55991852.

ctrox commented 5 years ago

Can we get an update on this? We are running into it regularly when setting up databases for multiple environments and we have to do two separate terraform runs to work around this. The delay workaround does not really work in our case as we are using a module for cloudsql and you cannot have one module wait on the other (at least not in a simple non-hacky way).

iniinikoski commented 4 years ago

Hi, any updates on this...?

chrisst commented 4 years ago

Sorry no update at this point. The upstream team is still working on it and I'll update if I see that anything has been resolved.

curtbushko commented 4 years ago

+1 for looking for a fix for this.

gabriel8fm commented 4 years ago

+1 for looking for a fix for this.

mattseymour commented 3 years ago

Is there an update on this? I have been trying to debug an issue which turned out to be this.

At a minimum, can a note be placed in the docs alerting devs to this shortcoming?

djboboch commented 1 year ago

Hi team, any updates? This issue has been open for 4 years and apparently is still relevant to this day :)

rileykarson commented 1 year ago

@djboboch would you be able to share debug logs from a failing run? (Sanitized, since they'll be in public). This has been around a while, so I'd like to ensure we're still talking about the same issue.

melinath commented 6 months ago

It sounds like this is not SQL-specific, but may be related to private networking - so it should probably go to whichever team owns that.