Failure reflected in Terraform: Error waiting for Delete Instance: couldn't find resource (21 retries) for Cloud SQL instances

aditya-facets commented 1 year ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.
If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

hashicorp/google-beta v4.33.0
hashicorp/google v4.33.0
Terraform v1.0.11 on linux_amd64

Affected Resource(s)

google_sql_database_instance

Expected Behavior

Destroying the Cloud SQL MySQL/PostgreSQL resource should complete cleanly and update the Terraform state accordingly.

Actual Behavior

Destroying a Cloud SQL instance with 2 replica-readers does remove the instances, as the console confirms. However, Terraform does not register the deletion of the replicas; instead it fails with the error message Error, failed to delete instance <XYZ>-replicareader-0: Error waiting for Delete Instance: couldn't find resource (21 retries)

The console no longer shows the replica instances; only the master instance shows up.

Steps to Reproduce

Use the module https://github.com/terraform-google-modules/terraform-google-sql-db/tree/v13.0.1/modules/postgresql to initialise Cloud SQL for PostgreSQL / MySQL.

Use a configuration with 2 replica-readers and minimal input.
Apply via Terraform.
Initiate a destroy.

Important Factoids

The error Error waiting for Delete Instance: couldn't find resource (21 retries) is intermittent: the behaviour shows up at random. Sometimes the destroy succeeds, but most of the time it fails.

References

b/299600745

edwardmedia commented 1 year ago

@aditya-facets since we don't own modules, can you repro the issue with resources? If yes, can you share the config?

legojesus commented 1 year ago

I'm also experiencing this issue. Here's my setup:

Main.tf:

resource "google_sql_database_instance" "main" {
  for_each            = var.sql_instances
  name                = each.value.sql_instance_name
  database_version    = each.value.db_version
  root_password       = each.value.root_password
  deletion_protection = each.value.deletion_protection

  settings {
    tier              = each.value.tier
    availability_type = each.value.availability_type
    disk_autoresize   = each.value.disk_autoresize
    disk_size         = each.value.disk_size

    ip_configuration {
      ipv4_enabled       = each.value.ipv4_enabled
      private_network    = each.value.ipv4_enabled == false ? "projects/${var.project_id}/global/networks/${var.vpc_name}" : null
      allocated_ip_range = each.value.ipv4_enabled == false ? "${var.vpc_name}-private-ip" : null
      require_ssl        = each.value.require_ssl
    }

    user_labels = each.value.user_labels

    backup_configuration {
      enabled                        = each.value.backup_enabled
      binary_log_enabled             = each.value.binary_log_enabled
      transaction_log_retention_days = each.value.transaction_log_retention_days

      backup_retention_settings {
        retention_unit   = each.value.retention_unit
        retained_backups = each.value.retained_backups
      }
    }
  }

}

resource "google_sql_database_instance" "read_replica" {
  depends_on           = [google_sql_database_instance.main]
  for_each             = var.sql_replicas
  name                 = each.value.replica_name
  database_version     = each.value.db_version
  root_password        = each.value.root_password
  deletion_protection  = each.value.deletion_protection
  master_instance_name = each.value.master_instance

  settings {
    tier              = each.value.tier
    availability_type = each.value.availability_type
    disk_autoresize   = each.value.disk_autoresize
    disk_size         = each.value.disk_size

    ip_configuration {
      ipv4_enabled       = each.value.ipv4_enabled
      private_network    = "projects/${var.project_id}/global/networks/${var.vpc_name}"
      allocated_ip_range = "${var.vpc_name}-private-ip"
    }

  }
}

Variables.tf:

variable "project_id" {
  description = "The name of GCP project."
  type        = string
  default     = "app-prod"
}

variable "region" {
  description = "The name of GCP region."
  type        = string
  default     = "me-west1"
}

variable "vpc_name" {
  description = "The name of the VPC in the project."
  type        = string
  default     = "app-prod-vpc"
}

variable "sql_instances" {
  description = "A map of SQL instances to deploy"
  type = map(object({
    sql_instance_name              = string
    db_version                     = string
    root_password                  = string
    deletion_protection            = bool
    user_labels                    = map(string)
    tier                           = string
    availability_type              = string
    disk_autoresize                = bool
    disk_size                      = number
    ipv4_enabled                   = bool
    require_ssl                    = bool
    backup_enabled                 = bool
    binary_log_enabled             = bool
    transaction_log_retention_days = number
    retention_unit                 = string
    retained_backups               = number
    db_name                        = string
    charset                        = string
    collation                      = string
  }))
  default = {}
}

variable "sql_replicas" {
  description = "A map of SQL read replicas to deploy"
  type = map(object({
    replica_name        = string
    master_instance     = string
    db_version          = string
    root_password       = string
    deletion_protection = bool
    tier                = string
    availability_type   = string
    disk_autoresize     = bool
    disk_size           = number
    ipv4_enabled        = bool
  }))
  default = {}
}

Terraform.tfvars:

sql_instances = {

  app_test = {
    sql_instance_name   = "app-test"
    db_version          = "MYSQL_5_7"  // MYSQL_5_6, MYSQL_5_7, MYSQL_8_0
    root_password       = "NewPass123"
    deletion_protection = false
    user_labels = {
      "env" = "prod"
      "app" = "test"
    }
    # Instance settings:
    tier              = "db-custom-1-3840"
    availability_type = "REGIONAL"         // REGIONAL or ZONAL. REGIONAL will make it High-availability.
    disk_autoresize   = true               // Automatically scale up hard drive when space runs out
    disk_size         = 10                 // Size in GB.

    # IP config:
    ipv4_enabled = false // Whether or not to create a public IP for this instance.
    require_ssl  = false

    # Backup config:
    backup_enabled                 = true
    binary_log_enabled             = true
    transaction_log_retention_days = 7
    retention_unit                 = "COUNT"
    retained_backups               = 7

    # DB of the main instance:
    db_name   = "app_db"
    charset   = "utf8"            // https://dev.mysql.com/doc/refman/5.7/en/charset-charsets.html
    collation = "utf8_general_ci" // https://dev.mysql.com/doc/refman/5.7/en/charset-charsets.html
  },
}

### Read replicas ###
sql_replicas = {

  app-test-replica = {
    replica_name        = "app-test-replica"
    master_instance     = "app-test" // The source SQL instance to replicate. Must match the name of the main instance.
    db_version          = "MYSQL_5_7" 
    root_password       = "NewPass123"
    deletion_protection = false       
    tier                = "db-custom-1-3840"
    availability_type   = "REGIONAL"         // REGIONAL or ZONAL
    disk_autoresize     = true               // Automatically scale up hard drive when space runs out
    disk_size           = 10                 // Size in GB.
    ipv4_enabled        = false              // Whether or not to create a public IP for this instance.
  },

  app-test-replica2 = {
    replica_name        = "app-test-replica2"
    master_instance     = "app-test"
    db_version          = "MYSQL_5_7"
    root_password       = "NewPass123"
    deletion_protection = false
    tier                = "db-custom-1-3840"
    availability_type   = "REGIONAL"
    disk_autoresize     = true
    disk_size           = 10
    ipv4_enabled        = false
  },
}

This applies successfully but when destroying, the 2nd replica does get destroyed, but terraform doesn't seem to receive a success status code so it keeps trying and then it just can't find the resource anymore: Error: Error, failed to delete instance app-test-replica2: Error waiting for Delete Instance: couldn't find resource (21 retries)

When performing a 2nd destroy, all goes well and the master and the other replica get destroyed properly.

edwardmedia commented 1 year ago

@aditya-facets I have tried the config like yours and it fails to run. Are you able to make it simpler?

legojesus commented 1 year ago

@edwardmedia Were you referring to the example I provided (accidentally tagging the original poster instead of me)?

edwardmedia commented 1 year ago

@legojesus you are right. My question was intended for you.

I have managed to set up your config, and my testing turned out fine. Your config does declare the dependency on the master for both replicas, and during the destroy I do see both replicas deleted before the master. I don't know what happened in your case. Do you have a debug log to share so I can take a closer look?

legojesus commented 1 year ago

Thanks for the info @edwardmedia .

I did not get the log, but a few days ago I happened to deploy a much simpler version, and the issue occurred again:

1. Deploy the following main.tf:


resource "google_sql_database_instance" "main" {
  name                = "main-test"
  database_version    = "MYSQL_5_7"
  deletion_protection = false

  settings {
    tier = "db-f1-micro"

    backup_configuration {
      enabled                        = true
      binary_log_enabled             = true
      transaction_log_retention_days = 7

      backup_retention_settings {
        retention_unit   = "COUNT"
        retained_backups = 7
      }
    }
  }
}

resource "google_sql_database_instance" "read_replica" {
  depends_on           = [google_sql_database_instance.main]
  name                 = "replica1"
  database_version     = "MYSQL_5_7"
  master_instance_name = google_sql_database_instance.main.name
  deletion_protection  = false

  settings {
    tier = "db-f1-micro"
  }
}

resource "google_sql_database_instance" "read_replica2" {
  depends_on           = [google_sql_database_instance.main]
  name                 = "replica2"
  database_version     = "MYSQL_5_7"
  master_instance_name = google_sql_database_instance.main.name
  deletion_protection  = false

  settings {
    tier = "db-f1-micro"
  }
}



2. Destroy after deployment is completed. 

You might want to change the machine types of the instances: that little deployment took me an hour, and I suspect the machine type has something to do with it.

The error will show up after about 10 minutes of trying to destroy the deployment. Another destroy action after the error will immediately delete the main instance without a problem.

sdif commented 1 year ago

Hello, had the same issue running Google provider 4.53.1 and terraform v1.3.7:

╷
│ Error: Error, failed to delete instance <replica1>: Error waiting for Delete Instance: couldn't find resource (21 retries)
│
│
╵
╷
│ Error: Error, failed to delete instance <replica2>: Error waiting for Delete Instance: couldn't find resource (21 retries)
│

Note that the replicas are deleted from our GCP project but terraform returned this error

edwardmedia commented 1 year ago

Looking into it.

edwardmedia commented 1 year ago

@legojesus Using your config, I have tried 5 times, none hit the same error.

Based on the error and the behavior, it appears the API failed to return DONE properly. I am not sure what caused that in your case. Do you want to share the full debug log so I can take a look?

The default timeout for delete is 30 minutes. Just curious, did you see the error after 30 minutes? The error is different when the timeout is hit.
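
For context, the delete timeout mentioned here is configurable through the resource's timeouts block. The snippet below is a minimal sketch, not a fix: it reuses names from the simplified repro in this thread (replica1, main-test, db-f1-micro), and the 30m value simply restates the default quoted above. Because the reported error appears after roughly 10 minutes, well short of the timeout, changing this value probably would not affect it.

```hcl
# Sketch only: names and values are taken from the repro config in this thread.
resource "google_sql_database_instance" "read_replica" {
  name                 = "replica1"
  database_version     = "MYSQL_5_7"
  master_instance_name = "main-test" # name of an existing primary instance
  deletion_protection  = false

  settings {
    tier = "db-f1-micro"
  }

  # Upper bound on how long the provider waits for the delete operation.
  # "30m" restates the default quoted in this comment; it is not a tuned value.
  timeouts {
    delete = "30m"
  }
}
```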

legojesus commented 1 year ago

@edwardmedia Thanks for testing again. My terraform is 1.3.7 and I can still reproduce this on demand. Here's the latest log (rather long): log.txt

The delete doesn't take 30 minutes. After around 3-4 minutes of destroying the replicas, it starts getting the following response (according to the log):

{
  "error": {
    "code": 404,
    "message": "The Cloud SQL instance operation does not exist.",
    "errors": [
      {
        "message": "The Cloud SQL instance operation does not exist.",
        "domain": "global",
        "reason": "operationDoesNotExist"
      }
    ]
  }
}

After 10 minutes of trying, it gives up and throws the error mentioned in this discussion.

edwardmedia commented 1 year ago

@legojesus looking at the section below, the API seems to behave a little oddly. If the delete operation is complete, shouldn't it return DONE instead of operationDoesNotExist?

GET /sql/v1beta4/projects/test-prod/operations/d770bd46-c9cc-47c7-a06f-9c3900000053?alt=json&prettyPrint=false HTTP/1.1
Host: sqladmin.googleapis.com
User-Agent: google-api-go-client/0.5 Terraform/1.3.7 (+https://www.terraform.io) Terraform-Plugin-SDK/2.10.1 terraform-provider-google/dev
X-Goog-Api-Client: gl-go/1.18.1 gdcl/0.82.0
Accept-Encoding: gzip

-----------------------------------------------------: timestamp=2023-04-10T09:35:32.094+0300
2023-04-10T09:35:32.560+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:32 [DEBUG] Google API Response Details:
---[ RESPONSE ]--------------------------------------
HTTP/2.0 200 OK
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Cache-Control: private
Content-Type: application/json; charset=UTF-8
Date: Mon, 10 Apr 2023 06:35:32 GMT
Server: ESF
Vary: Origin
Vary: X-Origin
Vary: Referer
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0

{
 "kind": "sql#operation",
 "targetLink": "https://sqladmin.googleapis.com/sql/v1beta4/projects/test-prod/instances/test-reader2",
 "status": "RUNNING",
 "user": "test@test.com",
 "insertTime": "2023-04-10T06:34:07.382Z",
 "startTime": "2023-04-10T06:34:07.544Z",
 "operationType": "DELETE",
 "name": "d770bd46-c9cc-47c7-a06f-9c3900000053",
 "targetId": "test-reader2",
 "selfLink": "https://sqladmin.googleapis.com/sql/v1beta4/projects/test-prod/operations/d770bd46-c9cc-47c7-a06f-9c3900000053",
 "targetProject": "test-prod"
}
-----------------------------------------------------: timestamp=2023-04-10T09:35:32.560+0300
2023-04-10T09:35:32.560+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:32 [DEBUG] Retry Transport: Stopping retries, last request was successful: timestamp=2023-04-10T09:35:32.560+0300
2023-04-10T09:35:32.560+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:32 [DEBUG] Retry Transport: Returning after 1 attempts: timestamp=2023-04-10T09:35:32.560+0300
2023-04-10T09:35:32.561+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:32 [DEBUG] Got RUNNING while polling for operation d770bd46-c9cc-47c7-a06f-9c3900000053's status: timestamp=2023-04-10T09:35:32.560+0300
2023-04-10T09:35:32.561+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:32 [TRACE] Waiting 10s before next try: timestamp=2023-04-10T09:35:32.560+0300
module.sql_db[0].google_sql_database_instance.read_replica["test-db-reader2"]: Still destroying... [id=test-reader2, 7m20s elapsed]
2023-04-10T09:35:42.564+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:42 [DEBUG] Waiting for state to become: [success]: timestamp=2023-04-10T09:35:42.564+0300
2023-04-10T09:35:42.565+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:42 [DEBUG] Retry Transport: starting RoundTrip retry loop: timestamp=2023-04-10T09:35:42.565+0300
2023-04-10T09:35:42.565+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:42 [DEBUG] Retry Transport: request attempt 0: timestamp=2023-04-10T09:35:42.565+0300
2023-04-10T09:35:42.565+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:42 [DEBUG] Google API Request Details:
---[ REQUEST ]---------------------------------------
GET /sql/v1beta4/projects/test-prod/operations/d770bd46-c9cc-47c7-a06f-9c3900000053?alt=json&prettyPrint=false HTTP/1.1
Host: sqladmin.googleapis.com
User-Agent: google-api-go-client/0.5 Terraform/1.3.7 (+https://www.terraform.io) Terraform-Plugin-SDK/2.10.1 terraform-provider-google/dev
X-Goog-Api-Client: gl-go/1.18.1 gdcl/0.82.0
Accept-Encoding: gzip

-----------------------------------------------------: timestamp=2023-04-10T09:35:42.565+0300
2023-04-10T09:35:45.010+0300 [INFO]  provider.terraform-provider-google_v4.33.0_x5: 2023/04/10 09:35:45 [DEBUG] Google API Response Details:
---[ RESPONSE ]--------------------------------------
HTTP/2.0 404 Not Found
Alt-Svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
Cache-Control: private
Content-Type: application/json; charset=UTF-8
Date: Mon, 10 Apr 2023 06:35:44 GMT
Server: ESF
Vary: Origin
Vary: X-Origin
Vary: Referer
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-Xss-Protection: 0

{
  "error": {
    "code": 404,
    "message": "The Cloud SQL instance operation does not exist.",
    "errors": [
      {
        "message": "The Cloud SQL instance operation does not exist.",
        "domain": "global",
        "reason": "operationDoesNotExist"
      }
    ]
  }
}

edwardmedia commented 1 year ago

b/278307339

legojesus commented 1 year ago

@edwardmedia You are correct, it should return DONE but for some reason it just retries until it throws the error.

Do you require any other info from me/my setup?

SamuelMolling commented 1 year ago

same error here

SamuelMolling commented 1 year ago

@legojesus Is there a workaround?

legojesus commented 1 year ago

@SamuelMolling Unfortunately no. The only way around this is to perform a 2nd terraform destroy operation, which then works well.

SamuelMolling commented 1 year ago

That's what we've been doing, but is there any way to fix it? Is anyone working on it?

legojesus commented 1 year ago

I'm just a user like you, so I have no answer. @edwardmedia What does the "Upstream" label you've added do? Is this going to be addressed in the near future? Thank you.

SamuelMolling commented 1 year ago

I use Terragrunt and worked around it with auto-retry. Just a tip 😀
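
For anyone wanting to try the same approach, Terragrunt can re-run a failed command when its output matches a pattern. The following terragrunt.hcl fragment is a sketch under assumptions: the commenter did not share their settings, so the regex, attempt count, and sleep interval are illustrative, and newer Terragrunt releases may prefer a different retry syntax (check the docs for your version).

```hcl
# terragrunt.hcl (sketch; all values below are illustrative assumptions)

# Re-run the command when the output contains the error from this issue.
retryable_errors = [
  "(?s).*Error waiting for Delete Instance: couldn't find resource.*",
]

# Cap on how many times Terragrunt tries the command.
retry_max_attempts = 2

# Seconds to wait between attempts.
retry_sleep_interval_sec = 30
```

Setting retryable_errors replaces Terragrunt's built-in list of retryable errors, so re-add any defaults you still rely on. The retry automates the manual workaround described earlier in the thread, where a second destroy succeeds.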

hashicorp / terraform-provider-google