hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

panic when upgrading state schema of aws_db_instance #12771

Closed ryanking closed 4 years ago

ryanking commented 4 years ago

Terraform Version

$ terraform -v
Terraform v0.12.24
+ provider.archive v1.3.0
+ provider.aws v2.56.0
+ provider.local v1.4.0
+ provider.null v2.1.2
+ provider.random v2.2.1
+ provider.template v2.1.2
+ provider.tls v2.1.1

Affected Resource(s)

aws_db_instance

Terraform Configuration Files

I think there is too much config to replicate here and it seems that the problem is purely related to state management.

Debug Output

https://gist.github.com/ca3825d27e669d6e42c4d3e40f9acaff

Panic Output

The crash is in the AWS provider, not in Terraform itself. The logs above include the provider panic. The relevant log lines are:


panic: assignment to entry in nil map
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: 
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: goroutine 126 [running]:
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: github.com/terraform-providers/terraform-provider-aws/aws.resourceAwsDbInstanceStateUpgradeV0(0x0, 0x5a20720, 0xc000300500, 0xc0000cdc30, 0x524ab40, 0xc000600578)
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/aws/resource_aws_db_instance_migrate.go:382 +0x5c
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: github.com/hashicorp/terraform-plugin-sdk/internal/helper/plugin.(*GRPCProviderServer).upgradeJSONState(0xc000600040, 0x0, 0x0, 0xc000469300, 0xc000600578, 0x0, 0x0)
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/github.com/hashicorp/terraform-plugin-sdk/internal/helper/plugin/grpc_provider.go:395 +0xb5
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: github.com/hashicorp/terraform-plugin-sdk/internal/helper/plugin.(*GRPCProviderServer).UpgradeResourceState(0xc000600040, 0x6eadb20, 0xc000bb0840, 0xc000141240, 0xc000600040, 0xc000bb0840, 0xc000e49a80)
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/github.com/hashicorp/terraform-plugin-sdk/internal/helper/plugin/grpc_provider.go:270 +0x285
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: github.com/hashicorp/terraform-plugin-sdk/internal/tfplugin5._Provider_UpgradeResourceState_Handler(0x63a9bc0, 0xc000600040, 0x6eadb20, 0xc000bb0840, 0xc000c8f140, 0x0, 0x6eadb20, 0xc000bb0840, 0xc000ac9840, 0x19)
2020/04/10 14:17:00 [DEBUG] ReferenceTransformer: "module.anniec-vm-stack.aws_db_instance.db-dev[0]" references: []
2020/04/10 14:17:00 [TRACE] Completed graph transform *terraform.ReferenceTransformer (no changes)
2020/04/10 14:17:00 [TRACE] Executing graph transform *terraform.RootTransformer
2020/04/10 14:17:00 [TRACE] Completed graph transform *terraform.RootTransformer (no changes)
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/github.com/hashicorp/terraform-plugin-sdk/internal/tfplugin5/tfplugin5.pb.go:3117 +0x217
2020/04/10 14:17:00 [TRACE] vertex "module.anniec-vm-stack.aws_db_instance.db-dev": entering dynamic subgraph
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: google.golang.org/grpc.(*Server).processUnaryRPC(0xc0003e6160, 0x6eceb20, 0xc0000a5800, 0xc000caab00, 0xc0005d8c90, 0xa2bb180, 0x0, 0x0, 0x0)
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/google.golang.org/grpc/server.go:995 +0x460
2020/04/10 14:17:00 [TRACE] dag/walk: updating graph
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: google.golang.org/grpc.(*Server).handleStream(0xc0003e6160, 0x6eceb20, 0xc0000a5800, 0xc000caab00, 0x0)
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/google.golang.org/grpc/server.go:1275 +0xd97
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: google.golang.org/grpc.(*Server).serveStreams.func1.1(0xc00015a170, 0xc0003e6160, 0x6eceb20, 0xc0000a5800, 0xc000caab00)
2020/04/10 14:17:00 [TRACE] dag/walk: added new vertex: "module.anniec-vm-stack.aws_db_instance.db-dev[0]"
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/google.golang.org/grpc/server.go:710 +0xbb
2020/04/10 14:17:00 [TRACE] Executing graph transform *terraform.ResourceCountTransformer
2020-04-10T14:17:00.637-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4: created by google.golang.org/grpc.(*Server).serveStreams.func1
2020-04-10T14:17:00.638-0700 [DEBUG] plugin.terraform-provider-aws_v2.56.0_x4:  /opt/teamcity-agent/work/5d79fe75d4460a2f/src/github.com/terraform-providers/terraform-provider-aws/vendor/google.golang.org/grpc/server.go:708 +0xa1
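
For readers unfamiliar with the first line of the trace: in Go, reading from a nil map is safe, but writing to one panics at runtime with exactly this message. The sketch below is a standalone illustration, not provider code; the helper name setKey and the key example_key are made up for the example.

```go
package main

import "fmt"

// setKey writes value into m and returns the panic message if the write
// panicked, or an empty string if it succeeded. The name is illustrative.
func setKey(m map[string]interface{}, key string, value interface{}) (panicMsg string) {
	defer func() {
		// recover turns the runtime panic into a plain string for inspection.
		if r := recover(); r != nil {
			panicMsg = fmt.Sprint(r)
		}
	}()
	m[key] = value
	return ""
}

func main() {
	// Declared but never initialized with make or a literal, so it is nil.
	var nilState map[string]interface{}

	// Reading a missing key from a nil map just yields the zero value.
	fmt.Println("read from nil map is nil:", nilState["example_key"] == nil)

	// Writing to the nil map panics; setKey reports the message.
	fmt.Println("write to nil map:", setKey(nilState, "example_key", true))

	// Writing to an initialized (even if empty) map is fine.
	fmt.Println("write to empty map ok:", setKey(map[string]interface{}{}, "example_key", true) == "")
}
```

This is why an upgrade function that receives a nil state map and tries to set a key on it crashes the plugin process instead of returning an error.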

Expected Behavior

terraform plan completes without crashing.

Actual Behavior

The AWS provider panics during terraform plan.

Steps to Reproduce

  1. terraform plan

Important Factoids

This was an attempt to upgrade the provider from 2.44.0 to 2.56.0.

bflad commented 4 years ago

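Hi @ryanking 👋 Thank you for reporting this and sorry you ran into trouble here. The state upgrade code could definitely be more defensive in this situation by performing a nil check itself (or potentially an upstream fix in the Terraform Plugin SDK to prevent the passing of nil state to the upgrade function), but could you please provide us some additional details if possible?

  • Do you know the history of this particular resource in the Terraform configuration? e.g. when it was created (Terraform AWS Provider version ideally)?
  • Was any configuration update part of updating the provider?
  • Is the remote resource existing now or was the expectation this plan would recreate it?
  • Were you able to work around this scenario? e.g. terraform state rm and terraform import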

Thank you.

ryanking commented 4 years ago

Do you know the history of this particular resource in the Terraform configuration? e.g. when it was created (Terraform AWS Provider version ideally)?

The resource is a few months old. It looks like it was created with version 2.17.0 of the AWS provider.

Was any configuration update part of updating the provider?

No.

Is the remote resource existing now or was the expectation this plan would recreate it?

The resource exists (and has for a while).

Were you able to work around this scenario? e.g. terraform state rm and terraform import

I didn't look into any workarounds other than reverting the provider upgrade.

ryanking commented 4 years ago

I just tried state rm + import on the resource that I thought was the problem and am getting the same error.

I think it is highly likely that I don't know which resource is causing this. Do you have any pointers on figuring out which one it could be?

bflad commented 4 years ago

It'll be an "older" aws_db_instance resource (created prior to Terraform AWS Provider version 2.49.0), but I'm not sure whether the debug logs give more information prior to the execution of the state upgrade functions. You might be able to find out more easily by running Terraform with -parallelism=1.

Dingying0410 commented 4 years ago

Thanks @ryanking, and @bflad for looking into this.

Follow-up: after reverting to provider.aws v2.44.0, running the plan fails with:

Error: Resource instance managed by newer provider version

The current state of module.xxx.aws_db_instance.db-dev[0] was created
by a newer provider version than is currently selected. Upgrade the
registry.terraform.io/-/aws provider to work with this state.

module.xxx.aws_db_instance.db-dev[0] is a aws_db_instance.

ryanking commented 4 years ago

I tried running with -parallelism=1 to find the resource that was causing this issue. Unfortunately, it seems it was all (~30) of the aws_db_instance resources in this component. I did a state rm + import for all of them and we no longer see any errors.

@bflad what would help debug this further? I have snapshots of the state before and after the re-import, if that helps.

bflad commented 4 years ago

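@ryanking very strange. Are any of these true (recently, prior to the original panic)?

  • terraform state mv for the whole module or individual resources
  • Using count/for_each with the resource
  • Terraform 0.11 potentially involved at all?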

Just trying to rule out any odd behaviors that might be found in the state itself, its versioning, or the Terraform Plugin SDK handling (which is the logic responsible for running the resource state upgrades if necessary).

If you have a sanitized copy of the aws_db_instance Terraform configuration of one of these and its associated state prior to the panic and prior to any state rm/import operations, that could be immensely helpful.

Thank you for your information so far!

ryanking commented 4 years ago

  • terraform state mv for the whole module or individual resources

Not that I know of, and I am pretty sure I would know.

  • Using count/for_each with the resource

These databases are all created by a module which uses count to optionally create the database, so count is always 0 or 1.

  • Terraform 0.11 potentially involved at all?

Not any time recently. We use a bunch of 0.12-specific syntax, so I don't think it would be possible to run 0.11 on our code base anymore. However, many of these resources were created before 0.12.

If you have a sanitized copy of the aws_db_instance Terraform configuration of one of these and its associated state prior to the panic and prior to any state rm/import operations, that could be immensely helpful.

Here is a sanitized example.

State before: https://gist.github.com/1ddc1d1edfcdef01d95ae48fea973394
State after: https://gist.github.com/ryanking/d225e2fcfc4b39fc925da6148a006479

For configuration, this is created in a module, and the code for that resource looks like:

resource "aws_db_instance" "db-dev" {
  count                      = var.skip_database ? 0 : 1
  identifier                 = "${var.db_fake_data ? "db-${var.username}" : "db-${var.username}-date-${var.db_date}-num-${var.db_override_num}"}"
  storage_type               = "gp2"
  engine                     = "postgres"
  engine_version             = "11.4"
  instance_class             = var.aws_db_instance_type
  port                       = 5432
  publicly_accessible        = false
  availability_zone          = var.aws_db_instance_availability_zone
  security_group_names       = []
  vpc_security_group_ids     = ["${var.db_security_group_id}"]
  db_subnet_group_name       = var.db_subnet_group_name
  parameter_group_name       = var.db_parameter_group_name
  auto_minor_version_upgrade = false
  multi_az                   = false
  backup_retention_period    = 0
  backup_window              = "10:10-10:40"
  maintenance_window         = "sat:16:00-sat:20:00"
  storage_encrypted          = true
  skip_final_snapshot        = true
  snapshot_identifier = "${var.db_fake_data ? "arn:aws:rds:us-west-2:950587841421:snapshot:fake-data-allocated-1000gb"
  : "arn:aws:rds:us-west-2:950587841421:snapshot:traject-${var.db_date}"}"
  ca_cert_identifier = "rds-ca-2019"

  lifecycle {
    create_before_destroy = true
  }

  apply_immediately = var.aws_db_instance_apply_immediately

  tags = {
    "Type"  = "Dev Database"
    "Owner" = var.username
  }
}

bflad commented 4 years ago

While we may not know the exact cause, the fix for the panic (returning an empty state when given an empty state) has been merged and will be released in version 2.69.0 of the Terraform AWS Provider on Thursday next week. 👍

ghost commented 4 years ago

This has been released in version 2.69.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!
