hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws

AWS Neptune Cluster from snapshot #23601

Closed: kastlbo closed this issue 1 year ago

kastlbo commented 2 years ago

Community Note

Terraform CLI and Terraform AWS Provider Version

Terraform v1.1.6, AWS provider v3.74.4 (we use this provider version because we are having issues with S3 buckets with the newer provider).

Affected Resource(s)

aws_neptune_cluster

Terraform Configuration Files

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

I can't include the configuration, as I work for a company that would consider sharing it a security violation.

Debug Output

Panic Output

Expected Behavior

A Neptune cluster should be created from a snapshot when the snapshot_identifier argument is set.
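
For illustration, a minimal sketch of the intended usage (identifiers are hypothetical, since I can't share our real configuration):

resource "aws_neptune_cluster" "example" {
  cluster_identifier  = "example-restored-cluster"
  engine              = "neptune"
  # Restore from an existing cluster snapshot; the snapshot name is hypothetical.
  snapshot_identifier = "example-cluster-snapshot"
  skip_final_snapshot = true
}

With snapshot_identifier set, the cluster should simply come up restored from that snapshot.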

Actual Behavior

Terraform errors out and won't create the cluster:

Cannot modify engine version without a healthy primary instance in DB cluster: sf-conncen-test-c3-graph-database
│   status code: 400, request id: 577cca5d-39af-4748-88eb-35980f6c0fa5
│ 
│   with module.c3_graph_database_neptune_cluster.aws_neptune_cluster.neptune_cluster[0],
│   on .terraform/modules/c3_graph_database_neptune_cluster/modules/neptune/cluster/main.tf line 3, in resource "aws_neptune_cluster" "neptune_cluster":
│    3: resource "aws_neptune_cluster" "neptune_cluster" {

Steps to Reproduce

We use a custom module built off of the resource listed above. I am having trouble creating a Neptune cluster from a snapshot. Each time I run with the snapshot identifier set, it fails with:

Cannot modify engine version without a healthy primary instance in DB cluster: sf-conncen-test-c3-graph-database
│   status code: 400, request id: 577cca5d-39af-4748-88eb-35980f6c0fa5
│ 
│   with module.c3_graph_database_neptune_cluster.aws_neptune_cluster.neptune_cluster[0],
│   on .terraform/modules/c3_graph_database_neptune_cluster/modules/neptune/cluster/main.tf line 3, in resource "aws_neptune_cluster" "neptune_cluster":
│    3: resource "aws_neptune_cluster" "neptune_cluster" {

The documentation for this resource says that you can create the cluster from a snapshot by setting the snapshot_identifier argument, but it doesn't work as advertised. I have tested the Terraform without the identifier and the cluster is created without any problems. I have read the issue board and I don't see anyone else with this problem. I did see some discussion about encryption being the problem, but it didn't really apply to Neptune. To reproduce, create a snapshot and try to have Terraform create a cluster from it; a sketch is below.
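
A sketch of the reproduction setup (identifiers are hypothetical): take a snapshot of an existing, healthy cluster, point snapshot_identifier of a new aws_neptune_cluster at it (as in the sketch under Expected Behavior), and apply.

resource "aws_neptune_cluster_snapshot" "repro" {
  # Identifier of an existing, healthy Neptune cluster (hypothetical).
  db_cluster_identifier          = "existing-neptune-cluster"
  db_cluster_snapshot_identifier = "repro-cluster-snapshot"
}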

  1. terraform apply

Important Factoids

All Terraform is run in a CI/CD pipeline using TFE.

References

bschaatsbergen commented 2 years ago

Going for a short holiday, back on Monday, but I'll happily look into this.

georgedivya commented 2 years ago

I'm also having issues with creating a cluster from the snapshot. I'm recreating the cluster from a snapshot and it fails on cluster creation with the following error:

DBClusterRoleAlreadyExists: Role ARN <role arn> is already associated with DB Cluster: Verify your role ARN and try again.

The same code works without specifying a snapshot identifier.

jaw111 commented 2 years ago

@kastlbo given the error you reported, it would be worthwhile to check the Neptune engine version used for the snapshot and try using the same engine version in the new cluster.
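
For example, a hedged sketch of pinning the restored cluster to the snapshot's engine version (the identifiers and the version string are placeholders; check the snapshot's actual engine version in the Neptune console or via the DescribeDBClusterSnapshots API first):

resource "aws_neptune_cluster" "restored" {
  cluster_identifier  = "restored-from-snapshot"
  engine              = "neptune"
  # Must match the engine version recorded on the snapshot.
  engine_version      = "1.0.5.1"
  snapshot_identifier = "my-neptune-cluster-snapshot"
  skip_final_snapshot = true
}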

wolli-lenzen commented 2 years ago

Same problem here: when I try to create a Neptune cluster from an existing snapshot, it fails with the error

"InvalidDBClusterStateFault: Cannot modify engine version without a healthy primary instance in DB cluster:"

The engine version of the snapshot and the engine version of the newly created cluster are definitely the same.

wolli-lenzen commented 2 years ago

@bschaatsbergen, @justinretzolk do you see any chance of getting this fixed soon? It has been identified as a bug since March and is stopping us from writing and testing disaster recovery code for the DB.

justinretzolk commented 2 years ago

Hey @wolli-lenzen 👋 Thank you for checking in on this. Unfortunately, I'm not able to provide an estimate on when this will be looked into due to the potential of shifting priorities (we prioritize work by the count of ":+1:" reactions, as well as a few other things). For more information on how we prioritize, check out our prioritization guide.

nikunjundhad commented 2 years ago

Terraform version: 1.2.3, AWS provider version: 4.26.0, and this is still reproducible. It is a deadlock for our disaster recovery: with this behaviour we can't recover our DB cluster if something goes wrong, and in that case we can't rely on Terraform for our DB management. FYI, a similar issue is observed for RDS aws_db_cluster_snapshot as well, where cluster creation from a snapshot also does not come up healthy.

aws_neptune_cluster.neptune-db-n: Still creating... [16m41s elapsed]
aws_neptune_cluster.neptune-db-n: Still creating... [16m51s elapsed]
╷
│ Error: Failed to modify Neptune Cluster (ae-sbx-neptune-cluster-new): InvalidDBClusterStateFault: Cannot modify engine version without a healthy primary instance in DB cluster: ae-sbx-neptune-cluster-new
│   status code: 400, request id: 89261a6a-3ee7-4406-807e-24e0e02b4523
│
│   with aws_neptune_cluster.neptune-db-n,
│   on neptune-cluster.tf line 37, in resource "aws_neptune_cluster" "neptune-db-n":
│   37: resource "aws_neptune_cluster" "neptune-db-n" {
│
╵

---- update ---- When I removed engine_version from the resource, it successfully created a new cluster with the latest available version; however, our old DB cluster is running the older version and the snapshot also shows the older version. When I added engine_version back, it started failing with a similar error again. So the issue is definitely triggered when we specify the version.
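
A sketch of the shape that worked for me (other arguments omitted for brevity):

resource "aws_neptune_cluster" "neptune-db-n" {
  cluster_identifier  = "ae-sbx-neptune-cluster-new"
  engine              = "neptune"
  # Workaround: engine_version is NOT set. The restored cluster then comes up
  # on the latest available engine version rather than the snapshot's version,
  # but the create succeeds. Adding engine_version back reproduces the
  # "Cannot modify engine version" failure.
  snapshot_identifier = aws_neptune_cluster_snapshot.snapshot-18Aug.id
  skip_final_snapshot = true
  apply_immediately   = true
}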

After the above error, the cluster shows in the AWS console with status Available. Another very weird thing is the state after the above terraform apply: below is the state for the resource named neptune-db-n, and many required field values are null, for example arn.

{
      "mode": "managed",
      "type": "aws_neptune_cluster",
      "name": "neptune-db-n",
      "provider": "provider[\"registry.terraform.io/hashicorp/aws\"]",
      "instances": [
        {
          "status": "tainted",
          "schema_version": 0,
          "attributes": {
            "allow_major_version_upgrade": null,
            "apply_immediately": true,
            "arn": null,
            "availability_zones": [
              "us-east-1a",
              "us-east-1b",
              "us-east-1c"
            ],
            "backup_retention_period": 5,
            "cluster_identifier": "ae-sbx-neptune-cluster-new",
            "cluster_identifier_prefix": null,
            "cluster_members": [],
            "cluster_resource_id": null,
            "copy_tags_to_snapshot": false,
            "deletion_protection": null,
            "enable_cloudwatch_logs_exports": null,
            "endpoint": null,
            "engine": "neptune",
            "engine_version": "1.0.5.1",
            "final_snapshot_identifier": null,
            "hosted_zone_id": null,
            "iam_database_authentication_enabled": false,
            "iam_roles": null,
            "id": "ae-sbx-neptune-cluster-new",
            "kms_key_arn": null,
            "neptune_cluster_parameter_group_name": "default.neptune1",
            "neptune_subnet_group_name": null,
            "port": 8182,
            "preferred_backup_window": "07:00-09:00",
            "preferred_maintenance_window": null,
            "reader_endpoint": null,
            "replication_source_identifier": null,
            "skip_final_snapshot": true,
            "snapshot_identifier": "arn:aws:rds:us-east-1:503330882943:cluster-snapshot:ae-sbx-neptune-db-snap-18aug",
            "storage_encrypted": false,
            "tags": null,
            "tags_all": null,
            "timeouts": null,
            "vpc_security_group_ids": []
          },
          "sensitive_attributes": [],
          "private": "eyJlMmJmYjczMC1lY2FhLTExZTYtOGY4OC0zNDM2M2JjN2M0YzAiOnsiY3JlYXRlIjo3MjAwMDAwMDAwMDAwLCJkZWxldGUiOjcyMDAwMDAwMDAwMDAsInVwZGF0ZSI6NzIwMDAwMDAwMDAwMH19",
          "dependencies": [
            "aws_neptune_cluster_snapshot.snapshot-18Aug"
          ]
        }
      ]
    }

Also, when I run a plan after the last apply without any change, it always wants to replace the cluster, so we are stuck in an endless loop of cluster re-creation. See the plan below.

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # aws_neptune_cluster.neptune-db-n is tainted, so must be replaced
-/+ resource "aws_neptune_cluster" "neptune-db-n" {
      + allow_major_version_upgrade          = (known after apply)
      ~ arn                                  = "arn:aws:rds:us-east-1:503330882943:cluster:ae-sbx-neptune-cluster-new" -> (known after apply)
      + cluster_identifier_prefix            = (known after apply)
      ~ cluster_members                      = [] -> (known after apply)
      ~ cluster_resource_id                  = "cluster-MGAITJE2YI5J7P4KRPX6LG7YAY" -> (known after apply)
      - deletion_protection                  = false -> null
      - enable_cloudwatch_logs_exports       = [] -> null
      ~ endpoint                             = "ae-sbx-neptune-cluster-new.cluster-czhruub712uf.us-east-1.neptune.amazonaws.com" -> (known after apply)
      ~ hosted_zone_id                       = "ZUFXD4SLT2LS7" -> (known after apply)
      - iam_roles                            = [] -> null
      ~ id                                   = "ae-sbx-neptune-cluster-new" -> (known after apply)
      + kms_key_arn                          = (known after apply)
      ~ neptune_subnet_group_name            = "default" -> (known after apply)
      ~ preferred_maintenance_window         = "mon:05:39-mon:06:09" -> (known after apply)
      ~ reader_endpoint                      = "ae-sbx-neptune-cluster-new.cluster-ro-czhruub712uf.us-east-1.neptune.amazonaws.com" -> (known after apply)
      - tags                                 = {} -> null
      ~ tags_all                             = {} -> (known after apply)
      ~ vpc_security_group_ids               = [
          - "sg-03f5bd30",
        ] -> (known after apply)
        # (14 unchanged attributes hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.
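
(A possible stop-gap for the replace loop, not a fix: since the cluster shows as Available in the console, the taint can be cleared manually with terraform untaint aws_neptune_cluster.neptune-db-n so the next plan does not destroy it. A later plan may still try to modify engine_version and hit the same error.)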

@bschaatsbergen @justinretzolk can you please look at this as a priority? This is a showstopper for us and many others. If there is a way to make this high priority, let us know and we would be happy to do that. Thanks in advance for your guidance. If the issue is already identified and there is any workaround until it is properly fixed, please let us know so we can unblock ourselves and move ahead. Thanks. @Danielcweber has already raised a pull request; when can we expect it to be merged into main? https://github.com/hashicorp/terraform-provider-aws/pull/25982

nikunjundhad commented 2 years ago

Any update on this issue guys?

roshanjoseph23 commented 2 years ago

Trying to restore a snapshot using Terraform and always ending up with the DBClusterRoleAlreadyExists error. I tried with the same engine version used in the snapshot, and even not specifying the engine version doesn't help.

tgourley01 commented 2 years ago

Wish this would get some attention. This is a non-starter for production environments. Are there any known workarounds? Older provider versions, maybe?

I see @danielcweber has raised pull request https://github.com/hashicorp/terraform-provider-aws/pull/25982, can it be merged?

slatsinoglou commented 2 years ago

We are experiencing the same issue. Is there any update on this?

pluksha commented 1 year ago

The same issue. Are there any updates?

vgarkusha commented 1 year ago

Same issue with provisioning Neptune from a snapshot, like @georgedivya has. Any updates on it?

danielcweber commented 1 year ago

Make sure you upvote the proposed PR https://github.com/hashicorp/terraform-provider-aws/pull/25982 if it works for you.

LennyCastaneda commented 1 year ago

Having the same issue here... what is the status of a solution?

Tom-Carpenter commented 1 year ago

This, coupled with https://github.com/hashicorp/terraform-provider-aws/issues/15563, makes the user experience for Terraform pretty poor.

joeynaor commented 1 year ago

Any updates on this? This issue completely breaks our DR pipeline

github-actions[bot] commented 1 year ago

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.