hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws
Mozilla Public License 2.0

Failed ES domain upgrade error isn't helpful #11061

Open tomelliff opened 4 years ago

tomelliff commented 4 years ago

Terraform Version

Terraform v0.12.10

Affected Resource(s)

aws_elasticsearch_domain

Debug Output

The relevant part of the debug log is small so posting it directly here:

2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: 2019/11/28 22:56:01 [DEBUG] [aws-sdk-go] DEBUG: Response es/GetUpgradeStatus Details:
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: ---[ RESPONSE ]--------------------------------------
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: HTTP/1.1 200 OK
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: Connection: close
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: Content-Length: 97
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: Content-Type: application/json
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: Date: Thu, 28 Nov 2019 22:56:00 GMT
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: X-Amzn-Requestid: 3f850bbc-1232-11ea-bc06-1fdf099cbf0b
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: 
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: 
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: -----------------------------------------------------
2019-11-28T22:56:01.273Z [DEBUG] plugin.terraform-provider-aws_v2.33.99_x4: 2019/11/28 22:56:01 [DEBUG] [aws-sdk-go] {"StepStatus":"FAILED","UpgradeName":"Upgrade from 6.8 to 7.1","UpgradeStep":"PRE_UPGRADE_CHECK"}
2019/11/28 22:56:01 [DEBUG] module.elasticsearch.aws_elasticsearch_domain.elasticsearch: apply errored, but we're indicating that via the Error pointer rather than returning it: unexpected state 'FAILED', wanted target 'SUCCEEDED'. last error: %!s(<nil>)

Expected Behavior

My cluster is failing the upgrade eligibility checks, and I'd expect Terraform to report the underlying error with something like the following:

Cluster has 1160.0 shards per node which exceeds the setting cluster.max_shards_per_node value 1000

Actual Behavior

Error: unexpected state 'FAILED', wanted target 'SUCCEEDED'. last error: %!s(<nil>)

Steps to Reproduce

  1. Get an ES cluster into a state where it can't be upgraded, for whatever reason
  2. Set the Terraform config to upgrade the version via a valid in-place upgrade path
  3. terraform apply

Important Factoids

I moved a 2 AZ ES cluster to 3 AZs in place, immediately upgraded it to 6.8, and then attempted a further upgrade to 7.2, which causes the above error on the AWS side. That part is fine, but I'd expect Terraform to show the actual error instead of %!s(<nil>).

I wrote this in-place upgrade code but had no good way of inducing an upgrade failure, so I couldn't really test what happens in that case. It looks like AWS's API doesn't return an error, just the FAILED StepStatus field. The GetUpgradeHistory API endpoint shows the results of any attempted upgrades in reverse chronological order, so we could retrieve the first failed result for the domain from it and return the list of UpgradeStepItem.Issues.

I'm wary, though, that I don't know a good way to force an ES cluster into a bad state, so this might be tricky to test once my ES cluster is back in a good place.

justinretzolk commented 2 years ago

Hey @tomelliff 👋 Thank you for taking the time to file this issue! Given that there's been a number of AWS provider releases since you initially filed it, can you confirm whether you're still experiencing this behavior?

obourdon commented 2 years ago

@justinretzolk I just reproduced it with Terraform AWS provider 3.75.1 (the latest pre-4.0 version).

My branch fixes this as follows:

Error: error waiting for Elasticsearch Domain Upgrade (arn:aws:es:eu-west-1:614455314739:domain/logs) to succeed: Upgrade from 6.8 to 7.10 FAILED: PRE_UPGRADE_CHECK

I'm still working on adding appropriate tests and more insights, as well as running the regression tests.

obourdon commented 2 years ago

With a new commit in my branch above I was also able to retrieve more detailed information as follows:

Error: error waiting for Elasticsearch Domain Upgrade (arn:aws:es:eu-west-1:614455314739:domain/logs) to succeed: Upgrade from 6.8 to 7.10 FAILED: PRE_UPGRADE_CHECK

    Cluster has 1491 shards per node which exceeds the setting cluster.max_shards_per_node value 1000

obourdon commented 2 years ago

Hi HashiCorp / AWS TF provider core team,

In the past I have submitted some patches against the master branch, but my fix branch is currently based on tag 3.75.1.

What would be the appropriate way to submit my fix for this issue, please?

Should I try to cherry-pick the changes onto master? Many thanks for any insight.

obourdon commented 2 years ago

So far I have not been able to successfully run the regression tests against the us-west-1 region:

=== CONT  TestAccElasticsearchDomainDataSource_Data_basic
=== CONT  TestAccElasticsearchDomain_AdvancedSecurityOptions_userDB
--- PASS: TestAccElasticsearchDomainDataSource_Data_basic (1524.16s)
=== CONT  TestAccElasticsearchDomain_policyIgnoreEquivalent
--- PASS: TestAccElasticsearchDomain_AdvancedSecurityOptions_userDB (1542.69s)
=== CONT  TestAccElasticsearchDomain_disappears
--- PASS: TestAccElasticsearchDomain_policyIgnoreEquivalent (1450.18s)
=== CONT  TestAccElasticsearchDomain_Update_version
--- PASS: TestAccElasticsearchDomain_disappears (1515.88s)
=== CONT  TestAccElasticsearchDomain_WithVolumeType_missing
--- PASS: TestAccElasticsearchDomain_WithVolumeType_missing (1192.05s)
=== CONT  TestAccElasticsearchDomain_UpdateVolume_type
--- PASS: TestAccElasticsearchDomain_Update_version (4165.17s)
=== CONT  TestAccElasticsearchDomain_update
--- PASS: TestAccElasticsearchDomain_UpdateVolume_type (3254.57s)
=== CONT  TestAccElasticsearchDomain_tags
--- PASS: TestAccElasticsearchDomain_tags (2097.99s)
=== CONT  TestAccElasticsearchDomain_nodeToNodeEncryption
--- PASS: TestAccElasticsearchDomain_update (2706.42s)
=== CONT  TestAccElasticsearchDomain_EncryptAtRestSpecify_key
--- PASS: TestAccElasticsearchDomain_nodeToNodeEncryption (1289.54s)
=== CONT  TestAccElasticsearchDomain_EncryptAtRestDefault_key
--- PASS: TestAccElasticsearchDomain_EncryptAtRestSpecify_key (1253.09s)
=== CONT  TestAccElasticsearchDomain_Cluster_zoneAwareness
    domain_test.go:146: Step 1/5 error: Error running apply: exit status 1
        2022/03/30 14:35:29 [DEBUG] Using modified User-Agent: Terraform/0.12.31 HashiCorp-terraform-exec/0.15.0

        Error: Error creating Elasticsearch domain: DisabledOperationException: You don't have permission to select three availability zones

          on terraform_plugin_test.tf line 2, in resource "aws_elasticsearch_domain" "test":
           2: resource "aws_elasticsearch_domain" "test" {

--- FAIL: TestAccElasticsearchDomain_Cluster_zoneAwareness (9.07s)
=== CONT  TestAccElasticsearchDomain_AutoTuneOptions
--- PASS: TestAccElasticsearchDomain_EncryptAtRestDefault_key (1299.12s)
=== CONT  TestAccElasticsearchDomain_internetToVPCEndpoint
--- PASS: TestAccElasticsearchDomain_AutoTuneOptions (1623.00s)
=== CONT  TestAccElasticsearchDomain_VPC_update
panic: test timed out after 4h0m0s

I raised the timeout from 3h to 4h without more success (with parallelism set to 2 because of my laptop's constraints). I will increase it again and give it another try.

obourdon commented 2 years ago

For the three-zone error, I just found out that us-west-1 only has 2 zones, so I will switch to us-west-2 (4 zones).

obourdon commented 2 years ago

:-( Only a little more luck after 8h on us-west-2.

At least the previously failing test now passes:

=== CONT  TestAccElasticsearchDomainDataSource_Data_basic
=== CONT  TestAccElasticsearchDomain_Update_version
--- PASS: TestAccElasticsearchDomainDataSource_Data_basic (1741.86s)
=== CONT  TestAccElasticsearchDomain_AutoTuneOptions
--- PASS: TestAccElasticsearchDomain_AutoTuneOptions (1723.72s)
=== CONT  TestAccElasticsearchDomain_WithVolumeType_missing
--- PASS: TestAccElasticsearchDomain_Update_version (4243.93s)
=== CONT  TestAccElasticsearchDomain_UpdateVolume_type
--- PASS: TestAccElasticsearchDomain_WithVolumeType_missing (1181.96s)
=== CONT  TestAccElasticsearchDomain_update
--- PASS: TestAccElasticsearchDomain_update (2638.81s)
=== CONT  TestAccElasticsearchDomain_tags
--- PASS: TestAccElasticsearchDomain_UpdateVolume_type (3716.47s)
=== CONT  TestAccElasticsearchDomain_nodeToNodeEncryption
--- PASS: TestAccElasticsearchDomain_tags (1403.58s)
=== CONT  TestAccElasticsearchDomain_EncryptAtRestSpecify_key
--- PASS: TestAccElasticsearchDomain_EncryptAtRestSpecify_key (1374.62s)
=== CONT  TestAccElasticsearchDomain_EncryptAtRestDefault_key
--- PASS: TestAccElasticsearchDomain_nodeToNodeEncryption (2155.89s)
=== CONT  TestAccElasticsearchDomain_policyIgnoreEquivalent
--- PASS: TestAccElasticsearchDomain_policyIgnoreEquivalent (1289.11s)
=== CONT  TestAccElasticsearchDomain_policy
--- PASS: TestAccElasticsearchDomain_EncryptAtRestDefault_key (1399.39s)
=== CONT  TestAccElasticsearchDomain_cognitoOptionsUpdate
--- PASS: TestAccElasticsearchDomain_policy (1249.13s)
=== CONT  TestAccElasticsearchDomain_cognitoOptionsCreateAndRemove
--- PASS: TestAccElasticsearchDomain_cognitoOptionsUpdate (2470.77s)
=== CONT  TestAccElasticsearchDomain_LogPublishingOptions_auditLogs
--- PASS: TestAccElasticsearchDomain_cognitoOptionsCreateAndRemove (2913.92s)
=== CONT  TestAccElasticsearchDomain_LogPublishingOptions_esApplicationLogs
--- PASS: TestAccElasticsearchDomain_LogPublishingOptions_auditLogs (1943.01s)
=== CONT  TestAccElasticsearchDomain_LogPublishingOptions_searchSlowLogs
--- PASS: TestAccElasticsearchDomain_LogPublishingOptions_esApplicationLogs (1630.03s)
=== CONT  TestAccElasticsearchDomain_disappears
--- PASS: TestAccElasticsearchDomain_LogPublishingOptions_searchSlowLogs (1641.53s)
=== CONT  TestAccElasticsearchDomain_LogPublishingOptions_indexSlowLogs
--- PASS: TestAccElasticsearchDomain_disappears (1311.56s)
=== CONT  TestAccElasticsearchDomain_AdvancedSecurityOptions_disabled
--- PASS: TestAccElasticsearchDomain_LogPublishingOptions_indexSlowLogs (1597.73s)
=== CONT  TestAccElasticsearchDomain_AdvancedSecurityOptions_userDB
--- PASS: TestAccElasticsearchDomain_AdvancedSecurityOptions_disabled (1828.39s)
=== CONT  TestAccElasticsearchDomain_customEndpoint
--- PASS: TestAccElasticsearchDomain_AdvancedSecurityOptions_userDB (1547.48s)
=== CONT  TestAccElasticsearchDomain_internetToVPCEndpoint
--- PASS: TestAccElasticsearchDomain_customEndpoint (3026.41s)
=== CONT  TestAccElasticsearchDomain_AdvancedSecurityOptions_iam
--- PASS: TestAccElasticsearchDomain_internetToVPCEndpoint (3265.56s)
=== CONT  TestAccElasticsearchDomainSamlOptions_disappears_Domain
--- PASS: TestAccElasticsearchDomain_AdvancedSecurityOptions_iam (1680.43s)
=== CONT  TestAccElasticsearchDomain_requireHTTPS
--- PASS: TestAccElasticsearchDomainSamlOptions_disappears_Domain (1436.80s)
=== CONT  TestAccElasticsearchDomain_basic
--- PASS: TestAccElasticsearchDomain_basic (1493.76s)
=== CONT  TestAccElasticsearchDomainSamlOptions_Disabled
--- PASS: TestAccElasticsearchDomain_requireHTTPS (2650.19s)
=== CONT  TestAccElasticsearchDomainSamlOptions_Update
--- PASS: TestAccElasticsearchDomainSamlOptions_Disabled (1682.28s)
=== CONT  TestAccElasticsearchDomain_VPC_update
panic: test timed out after 8h0m0s

obourdon commented 2 years ago

Any insights on this, please?

obourdon commented 2 years ago

Can someone help with this, please?

obourdon commented 1 year ago

Anyone?