Regression issue on 1.5.1

michelzanini commented 3 years ago

After upgrading to 1.5.1 I am getting the following error:

Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available

It could be related to aws_assume_role_arn as I use it on my provider config:

provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = ""
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}

It only seems to happen if I use aws_assume_role_arn and it does not when I use aws_profile. I am using Elasticsearch 7.9.

After reverting to 1.5.0, the error disappears.

I see there are significant changes in this PR, which may be related: https://github.com/phillbaker/terraform-provider-elasticsearch/pull/119

Thanks.

phillbaker commented 3 years ago

Hello, sorry to hear you're having issues. It sounds like this might be related to https://github.com/phillbaker/terraform-provider-elasticsearch/commit/f924ab6c1d061db238dd49fab7365df9b89f7c3c (https://github.com/phillbaker/terraform-provider-elasticsearch/issues/114)

Thanks for providing details and an example provider config.

It only seems to happen if I use aws_assume_role_arn and it does not when I use aws_profile.

I'm not quite following here, are you saying that a different provider config does work in v1.5.1? (Can you share/clarify examples?)

michelzanini commented 3 years ago

I use Terragrunt to write a different Terraform file depending on whether I am in a CI environment or on a laptop.

When on a laptop, this is the config I use:

provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = "my_profile"
  aws_assume_role_arn = ""
  sign_aws_requests   = true
}

When on CI env, this is the one I use:

provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = ""
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}

On a laptop, it uses aws_profile. On CI server, it uses aws_assume_role_arn. On 1.5.0 both config files work. On 1.5.1, it seems only the laptop with aws_profile works.
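For readers unfamiliar with this setup, the switch between the two configs could be sketched roughly as below. The thread does not show the actual Terragrunt file, so everything here is an assumption for illustration: the use of a `generate` block, the `on_ci` local, and a `CI` environment variable set by the CI system are all hypothetical.

```hcl
# terragrunt.hcl (hypothetical sketch, not the actual file from this thread)
locals {
  # Assumption: the CI system exports a non-empty CI variable;
  # on a laptop the variable is unset.
  on_ci = get_env("CI", "") != ""
}

# Writes provider.tf before Terraform runs, with exactly one of
# aws_profile / aws_assume_role_arn filled in.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite"
  contents  = <<EOF
provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = "${local.on_ci ? "" : "my_profile"}"
  aws_assume_role_arn = "${local.on_ci ? "arn:aws:iam::111111111:role/Role" : ""}"
  sign_aws_requests   = true
}
EOF
}
```

Either way, the generated provider block ends up identical to one of the two configs shown above.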

phillbaker commented 3 years ago

Thanks @michelzanini. Any chance the CI is running on EKS (https://github.com/phillbaker/terraform-provider-elasticsearch/issues/112)?

michelzanini commented 3 years ago

No, it's running on a standard EC2 instance.

Delorien84 commented 3 years ago

I can confirm that aws_assume_role_arn is not working on 1.5.1. It is running on an EC2 instance with an IAM role attached to it.

When I turn off healthcheck, the execution blocks indefinitely.

My configuration is very similar:

provider "elasticsearch" {
  url                 = "https://custom.domain.com"
  aws_region          = "eu-west-1"
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}

lifeofguenter commented 3 years ago

We use aws_profile, but it also stopped working with 1.5.1:

provider "elasticsearch" {
  url               = "https://${module.logs_elasticsearch_remote.outputs.elasticsearch_endpoint}"
  aws_profile       = var.aws_profile
  sign_aws_requests = true
}

however, our profile looks like this:

[our-profile]
region            = us-east-1
credential_source = Ec2InstanceMetadata
role_arn          = arn:aws:iam::111111111111:role/ROLE_NAME

works fine on 1.5.0

phillbaker commented 3 years ago

Sorry for the delay here. I've reverted part of f924ab6c1d061db238dd49fab7365df9b89f7c3c and tagged a v1.5.2-beta (https://github.com/phillbaker/terraform-provider-elasticsearch/tree/v1.5.2-beta). That should get pushed to the Terraform registry shortly. Can you all please give that a try and let me know if this is resolved?

phillbaker commented 3 years ago

Hello, following up on this. Has anyone been able to try v1.5.2-beta?

lifeofguenter commented 3 years ago

On our side it did not fix the issue unfortunately:

[2021-01-12T09:50:31.485Z] - Using phillbaker/elasticsearch v1.5.2-beta from the shared cache directory

[2021-01-12T09:51:07.021Z] Error: health check timeout: Head "https://sssss.us-east-1.es.amazonaws.com": RequestCanceled: request context canceled
[2021-01-12T09:51:07.021Z] caused by: context deadline exceeded: no Elasticsearch node available
[2021-01-12T09:51:07.021Z] 
[2021-01-12T09:51:07.021Z] 
[2021-01-12T09:51:07.021Z] 
[2021-01-12T09:51:07.021Z] Error: no active connection found: no Elasticsearch node available

reverting to 1.5.0 still works

phillbaker commented 3 years ago

Thanks. I reverted the upgrade of the AWS client and released v1.5.2-beta1. Can folks on this thread give that a try and update here?

phillbaker commented 3 years ago

Hi all, following up on this: has this been fixed in 1.5.2-beta1?

phillbaker commented 3 years ago

Hi all, 1.5.2 has been released. I'm going to close this as fixed, though I don't have a way to reproduce, so I can't test directly. Please re-open if there are further issues.

michelzanini commented 3 years ago

Sorry, I did not have time to test this before. I tested with 1.5.4 and it seems it is still not working.

michelzanini commented 3 years ago

I can confirm the commit that introduced this regression was #119. I built binaries for every commit, and it broke starting on that one.

I am going to have a deeper look now to see if I can spot the issue, but 100% it was there. @phillbaker

phillbaker commented 3 years ago

Thanks @michelzanini that's very helpful. That strikes me as very odd, as #119 is primarily a change in timing of calls, as opposed to what calls are being made.

In order to narrow down the issue, could you try the following:

Try setting sniff to false in the provider config.
Try setting elasticsearch_version to the correct Elasticsearch version, to skip pinging the cluster when creating a client.
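As a rough sketch, both suggestions applied to the CI config from earlier in the thread might look like this. The version string "7.9.0" is an assumption based on the reported Elasticsearch 7.9 and should be replaced with the cluster's real version.

```hcl
provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true

  # Disable sniffing (node discovery) so only the configured URL is used.
  sniff = false

  # Assumed value: set this to the cluster's actual version so the
  # provider can skip pinging the cluster when creating a client.
  elasticsearch_version = "7.9.0"
}
```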

michelzanini commented 3 years ago

Even with sniff set to false and elasticsearch_version set, I still get the errors:

Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available

  on main.tf line 8, in resource "elasticsearch_opendistro_role" "read_indexes_role":
   8: resource "elasticsearch_opendistro_role" "read_indexes_role" {

Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available

  on main.tf line 59, in resource "elasticsearch_opendistro_user" "developer_users":
  59: resource "elasticsearch_opendistro_user" "developer_users" {

Error: no active connection found: no Elasticsearch node available

  on main.tf line 72, in resource "elasticsearch_opendistro_ism_policy" "ism_policy":
  72: resource "elasticsearch_opendistro_ism_policy" "ism_policy" {

If I also set healthcheck to false, then there's no error, but the resources are never created and Terraform keeps running indefinitely. All resources keep printing Still creating... [100...s elapsed] etc.

This leads me to believe that there's some sort of race condition. I can't find the problem myself and I do not have enough Go or Elasticsearch knowledge to find this on my own.

I will park this for now and stay locked to 1.5.0. Would you consider reverting PR #119?

Or else you can test this by creating one AWS instance and an Elasticsearch cluster, assigning an IAM role to the box, and running Terraform from there...
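For anyone else staying on the last working release in the meantime, a version pin could look roughly like the following. This is a sketch assuming Terraform 0.13 or later; the source address is taken from the registry path that appears in the logs in this thread.

```hcl
terraform {
  required_providers {
    elasticsearch = {
      source = "phillbaker/elasticsearch"

      # Exact pin to the last release reported as working in this thread.
      version = "1.5.0"
    }
  }
}
```

After changing the constraint, `terraform init -upgrade` re-selects the provider version.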

michelzanini commented 3 years ago

Not sure this will help, but these are the logs, which keep repeating like this forever:

(...)
elasticsearch_opendistro_role.read_indexes_role: Still creating... [40s elapsed]
2021/04/08 12:42:33 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:36 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:37 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:38 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:41 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:42 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
elasticsearch_opendistro_role.read_indexes_role: Still creating... [50s elapsed]
2021/04/08 12:42:43 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:46 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:47 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:48 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:51 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:52 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
elasticsearch_opendistro_role.read_indexes_role: Still creating... [60s elapsed]
2021/04/08 12:42:33 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:36 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:37 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:38 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:41 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:42 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
(...)

phillbaker commented 3 years ago

Would you consider reverting PR #119?

Unfortunately, #119 touches too many pieces of code to revert now.

Or else you can test this by creating one AWS instance and an Elasticsearch cluster, assigning an IAM role to the box, and running Terraform from there...

I don't currently have access to an AWS environment where I can test this unfortunately.

phillbaker commented 3 years ago

Here's one guess I have: the deferred instantiation of the client means that the client is initialized once per resource, versus once at provider instantiation. This may be a problem if there are many resources (which also require reads to prepare a plan) and the AWS client needs to query resources like the EC2 metadata API (which is rate limited).

@michelzanini @lifeofguenter approximately how many elasticsearch_* resources are being managed in terraform?

michelzanini commented 3 years ago

Hi @phillbaker, that makes a whole lot of sense. I have around 10 resources, more or less. Although you don't have AWS resources to test with, you can probably still test this behaviour with debugging?

lifeofguenter commented 3 years ago

we also did not have a lot of resources. Maybe around 10 as well.

We heavily monitored IMDS and other rate-limits as this was indeed a general issue, but was not the cause in this case - I think.

I don't think this can be tested easily though...

I would most probably look into how other providers utilize aws-sdk. I do know, though, that there are some additional quirks, especially for signed requests and ES.

I am not actively using this provider anymore, otherwise I would invest some time. I think using earlier versions is just fine for most use cases.

michelzanini commented 3 years ago

I can confirm this has been fixed on 1.5.7.

marksumm commented 3 years ago

This may be fixed for aws_assume_role_arn but transparent role-based authentication via EC2 metadata is broken after 1.5.0 as well. Unfortunately, I need to upgrade because of other bugs that are only fixed in later versions of the provider.

phillbaker commented 3 years ago

transparent role-based authentication via EC2 metadata is broken after 1.5.0

Hi @marksumm, can you clarify exactly which method is being used here? What environment variables are set? What EC2 metadata is being used?

marksumm commented 3 years ago

@phillbaker I meant a situation where no authentication attributes or environment variables are passed to the provider, healthchecks are disabled, and AWS request signing is enabled. Running locally uses the AWS credentials file as expected, but running on an EC2 instance now hangs indefinitely because state refreshes for resources created using the provider never return. The EC2 instance has an assumed role and so a session token is available via the metadata endpoint. Everything described was working in 1.5.0.

phillbaker commented 3 years ago

@marksumm please share the elasticsearch provider config that is working on 1.5.0 and not working in more recent versions. What url does the ES cluster have? And is it self hosted or in the AWS Elastic/Opensearch service?

marksumm commented 3 years ago

@phillbaker The provider is configured like this...

provider "elasticsearch" {
  url               = "https://********.us-east-1.es.amazonaws.com"
  sign_aws_requests = true
  healthcheck       = false
}

The endpoint is apparently Elasticsearch 7.7, but it seems that AWS have already started to make changes to the API following the switch to OpenSearch. For example, index patterns should now be nested inside ISM policies and not created as separate resources. By the way, I tried setting AWS_SDK_LOAD_CONFIG=1, but it didn't help.

marksumm commented 3 years ago

@phillbaker I've noticed that if I log in to an affected EC2 instance and target an individual resource created by this provider during terraform plan (and there are no dependencies on other resources), then the state refresh operation no longer hangs. If I attempt to target more than one resource created by this provider, or run an unmodified terraform plan, then I see the hanging behaviour as before. This is true even for a configuration with a very small number of resources (3), which seems to point to an internal deadlock, rather than an API limiting issue. Interestingly, setting -parallelism 1 doesn't seem to help.

phillbaker commented 3 years ago

Hi @marksumm this should be addressed in 64f21df, it'll be released in 2.0.0-beta.2 (coming shortly).

marksumm commented 3 years ago

@phillbaker It works! Thank you so much.

phillbaker / terraform-provider-elasticsearch

Regression issue on 1.5.1 #124