Closed: michelzanini closed this issue 3 years ago
Hello, sorry to hear you're having issues. It sounds like this might be related to https://github.com/phillbaker/terraform-provider-elasticsearch/commit/f924ab6c1d061db238dd49fab7365df9b89f7c3c (https://github.com/phillbaker/terraform-provider-elasticsearch/issues/114)
Thanks for providing details and an example provider config.
> It only seems to happen if I use aws_assume_role_arn and it does not when I use aws_profile.
I'm not quite following here, are you saying that a different provider config does work in v1.5.1? (Can you share/clarify examples?)
I use Terragrunt to write a different Terraform file depending on whether I am in a CI environment or on a laptop.
When on a laptop, this is the config I use:
provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = "my_profile"
  aws_assume_role_arn = ""
  sign_aws_requests   = true
}
When on CI env, this is the one I use:
provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_profile         = ""
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}
On a laptop, it uses aws_profile. On the CI server, it uses aws_assume_role_arn.
On 1.5.0, both config files work. On 1.5.1, it seems only the laptop with aws_profile works.
Thanks @michelzanini. Any chance the CI is running on EKS (https://github.com/phillbaker/terraform-provider-elasticsearch/issues/112)?
No, it's running on a standard EC2 instance.
I can confirm that aws_assume_role_arn is not working on 1.5.1. It is running on an EC2 instance with an IAM role attached to that instance.
When I turn off healthcheck, the execution blocks indefinitely.
My configuration is very similar:
provider "elasticsearch" {
  url                 = "https://custom.domain.com"
  aws_region          = "eu-west-1"
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true
}
For us, we use aws_profile, but it stopped working with 1.5.1:
provider "elasticsearch" {
  url               = "https://${module.logs_elasticsearch_remote.outputs.elasticsearch_endpoint}"
  aws_profile       = var.aws_profile
  sign_aws_requests = true
}
however, our profile looks like this:
[our-profile]
region = us-east-1
credential_source = Ec2InstanceMetadata
role_arn = arn:aws:iam::111111111111:role/ROLE_NAME
works fine on 1.5.0
Sorry for the delay here, I've reverted part of f924ab6c1d061db238dd49fab7365df9b89f7c3c and tagged a v1.5.2-beta (https://github.com/phillbaker/terraform-provider-elasticsearch/tree/v1.5.2-beta). That should get pushed to the Terraform registry shortly. Can you all please give that a try and let me know if this is resolved?
Hello, following up on this. Has anyone been able to try v1.5.2-beta?
On our side it did not fix the issue unfortunately:
[2021-01-12T09:50:31.485Z] - Using phillbaker/elasticsearch v1.5.2-beta from the shared cache directory
[2021-01-12T09:51:07.021Z] Error: health check timeout: Head "https://sssss.us-east-1.es.amazonaws.com": RequestCanceled: request context canceled
[2021-01-12T09:51:07.021Z] caused by: context deadline exceeded: no Elasticsearch node available
[2021-01-12T09:51:07.021Z]
[2021-01-12T09:51:07.021Z]
[2021-01-12T09:51:07.021Z]
[2021-01-12T09:51:07.021Z] Error: no active connection found: no Elasticsearch node available
Reverting to 1.5.0 still works.
Thanks. I reverted the upgrade of the AWS client and released v1.5.2-beta1. Can folks on this thread give that a try and update here?
Hi all, following up on this, has this been fixed in 1.5.2-beta1?
Hi all, 1.5.2 has been released, so I'm going to close this as fixed. I don't have a way to reproduce, so I can't test directly. Please re-open if there are further issues.
Sorry, I did not have time to test this before. I tested with 1.5.4 and it seems it is still not working.
I can confirm the commit that introduced this regression was #119. I built binaries for every commit until it broke, starting with that one.
I am going to have a deeper look now to see if I can spot the issue, but it was 100% introduced there. @phillbaker
Thanks @michelzanini that's very helpful. That strikes me as very odd, as #119 is primarily a change in timing of calls, as opposed to what calls are being made.
In order to narrow down the issue, could you try the following:
- sniff set to false in the provider config
- elasticsearch_version set to the correct Elasticsearch version, to skip pinging the cluster when creating a client

Even with sniff set to false and elasticsearch_version set, I still get the errors:
Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
on main.tf line 8, in resource "elasticsearch_opendistro_role" "read_indexes_role":
8: resource "elasticsearch_opendistro_role" "read_indexes_role" {
Error: health check timeout: Head "https://elasticsearch.mydomain.com": RequestCanceled: request context canceled
caused by: context deadline exceeded: no Elasticsearch node available
on main.tf line 59, in resource "elasticsearch_opendistro_user" "developer_users":
59: resource "elasticsearch_opendistro_user" "developer_users" {
Error: no active connection found: no Elasticsearch node available
on main.tf line 72, in resource "elasticsearch_opendistro_ism_policy" "ism_policy":
72: resource "elasticsearch_opendistro_ism_policy" "ism_policy" {
If I also set healthcheck to false, then there's no error, but the resources are never created and Terraform keeps running indefinitely. All resources keep printing Still creating... [100...s elapsed] etc.
This leads me to believe that there's some sort of race condition. I can't find the problem myself; I don't have enough Go or Elasticsearch knowledge to track this down on my own.
I will park this for now and stay locked to 1.5.0. Would you consider reverting PR #119?
Otherwise, you could test this by creating an EC2 instance and an Elasticsearch cluster, assigning an IAM role to the box, and running Terraform from there...
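For readers following along, here is a sketch of a provider config that combines the debugging options discussed in this thread (sniff, elasticsearch_version, and healthcheck are the provider attributes named in the comments above; the URL and role ARN are the placeholder values already used in earlier comments):

```hcl
provider "elasticsearch" {
  url                 = "https://elasticsearch.mydomain.com"
  aws_region          = "eu-west-1"
  aws_assume_role_arn = "arn:aws:iam::111111111:role/Role"
  sign_aws_requests   = true

  # Debugging options suggested above:
  sniff                 = false  # don't ask the cluster for its node list
  elasticsearch_version = "7.9"  # skip pinging the cluster when creating a client
  healthcheck           = false  # note: with this off, 1.5.x was reported to hang rather than error
}
```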
Not sure this will help, but these are the logs, which keep repeating like this forever:
(...)
elasticsearch_opendistro_role.read_indexes_role: Still creating... [40s elapsed]
2021/04/08 12:42:33 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:36 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:37 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:38 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:41 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:42 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
elasticsearch_opendistro_role.read_indexes_role: Still creating... [50s elapsed]
2021/04/08 12:42:43 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:46 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:47 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
2021/04/08 12:42:48 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:51 [TRACE] dag/walk: vertex "provider[\"registry.terraform.io/phillbaker/elasticsearch\"] (close)" is waiting for "elasticsearch_opendistro_role.read_indexes_role"
2021/04/08 12:42:52 [TRACE] dag/walk: vertex "root" is waiting for "meta.count-boundary (EachMode fixup)"
elasticsearch_opendistro_role.read_indexes_role: Still creating... [60s elapsed]
(...)
> Would you consider reverting PR #119?
Unfortunately, #119 touches too many pieces of code to revert now.
> Otherwise, you could test this by creating an EC2 instance and an Elasticsearch cluster, assigning an IAM role to the box, and running Terraform from there...
I don't currently have access to an AWS environment where I can test this unfortunately.
Here's one guess I have: the deferred instantiation of the client means that the client is initialized once per resource, versus once at provider instantiation. This may be a problem if there are many resources (which also require reads to prepare a plan) and the AWS client needs to query resources like the EC2 metadata API (which is rate limited).
@michelzanini @lifeofguenter approximately how many elasticsearch_* resources are being managed in Terraform?
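The hypothesis above can be made concrete with a small, self-contained Go sketch. Everything here is hypothetical illustration, not the provider's actual code: it contrasts a client that re-queries the (rate-limited) EC2 metadata API once per resource with one that caches the credentials via sync.Once, so many resources cost a single metadata call.

```go
package main

import (
	"fmt"
	"sync"
)

// credentials stands in for an AWS session token (hypothetical type).
type credentials struct{ token string }

// metadataCalls counts simulated hits against the EC2 instance
// metadata API, which is rate limited in practice.
var metadataCalls int

func fetchFromMetadata() credentials {
	metadataCalls++
	return credentials{token: "session-token"}
}

// lazyProvider defers client creation until a resource needs it, but
// caches the result so N resources still cost one metadata call.
type lazyProvider struct {
	once  sync.Once
	creds credentials
}

func (p *lazyProvider) client() credentials {
	p.once.Do(func() { p.creds = fetchFromMetadata() })
	return p.creds
}

// naiveLazyProvider re-fetches on every resource, multiplying metadata
// traffic by the number of resources read during a plan.
type naiveLazyProvider struct{}

func (p *naiveLazyProvider) client() credentials { return fetchFromMetadata() }

func main() {
	const resources = 10 // roughly the counts reported in this thread

	metadataCalls = 0
	cached := &lazyProvider{}
	for i := 0; i < resources; i++ {
		cached.client()
	}
	fmt.Println("with caching:", metadataCalls) // with caching: 1

	metadataCalls = 0
	naive := &naiveLazyProvider{}
	for i := 0; i < resources; i++ {
		naive.client()
	}
	fmt.Println("without caching:", metadataCalls) // without caching: 10
}
```

If the deferred instantiation introduced in #119 behaves like the naive variant, ten resources in a plan would multiply metadata (and STS assume-role) traffic tenfold, which is one way the observed hangs could arise.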
Hi @phillbaker, that makes a whole lot of sense. I have around 10 resources, more or less. Although you don't have AWS resources to test with, you could probably still test this behaviour with debugging?
We also did not have a lot of resources, maybe around 10 as well.
We heavily monitored IMDS and other rate limits, as this was indeed a general issue, but it was not the cause in this case, I think.
I don't think this can be tested easily though...
I would most probably look into how other providers utilize the aws-sdk. I do know, though, that especially for signed requests and ES there are some additional quirks.
I am not actively using this provider anymore, or else I would invest some time. I think using earlier versions is just fine for most use cases.
I can confirm this has been fixed on 1.5.7.
This may be fixed for aws_assume_role_arn, but transparent role-based authentication via EC2 metadata is broken after 1.5.0 as well. Unfortunately, I need to upgrade because of other bugs that are only fixed in later versions of the provider.
> transparent role-based authentication via EC2 metadata is broken after 1.5.0
Hi @marksumm, can you clarify exactly the method that's being used here? What environment variables are set? What EC2 metadata is being used?
@phillbaker I meant a situation where no authentication attributes or environment variables are passed to the provider, healthchecks are disabled, and AWS request signing is enabled. Running locally uses the AWS credentials file as expected, but running on an EC2 instance now hangs indefinitely because state refreshes for resources created using the provider never return. The EC2 instance has an assumed role and so a session token is available via the metadata endpoint. Everything described was working in 1.5.0.
@marksumm please share the elasticsearch provider config that is working on 1.5.0 and not working in more recent versions. What URL does the ES cluster have? And is it self-hosted or in the AWS Elasticsearch/OpenSearch service?
@phillbaker The provider is configured like this...
provider "elasticsearch" {
  url               = "https://********.us-east-1.es.amazonaws.com"
  sign_aws_requests = true
  healthcheck       = false
}
The endpoint is apparently Elasticsearch 7.7, but it seems that AWS has already started to make changes to the API following the switch to OpenSearch. For example, index patterns should now be nested inside ISM policies and not created as separate resources. By the way, I tried setting AWS_SDK_LOAD_CONFIG=1, but it didn't help.
@phillbaker I've noticed that if I log in to an affected EC2 instance and target an individual resource created by this provider during terraform plan (and there are no dependencies on other resources), then the state refresh operation no longer hangs. If I target more than one resource created by this provider, or run an unmodified terraform plan, then I see the hanging behaviour as before. This is true even for a configuration with a very small number of resources (3), which seems to point to an internal deadlock rather than an API rate-limiting issue. Interestingly, setting -parallelism=1 doesn't seem to help.
Hi @marksumm this should be addressed in 64f21df, it'll be released in 2.0.0-beta.2 (coming shortly).
@phillbaker It works! Thank you so much.
After upgrading to 1.5.1 I am getting the following error. It could be related to aws_assume_role_arn, as I use it in my provider config. It only seems to happen if I use aws_assume_role_arn, and it does not when I use aws_profile. I am using Elasticsearch 7.9.
Reverting back to 1.5.0 makes the error disappear. I see there are significant changes in this PR: https://github.com/phillbaker/terraform-provider-elasticsearch/pull/119, maybe it's related.
Thanks.