Open · nomeelnoj opened this issue 4 years ago
Hmm now I am even more confused. I just ran it again after many times of it failing so that I could grab the debug logs, and it is passing. My guess would be that there is some sort of race condition with the policy being available for ES to use when it is created.
One more example: I ran `terraform apply` using the following `resources` in the `cloudwatch_log_resource_policy` document:

```hcl
resources = [
  for k, v in var.log_publishing_options : "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:/aws/aes/${var.domain_name}/${k}:*"
]
```

and it failed with the same error. However, I then ran `terraform destroy -auto-approve && terraform apply -auto-approve` and it worked.
Hey @nomeelnoj, can you try adding a `depends_on` for the policy resource in the ES resource? I think this should solve your issue. It seems the ES service tries to create the log stream before an appropriate policy is attached to the log group.
BTW, the resource policy with the `*` at the end is probably there because ES needs to create log streams inside the log group, and since they are dynamically created you have to use a wildcard in the policy. As long as it is scoped to a single log group, I would argue it is still secure.
Let me know if it works for you.
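A minimal sketch of this suggestion, using the resource names that appear later in this thread (the domain arguments and the log-group reference are placeholders):

```hcl
resource "aws_elasticsearch_domain" "example" {
  domain_name = var.domain_name

  log_publishing_options {
    log_type                 = "SEARCH_SLOW_LOGS"
    cloudwatch_log_group_arn = aws_cloudwatch_log_group.es_logs["search"].arn
  }

  # Make sure the CloudWatch Logs resource policy (and the
  # service-linked role) exist before the domain is created.
  depends_on = [
    aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy,
    aws_iam_service_linked_role.es,
  ]
}
```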
Hey @DrFaust92. Originally I had

```hcl
depends_on = [aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy, aws_iam_service_linked_role.es]
```

inside the ES domain resource, and this issue was still occurring. The first thing I did was remove it, and that actually helped (it passed once, but continued to fail after that).
I will try it again, but unfortunately it is hard to know for sure, since I have now applied and destroyed this infra about 10 times with the same configs and it has created the cluster in maybe 3 of those runs. Since it takes so long to create and delete an ES cluster, each test can take about an hour to both create and delete.
Okay, so I just tested it again after adding back the `depends_on`, and it seems to be working. In another 30-40 minutes, once I have had a chance to create and destroy, I will try it again. If it passes 3 times in a row I will call this resolved, but it feels VERY strange that it was previously not working with that setting and now it is. Nothing else about the code has changed...
It could be an eventual-consistency issue on the ES service side, in which case a retry for the specific error you are hitting is needed here.
BTW, I think another issue here is the multiple `aws_cloudwatch_log_resource_policy` resources. I think you only need one of these in this case. The Terraform docs do not call this out, but it seems to be a service-level resource, not one specific to a single log group.
To simplify the `resources` in the policy, I checked that the following is a bit cleaner and gives the same result (slightly different from the `v.arn` approach):

```hcl
resources = [
  for log_group in aws_cloudwatch_log_group.es_logs : "${log_group.arn}:*"
]
```
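Putting the two suggestions together, a single service-level policy covering all of the log groups could look like this (a sketch; `aws_cloudwatch_log_group.es_logs` and the policy name are placeholders):

```hcl
data "aws_iam_policy_document" "es_log_publishing" {
  statement {
    actions = [
      "logs:CreateLogStream",
      "logs:PutLogEvents",
      "logs:PutLogEventsBatch",
    ]

    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }

    # ":*" covers the log streams ES creates dynamically in each group.
    resources = [
      for log_group in aws_cloudwatch_log_group.es_logs : "${log_group.arn}:*"
    ]
  }
}

# One service-level resource policy shared by all domains, instead of
# one aws_cloudwatch_log_resource_policy per domain.
resource "aws_cloudwatch_log_resource_policy" "es_log_publishing" {
  policy_name     = "es-log-publishing"
  policy_document = data.aws_iam_policy_document.es_log_publishing.json
}
```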
> BTW, I think another issue here is the multiple `aws_cloudwatch_log_resource_policy` resources. I think you only need one of these in this case. The Terraform docs do not call this out, but it seems to be a service-level resource, not one specific to a single log group.
Yes, I believe this is true, but we want to manage these ES domains as 100% self-contained, so we would like to create a policy per cluster.
I just tried again with the exact same code (including the `depends_on`) and it failed this time. I will try changing to your simpler resource policy statement and see if that has anything to do with it.
Even a hack here would work for us. I think it takes AWS a bit to verify the policy is in place. Is there any type of hack I could put in that would force the ES cluster to wait even 10 seconds before creating? Thinking about something like:

```hcl
resource "null_resource" "wait_10" {
  depends_on = [aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy]

  provisioner "local-exec" {
    command     = "sleep 10"
    interpreter = ["/bin/sh", "-c"]
  }
}
```

Then, in the ES resource, add a `depends_on` for the null resource above.
UPDATE: it failed again. Going to try the null resource.
Try making it 2 minutes, as that is the default used in the provider to wait for IAM propagation. The hack looks good 😄. I'll try to add a retry for this case and see if it solves it.
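As a declarative alternative to the `null_resource`/`local-exec` hack, the hashicorp/time provider's `time_sleep` resource can express the same delay (a sketch; the 2-minute duration follows the suggestion above):

```hcl
# Requires the hashicorp/time provider.
resource "time_sleep" "wait_for_log_policy" {
  create_duration = "2m"

  depends_on = [aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy]
}

# Then, in the ES domain resource:
#   depends_on = [time_sleep.wait_for_log_policy]
```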
Great, thanks! Let me know what you find out. Not sure what is going on here, because in my tests, if you set the `resources` to a much wider value, something like:

```hcl
resources = [
  "arn:aws:logs:*"
]
```

it always seems to work. I do not know why this would make any difference.
I tried to recreate the issue and was not able to (with and without `depends_on`).
Two things I did differently: first, I used the regional DNS for the principal (can be achieved with `es.${data.aws_partition.current.dns_suffix}` instead of `es.amazonaws.com`).
The other thing I did differently, AFAIK, is that I did this with a single policy resource, as I went back to the official docs and found this:

> CloudWatch Logs supports 10 resource policies per Region. If you plan to enable slow logs for several Amazon ES domains, you should create and reuse a broader policy that includes multiple log groups to avoid reaching this limit.

It might be easier to have something like `arn:aws:logs:*`, or maybe a bit more explicit like `arn:aws:logs:{region}:{account-number}:log-group:/aws/aes/{common-prefix}*`. I didn't check a policy with the latter, so I'm not sure how accurate it is.
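The more explicit common-prefix variant could be written like this (unverified, as noted above; `var.common_prefix` and the data source references are assumptions):

```hcl
resources = [
  "arn:aws:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/aws/aes/${var.common_prefix}*"
]
```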
I can add the following just as another retryable error for the ES service (I see there are a bunch of them already for other cases):

```go
if isAWSErr(err, "ValidationException", "The Resource Access Policy specified for the CloudWatch Logs log group /aws/aes/example-domain/search does not grant sufficient permissions for Amazon Elasticsearch Service to create a log stream. Please check the Resource Access Policy") {
	return resource.RetryableError(err)
}
```

@anGie44, thoughts?
The problem is not solved yet, as I got the error below:

```
Error: Error creating ElasticSearch domain: ValidationException: The Resource Access Policy specified for the CloudWatch Logs log group /aws/aes/domains/
```

Does anyone know whether this is a Terraform-customized message or an actual AWS error response? It would help to understand whether it comes from AWS or not.
@totten255 this is an AWS response
Has anyone got the fix here? I am facing the same issue, and if I use `"arn:aws:logs:*"` it works, so I don't know what's happening here.
This is still happening. Using `arn:aws:logs:*` seems to work alright, but I can't figure out the reason. I've tried different dependencies and local waits; nothing helps.
Any updates regarding this issue?
Still seeing the same issue with the following policy document, despite adding an explicit `depends_on` to the `aws_opensearch_domain` resource:

```hcl
data "aws_iam_policy_document" "opensearch_cloudwatch_policy" {
  statement {
    actions = [
      "logs:PutLogEvents",
      "logs:PutLogEventsBatch",
      "logs:CreateLogStream",
    ]

    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }

    resources = [
      "arn:aws:logs:*",
    ]

    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"
      values = [
        var.aws_account_id,
      ]
    }

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"
      values = [
        var.domain_name,
      ]
    }
  }
}
```
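One detail worth flagging in policies like the one above: `aws:SourceArn` with `ArnLike` is compared against the domain's ARN, not a bare domain name, so if `var.domain_name` holds only the name, the condition may never match. A sketch of the condition with a constructed ARN (the `var.aws_region` variable is an assumption):

```hcl
condition {
  test     = "ArnLike"
  variable = "aws:SourceArn"

  # Compare against the domain ARN, not the bare domain name.
  values = [
    "arn:aws:es:${var.aws_region}:${var.aws_account_id}:domain/${var.domain_name}",
  ]
}
```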
Anyone able to help with this please?
Any updates regarding this issue?
same, any updates?
I used this policy when trying to create an OpenSearch cluster:

```hcl
data "aws_iam_policy_document" "main" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }

    actions = [
      "es:*",
      "logs:*",
    ]

    resources = [
      "arn:aws:es:eu-west-1:123123123:*",
      "arn:aws:logs:*",
    ]
  }
}
```

But the error always says it does not have enough permission to create the CloudWatch log groups. How is that possible, even with full access? 🧐 Does it see the policy correctly when we point to it with `access_policies = data.aws_iam_policy_document.main.json` from the `aws_opensearch_domain` resource? Did anyone figure out the correct way to use policies that way? The inline approach (the old way) would have worked, I guess. What am I missing?
Maybe I should not combine OpenSearch rules and the CloudWatch rule in the same policy? But it does work as long as I disable the log group options in the OpenSearch resource, so the policy itself is fine; the OpenSearch resource just won't accept it if log groups are enabled. Possibly a bug in the verification when `aws_opensearch_domain` runs? A false positive, IMHO.
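For what it's worth, earlier comments in this thread suggest the permission being checked here is a CloudWatch Logs resource policy, created with a separate `aws_cloudwatch_log_resource_policy` resource, rather than the domain's `access_policies`. A minimal sketch, assuming a logs-only policy document named `es_logs` (both names are placeholders):

```hcl
# Attaches a logs-only policy document as a CloudWatch Logs resource
# policy; this is what grants es.amazonaws.com permission to create
# log streams, independently of the domain's access_policies.
resource "aws_cloudwatch_log_resource_policy" "es_logs" {
  policy_name     = "opensearch-log-publishing"
  policy_document = data.aws_iam_policy_document.es_logs.json
}
```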
Terraform CLI and Terraform AWS Provider Version
Affected Resource(s)
Terraform Configuration Files
Debug Output
Expected Behavior
The module should have run to completion and created the resources, including the es domain, on the first apply.
Actual Behavior
On first apply, Terraform exits with error:
However, if you then run `terraform apply` again, it passes without issue. In addition, it seems to function properly if you modify the Resource Access Policy to point to a more open set of permissions; something like `"arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:*"` allows it to function, but we should be able to lock the policy down more than that.
Steps to Reproduce
Run `terraform apply`; it fails. Run `terraform apply` again; it runs to completion and creates functioning resources. This makes it extremely difficult to run in CI.
Important Factoids
References
#6606 shows a solution, but it requires opening up the policy to all of CloudWatch, instead of just the log groups created for this particular resource.