hashicorp / terraform-provider-aws

The AWS Provider enables Terraform to manage AWS resources.
https://registry.terraform.io/providers/hashicorp/aws

aws_elasticsearch_domain fails on initial apply due to aws_cloudwatch_log_resource_policy #14497

Open nomeelnoj opened 4 years ago

nomeelnoj commented 4 years ago

Terraform CLI and Terraform AWS Provider Version

Terraform v0.12.24
+ provider.aws v3.0.0
+ provider.external v1.2.0
+ provider.vault v2.12.2

Affected Resource(s)

  aws_elasticsearch_domain
  aws_cloudwatch_log_resource_policy
  aws_cloudwatch_log_group

Terraform Configuration Files

data "aws_caller_identity" "current" {}

resource "aws_elasticsearch_domain" "es" {
  domain_name           = var.domain_name
  elasticsearch_version = var.elasticsearch_version

  advanced_options = var.advanced_options

  ebs_options {
    ebs_enabled = var.ebs_volume_size > 0 ? true : false
    volume_size = var.ebs_volume_size
    volume_type = var.ebs_volume_type
    iops        = var.ebs_volume_type == "IOPS" ? var.ebs_iops : null
  }

  encrypt_at_rest {
    enabled    = var.encrypt_at_rest_enabled
    kms_key_id = var.encrypt_at_rest_kms_key_id == "" ? module.kms.arn : var.encrypt_at_rest_kms_key_id
  }

  cluster_config {
    instance_count           = var.instance_count
    instance_type            = var.instance_type
    dedicated_master_enabled = var.dedicated_master_enabled
    dedicated_master_count   = var.dedicated_master_enabled ? var.dedicated_master_count : null
    dedicated_master_type    = var.dedicated_master_enabled ? var.dedicated_master_type : null
    zone_awareness_enabled   = var.zone_awareness_enabled

    zone_awareness_config {
      availability_zone_count = var.zone_awareness_enabled ? var.availability_zone_count : null
    }
  }

  node_to_node_encryption {
    enabled = var.node_to_node_encryption_enabled
  }

  vpc_options {
    security_group_ids = concat(var.security_group_ids, [aws_security_group.elasticsearch_sg.id])
    subnet_ids         = length(var.subnet_ids) > 1 ? slice(var.subnet_ids, 0, var.availability_zone_count) : var.subnet_ids
  }

  snapshot_options {
    automated_snapshot_start_hour = var.automated_snapshot_start_hour
  }

  domain_endpoint_options {
    enforce_https       = var.enforce_https
    tls_security_policy = var.tls_security_policy
  }

  dynamic "cognito_options" {
    for_each = var.cognito_options
    content {
      enabled          = cognito_options.value.enabled
      user_pool_id     = cognito_options.value.user_pool_id
      identity_pool_id = cognito_options.value.identity_pool_id
      role_arn         = cognito_options.value.role_arn
    }
  }

  dynamic "log_publishing_options" {
    for_each = { for k, v in var.log_publishing_options : k => v if lookup(v, "enabled") == true }
    content {
      enabled                  = log_publishing_options.value.enabled
      log_type                 = log_publishing_options.value.log_type
      cloudwatch_log_group_arn = aws_cloudwatch_log_group.es_logs[log_publishing_options.key].arn
    }
  }

  tags = merge(
    var.tags,
    {
      Name    = var.domain_name,
      service = var.service,
      team    = var.team,
      phi     = var.phi
    },
  )

  depends_on = [aws_iam_service_linked_role.es]
}

resource "aws_cloudwatch_log_resource_policy" "aes_cloudwatch_log_resource_policy" {
  count           = length({ for k, v in var.log_publishing_options : k => v if lookup(v, "enabled") == true }) > 0 ? 1 : 0
  policy_name     = "${title(replace(var.domain_name, "-", ""))}-CloudwatchResourcePolicy"
  policy_document = data.aws_iam_policy_document.cloudwatch.json
}

data "aws_iam_policy_document" "cloudwatch" {
  statement {
    actions = [
      "logs:PutLogEvents",
      "logs:PutLogEventsBatch",
      "logs:CreateLogStream",
    ]
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }
    resources = [
      # for k, v in aws_cloudwatch_log_group.es_logs : "${v.arn}:*" This never works
      for k, v in var.log_publishing_options : "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:/aws/aes/${var.domain_name}/${k}:*" # this almost never works, but seems to have worked once
      # "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:*" This works 100% of the time based on my tests
    ]
  }
}

resource "aws_cloudwatch_log_group" "es_logs" {
  for_each          = { for k, v in var.log_publishing_options : k => v if lookup(v, "enabled", false) == true }
  name              = "/aws/aes/${var.domain_name}/${each.key}"
  retention_in_days = lookup(each.value, "retention_in_days", 14)

  tags = merge(
    var.tags,
    {
      Name    = "/aws/aes/${var.domain_name}/${each.key}"
      service = var.service,
      team    = var.team,
      phi     = var.phi
    },
  )
}

Debug Output

Expected Behavior

The module should have run to completion and created the resources, including the es domain, on the first apply.

Actual Behavior

On first apply, Terraform exits with error:

Error: Error creating ElasticSearch domain: ValidationException: The Resource Access Policy specified for the CloudWatch Logs log group /aws/aes/example-domain/search does not grant sufficient permissions for Amazon Elasticsearch Service to create a log stream. Please check the Resource Access Policy.

However, if you then run terraform apply again, it passes without issue.

In addition, it seems to function properly if you point the Resource Access Policy at a more open set of permissions: something like "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:*" allows it to function, but we should be able to lock the policy down more than that.
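
For illustration, the widened variant would replace only the resources list in the policy document above (everything else stays as written):

data "aws_iam_policy_document" "cloudwatch" {
  statement {
    actions = [
      "logs:PutLogEvents",
      "logs:PutLogEventsBatch",
      "logs:CreateLogStream",
    ]
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }
    # Broader than the per-log-group ARNs; in my tests this variant worked 100% of the time.
    resources = [
      "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:*"
    ]
  }
}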

Steps to Reproduce

  1. terraform apply
  2. Error occurs
  3. Run terraform apply again; it runs to completion and creates functioning resources.

This makes it extremely difficult to run in CI.

Important Factoids

References

nomeelnoj commented 4 years ago

Hmm now I am even more confused. I just ran it again after many times of it failing so that I could grab the debug logs, and it is passing. My guess would be that there is some sort of race condition with the policy being available for ES to use when it is created.

nomeelnoj commented 4 years ago

One more example--ran terraform apply using the following resources in the cloudwatch_log_resource_policy document:

resources = [
  for k, v in var.log_publishing_options : "arn:aws:logs:us-east-1:${data.aws_caller_identity.current.account_id}:log-group:/aws/aes/${var.domain_name}/${k}:*"
]

and it failed with the same error. However, I then ran terraform destroy -auto-approve && terraform apply -auto-approve and it worked.

DrFaust92 commented 4 years ago

Hey @nomeelnoj, can you try adding a depends_on for the policy resource in the es resource? I think this should solve your issue. It seems the es service tries to create the domain before an appropriate policy is attached to the log group.

BTW, the resource policy needs the * at the end because es creates log streams inside the log group, and since they are dynamically created you have to use a wildcard in the policy. As long as it is scoped to a single log group, I would argue it is still secure.

Let me know if it works for you.

nomeelnoj commented 4 years ago

Hey @DrFaust92 . Originally I had

depends_on = [aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy, aws_iam_service_linked_role.es]

inside the ES domain resource, and this issue was still occurring. The first thing I did was remove it, and that actually helped (it passed once, but continued to fail after that).

I will try it again, but unfortunately it is hard to know for sure, since I have now applied and destroyed this infra about 10 times with the same configs and it has only created the cluster in maybe 3 of those runs. Since it takes so long to create and delete an ES cluster, each test can take about an hour.

nomeelnoj commented 4 years ago

Okay, so I just tested it again after adding back in the depends_on, and it seems to be working. In another 30-40 minutes, once I have had a chance to create and destroy, I will try it again. If it passes 3 times in a row I will call this resolved, but it feels VERY strange that it was previously not working with that setting and now it is. Nothing else about the code has changed...

DrFaust92 commented 4 years ago

It could be an eventual consistency issue on the ES service side, in which case a retry for the specific error you are seeing is needed here.

BTW, I think another issue here is the multiple aws_cloudwatch_log_resource_policy resources. You should only need one of these in this instance. The Terraform docs do not call this out, but it seems to be a service-level resource, not one specific to a single log group.

See https://stackoverflow.com/questions/48912529/what-resources-does-aws-cloudwatch-log-resource-policy-create

DrFaust92 commented 4 years ago

To simplify the resources in the policy, I checked that the following is a bit cleaner and gives the same result (slightly different from the v.arn approach):

resources = [
  for log_group in aws_cloudwatch_log_group.es_logs : "${log_group.arn}:*"
]

nomeelnoj commented 4 years ago

BTW, I think another issue here is the multiple aws_cloudwatch_log_resource_policy resources. You should only need one of these in this instance. The Terraform docs do not call this out, but it seems to be a service-level resource, not one specific to a single log group.

Yes, I believe this is true, but we want to manage these ES domains as 100% self-contained, so we would like to create a policy per cluster.

I just tried again with the exact same code (including the depends_on) and it failed this time. I will try changing to your simpler resource policy statement and see if that has anything to do with it.

Even a hack here would work for us--I think it takes AWS a bit to verify the policy is in place. Is there any type of hack I could put in that would force the es cluster to wait even 10 seconds before creating? Thinking about something like:

resource "null_resource" "wait_10" {
  depends_on = [aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy]

  provisioner "local-exec" {
    command     = "sleep 10"
    interpreter = ["/bin/sh", "-c"]
  }
}

Then in the es resource put a depends_on for the null resource above.
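
For example, roughly like this (a sketch that keeps the existing service-linked role dependency):

resource "aws_elasticsearch_domain" "es" {
  # ... all existing arguments unchanged ...

  depends_on = [
    null_resource.wait_10,
    aws_iam_service_linked_role.es,
  ]
}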

UPDATE: it failed again. Going to try the null resource.

DrFaust92 commented 4 years ago

Try making it 2 mins, as that is the default the provider uses to wait for IAM propagation. The hack looks good 😄. I'll try to add a retry for this case and see if it solves it.
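
As an aside, here is a minimal sketch of the same delay using the hashicorp/time provider's time_sleep resource instead of a local-exec sleep (the resource name is a placeholder, and this assumes adding the time provider is acceptable in your setup):

resource "time_sleep" "wait_for_log_policy" {
  # Give the CloudWatch Logs resource policy time to propagate before the domain is created.
  depends_on      = [aws_cloudwatch_log_resource_policy.aes_cloudwatch_log_resource_policy]
  create_duration = "120s"
}

# Then, in the aws_elasticsearch_domain resource:
#   depends_on = [time_sleep.wait_for_log_policy]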

nomeelnoj commented 4 years ago

Great, thanks! Let me know what you find out--not sure what is going on here, because in my tests, if you set the resources to a much wider value, something like:

resources = [
  "arn:aws:logs:*"
]

it always seems to work. I do not know why this would make any difference.

DrFaust92 commented 4 years ago

I tried to recreate the issue and was not able to (with and without depends_on).

Two things I did differently: I used the partition DNS suffix for the principal (can be achieved with es.${data.aws_partition.current.dns_suffix} instead of es.amazonaws.com), and I used a single policy resource, since I went back to the official docs and found this:

CloudWatch Logs supports 10 resource policies per Region. If you plan to enable slow logs for several Amazon ES domains, you should create and reuse a broader policy that includes multiple log groups to avoid reaching this limit.

It might be easier to have something like arn:aws:logs:* or maybe a bit more explicit like arn:aws:logs:{region}:{account-number}:log-group:/aws/aes/{common-prefix}*. I didn't check a policy with the latter, so I'm not sure how accurate it is.
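
A single account/region-wide policy along those lines could look roughly like this (a sketch only; the /aws/aes/ prefix and resource names are assumptions, not something I tested):

data "aws_partition" "current" {}
data "aws_region" "current" {}
data "aws_caller_identity" "current" {}

data "aws_iam_policy_document" "es_log_publishing" {
  statement {
    actions = [
      "logs:PutLogEvents",
      "logs:PutLogEventsBatch",
      "logs:CreateLogStream",
    ]
    principals {
      type        = "Service"
      identifiers = ["es.${data.aws_partition.current.dns_suffix}"]
    }
    # One broad statement covering every ES log group under a common prefix,
    # which also helps stay under the 10-resource-policies-per-Region limit.
    resources = [
      "arn:${data.aws_partition.current.partition}:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/aws/aes/*"
    ]
  }
}

resource "aws_cloudwatch_log_resource_policy" "es_log_publishing" {
  policy_name     = "es-log-publishing"
  policy_document = data.aws_iam_policy_document.es_log_publishing.json
}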

I can add the following as another retryable error for the ES service (I see there are a bunch of them already for other cases):

if isAWSErr(err, "ValidationException", "The Resource Access Policy specified for the CloudWatch Logs log group /aws/aes/example-domain/search does not grant sufficient permissions for Amazon Elasticsearch Service to create a log stream. Please check the Resource Access Policy") {
  return resource.RetryableError(err)
}

@anGie44, thoughts?

totten255 commented 4 years ago

The problem is not solved yet, as I got the error below:

Error: Error creating ElasticSearch domain: ValidationException: The Resource Access Policy specified for the CloudWatch Logs log group /aws/aes/domains/ does not grant sufficient permissions for Amazon Elasticsearch Service to create a log stream. Please check the Resource Access Policy.

Does anyone know whether this is a message customized by Terraform or the actual AWS error response? It would be helpful to understand whether it comes from AWS or not.

DrFaust92 commented 3 years ago

@totten255 This is an AWS response.

mkhaled93 commented 3 years ago

Has anyone got a fix here? I am facing the same issue, and if I use "arn:aws:logs:*" it works, so I don't know what's happening here.

McTristan commented 2 years ago

This is still happening. Using arn:aws:logs:* seems to work alright, but I can't figure out why. I've tried different dependencies and local waits - nothing helps.

aliahmedmytoys commented 1 year ago

Any updates regarding this issue?

yassinejaffoo-sanuk commented 1 year ago

Still seeing the same issue with the following policy document despite adding an explicit depends_on to the aws_opensearch_domain resource:

data "aws_iam_policy_document" "opensearch_cloudwatch_policy" {
  statement {
    actions = [
      "logs:PutLogEvents",
      "logs:PutLogEventsBatch",
      "logs:CreateLogStream",
    ]

    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }

    resources = [
      "arn:aws:logs:*",
    ]

    condition {
      test     = "StringEquals"
      variable = "aws:SourceAccount"

      values = [
        var.aws_account_id,
      ]
    }

    condition {
      test     = "ArnLike"
      variable = "aws:SourceArn"

      values = [
        var.domain_name,
      ]
    }
  }
}

Anyone able to help with this please?

rmccarthy-ellevation commented 8 months ago

Any updates regarding this issue?

bitsofinfo commented 6 months ago

same, any updates?

bss-dmitry-shmakov commented 6 months ago

I used this policy when trying to create an OpenSearch cluster:

data "aws_iam_policy_document" "main" {
  statement {
    effect = "Allow"

    principals {
      type        = "Service"
      identifiers = ["es.amazonaws.com"]
    }

    actions   = [
      "es:*",
      "logs:*"
    ]

    resources = [
      "arn:aws:es:eu-west-1:123123123:*",
      "arn:aws:logs:*"
    ]
  }
}

But the error always says it does not have enough permission to create the CloudWatch log groups - how is that possible, even with full access 🧐? Does it see the policy correctly when we point to it with access_policies = data.aws_iam_policy_document.main.json from the "aws_opensearch_domain" resource? Did anyone figure out the correct way to use the policies that way? The inline policy would have worked, I guess (the old way). What am I missing?

Maybe I should not combine the OpenSearch rules and the CloudWatch rule in the same policy? But it does work as long as I disable the log group options in the OpenSearch resource, so the policy itself is fine; the OpenSearch resource just will not accept it when log groups are enabled. Possibly a bug in the verification that runs when the aws_opensearch_domain is created? A false positive, IMHO.