terraform-aws-modules / terraform-aws-emr

Terraform module to create AWS EMR resources 🇺🇦
https://registry.terraform.io/modules/terraform-aws-modules/emr/aws
Apache License 2.0

Error deploying EMR due to `insufficient ec2 permissions` #3

Closed oc-stephen-bennett closed 1 year ago

oc-stephen-bennett commented 1 year ago

Description

Deploying via the example fails with the following error:

 Error: waiting for EMR Cluster (j-1P38LJGZQ23DK) to create: unexpected state 'TERMINATED_WITH_ERRORS', wanted target 'RUNNING, WAITING'. last error: VALIDATION_ERROR: Service role arn:aws:iam::xxx:role/oc-dev-data-science-emr-service-20230505161755695800000001 has insufficient EC2 permissions
│ 
│   with module.oc-aws-data-science.module.emr_instance_fleet.aws_emr_cluster.this[0],
│   on .terraform/modules/oc-aws-data-science.emr_instance_fleet/main.tf line 26, in resource "aws_emr_cluster" "this":
│   26: resource "aws_emr_cluster" "this" {

Terraform code used:

module "emr" {
  source  = "terraform-aws-modules/emr/aws"
  version = "1.0.0"
  name    = "${local.full_name}-emr"

  release_label_filters = {
    emr6 = {
      prefix = "emr-6"
    }
  }
  applications = ["spark"]
  auto_termination_policy = {
    idle_timeout = 3600
  }

  bootstrap_action = {
    example = {
      name = "Just an example",
      path = "file:/bin/echo",
      args = ["Hello World!"]
    }
  }

  configurations_json = jsonencode([
    {
      "Classification" : "spark-env",
      "Configurations" : [
        {
          "Classification" : "export",
          "Properties" : {
            "JAVA_HOME" : "/usr/lib/jvm/java-1.8.0"
          }
        }
      ],
      "Properties" : {}
    }
  ])

  master_instance_group = {
    name           = "master-group"
    instance_count = 1
    instance_type  = "m5.xlarge"
  }

  core_instance_group = {
    name           = "core-group"
    instance_count = 2
    instance_type  = "c4.large"
  }

  task_instance_group = {
    name           = "task-group"
    instance_count = 2
    instance_type  = "c5.xlarge"
    bid_price      = "0.1"

    ebs_config = {
      size                 = 64
      type                 = "gp3"
      volumes_per_instance = 1
    }
    ebs_optimized = true
  }

  ebs_root_volume_size = 64
  ec2_attributes = {
    subnet_id = data.aws_subnets.intra.ids[0]
  }
  vpc_id = data.aws_vpc.this.id

  keep_job_flow_alive_when_no_steps = true
  list_steps_states                 = ["PENDING", "RUNNING", "CANCEL_PENDING", "CANCELLED", "FAILED", "INTERRUPTED", "COMPLETED"]
  log_uri                           = "s3://${var.s3_prevent_destroy == true ? aws_s3_bucket.oc-aws-data-science[0].id : aws_s3_bucket.oc-aws-data-science-destroy[0].id}/emr-logs/"

  scale_down_behavior    = "TERMINATE_AT_TASK_COMPLETION"
  step_concurrency_level = 3
  termination_protection = false
  visible_to_all_users   = true

}

versions.tf

terraform {
  required_version = ">= 1.1.5"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.64"
    }
  }
}

asafpelegcodes commented 1 year ago

I'm encountering this issue as well.

ChewingGlass commented 1 year ago

I'm also encountering this error

asafpelegcodes commented 1 year ago

After some debugging, I worked around the issue with the following amendment to the example:

data "aws_iam_policy_document" "create_in_network" {
  statement {
    sid = "CreateInNetwork"
    actions = [
      "ec2:CreateNetworkInterface",
      "ec2:RunInstances",
      "ec2:CreateFleet",
      "ec2:CreateLaunchTemplate",
      "ec2:CreateLaunchTemplateVersion",
    ]

    resources = ["arn:aws:ec2:*:*:subnet/${PRIVATE_SUBNET_YOUR_EMR_CLUSTER_IS_USING}"]
  }
}

resource "aws_iam_policy" "emr_create_in_network" {

  name        = "emr_create_in_network"
  description = "extra policy for EMR cluster setup"

  policy = data.aws_iam_policy_document.create_in_network.json
}

module "emr" {
  source = "terraform-aws-modules/emr/aws"
  version = "1.0.0"
  ...

  ec2_attributes = {
    # Instance groups only support one Subnet/AZ
    # Subnets should be private subnets and tagged with
    # { "for-use-with-amazon-emr-managed-policies" = true }
    subnet_id = PRIVATE_SUBNET_YOUR_EMR_CLUSTER_IS_USING
  }
  ...

  service_iam_role_policies = {
    AmazonEMRServicePolicy_v2 = "arn:aws:iam::aws:policy/service-role/AmazonEMRServicePolicy_v2" # THIS IS THE DEFAULT VALUE FOR THIS ATTRIBUTE
    CreateInNetwork           = aws_iam_policy.emr_create_in_network.arn                         # THIS FIXES THE CLUSTER FAILURE
  }
}

bryantbiggs commented 1 year ago

I suspect this is related to the v2 managed policies - without a full reproduction it will be difficult to tell though.

Have you all enabled the appropriate tag on the subnets used/passed to EMR? https://github.com/terraform-aws-modules/terraform-aws-emr/blob/d987b8d45038f8424896aa68e632f7570a19bdc0/examples/private-cluster/main.tf#L265-L270

See the main README, just before the Usage section: [image]
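
For subnets created in the same configuration, the tag can be set where the subnets are defined. A minimal sketch, assuming the subnets come from the terraform-aws-modules/vpc/aws module; the version constraint, name, CIDR ranges and availability zones are placeholders, not values from this issue:

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # assumed version constraint

  # Placeholder network layout for illustration only
  name            = "emr-example"
  cidr            = "10.0.0.0/16"
  azs             = ["us-east-1a", "us-east-1b"]
  private_subnets = ["10.0.1.0/24", "10.0.2.0/24"]

  # Tag required on subnets used with the EMR v2 managed policies
  private_subnet_tags = {
    "for-use-with-amazon-emr-managed-policies" = "true"
  }
}
```

The subnet passed to `ec2_attributes.subnet_id` would then be one of `module.vpc.private_subnets`.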

bryantbiggs commented 1 year ago

Is anyone able to confirm whether the above guidance solves their permission issues?

bodkeyogesh commented 1 year ago

Hello @bryantbiggs, you are correct. After applying the tag to the private subnet, I was able to resolve the insufficient EC2 permissions issue.

Thanks.
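
When the subnet already exists and is only looked up through a data source, as in the original report, the tag can be attached separately. A minimal sketch, assuming the `data.aws_subnets.intra` lookup from the configuration above and that no other Terraform resource manages the same subnet's tags (the resource name `emr_subnet` is hypothetical):

```hcl
# Adds the single tag the EMR v2 managed policy looks for,
# without taking over management of the subnet itself.
resource "aws_ec2_tag" "emr_subnet" {
  resource_id = data.aws_subnets.intra.ids[0]
  key         = "for-use-with-amazon-emr-managed-policies"
  value       = "true"
}
```

If the subnet's tags are managed by an `aws_subnet` resource elsewhere, it is probably cleaner to add the tag there instead, since the two would otherwise fight over the same tag.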

bryantbiggs commented 1 year ago

Any suggestions on how to better surface this in the docs? I'm open to ideas
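
One option on the user side is to make the missing tag fail at plan time rather than at cluster creation. A rough sketch, assuming Terraform 1.2 or later (for `postcondition`) and a hypothetical `emr_subnet_id` variable holding the subnet passed to the module:

```hcl
# Hypothetical input: the ID of the subnet handed to the EMR module
variable "emr_subnet_id" {
  type = string
}

data "aws_subnet" "emr" {
  id = var.emr_subnet_id

  lifecycle {
    # Fails the plan if the subnet lacks the tag the v2 managed policy requires
    postcondition {
      condition     = lookup(self.tags, "for-use-with-amazon-emr-managed-policies", "") == "true"
      error_message = "The EMR subnet must be tagged for-use-with-amazon-emr-managed-policies = true."
    }
  }
}
```

This does not change the module; it only surfaces the requirement earlier, with a clearer message than the `VALIDATION_ERROR` returned by EMR.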

oc-stephen-bennett commented 1 year ago

Yes, I finally put two and two together and saw that it was that policy. I would question the merit of adding the tagging condition, though.

bryantbiggs commented 1 year ago

To be clear, this comes from Amazon and how they have scoped the permissions. In this module I have tagged all the relevant resources accordingly, but I cannot ensure that the appropriate networking resources are tagged for the intended architecture, since those are outside of this module.
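
For context, the tag-scoped statement in question has roughly the shape below. This is an illustrative sketch of the relevant part of `AmazonEMRServicePolicy_v2`, rendered as a Terraform policy document and not the verbatim managed policy; the data source name is hypothetical and the policy published by AWS is authoritative:

```hcl
data "aws_iam_policy_document" "emr_v2_scoping_sketch" {
  statement {
    sid    = "CreateInNetwork"
    effect = "Allow"
    actions = [
      "ec2:CreateNetworkInterface",
      "ec2:RunInstances",
      "ec2:CreateFleet",
      "ec2:CreateLaunchTemplate",
      "ec2:CreateLaunchTemplateVersion",
    ]
    resources = [
      "arn:aws:ec2:*:*:subnet/*",
      "arn:aws:ec2:*:*:security-group/*",
    ]

    # The actions are only allowed on resources that carry this tag,
    # which is why an untagged subnet surfaces as "insufficient EC2 permissions".
    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/for-use-with-amazon-emr-managed-policies"
      values   = ["true"]
    }
  }
}
```

The workaround posted earlier in this thread grants the same actions on the subnet without the tag condition, which is why it also unblocks cluster creation.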

bryantbiggs commented 1 year ago

Closing this for now. Please feel free to provide feedback on how we can improve the documentation to make this functionality clearer to users in the future.
