hashicorp / packer-plugin-amazon

Packer plugin for Amazon AMI Builder
https://www.packer.io/docs/builders/amazon
Mozilla Public License 2.0

amazon-ssm-agent service fails connecting to SSM due to eventual consistency #503

Open gnought opened 2 months ago

gnought commented 2 months ago

With the sample config below, the policy defined in temporary_iam_instance_profile_policy_document may not be visible immediately after an EC2 instance starts, because PutRolePolicy and AddRoleToInstanceProfile are eventually consistent. As a result, the amazon-ssm-agent service may fail to connect to SSM because the required SSM role is not available yet. Recovering requires either logging into the instance and restarting the service manually, or waiting about 30 minutes for the agent to retry on its own (the agent log below shows it sleeping for 27m6s before the next attempt). Please see the packer build log and the EC2 amazon-ssm-agent log below.

This PR automatically creates a custom instance profile with the AmazonSSMManagedInstanceCore managed policy attached when session_manager is used and the iam_instance_profile attribute is not specified. If a user defines temporary_iam_instance_profile_policy_document, it is added to the custom profile as an inline policy. This resolves the race condition, so the amazon-ssm-agent service can connect to SSM consistently on its first start.
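
Since the change above only applies when iam_instance_profile is not set, one possible interim workaround is to point the build at an instance profile that was created ahead of time, so its role and policy have already propagated before the instance boots. The fragment below is only a sketch under that assumption: the profile name "packer-ssm-profile" is hypothetical and is assumed to reference a role that already has AmazonSSMManagedInstanceCore attached. This has not been verified against the failure in this report.

```hcl
// Sketch only: "packer-ssm-profile" is a hypothetical, pre-existing instance
// profile whose role already has AmazonSSMManagedInstanceCore attached.
ssh_interface           = "session_manager"
temporary_key_pair_type = "ed25519"
iam_instance_profile    = "packer-ssm-profile"
```

Because nothing is created at build time in this variant, there is no PutRolePolicy or AddRoleToInstanceProfile call for the instance to race against; the trade-off is that the profile has to be managed outside of Packer.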

As a bonus, this PR also supports AWS China region, closing https://github.com/hashicorp/packer-plugin-amazon/issues/50

Sample config:

ssh_interface           = "session_manager"
temporary_key_pair_type = "ed25519"
temporary_key_pair_bits = 384
// copied from the AmazonSSMManagedInstanceCore managed policy
temporary_iam_instance_profile_policy_document {
  Version = "2012-10-17"
  Statement {
    Action = [
      "ssm:DescribeAssociation",
      "ssm:GetDeployablePatchSnapshotForInstance",
      "ssm:GetDocument",
      "ssm:DescribeDocument",
      "ssm:GetManifest",
      "ssm:GetParameter",
      "ssm:GetParameters",
      "ssm:ListAssociations",
      "ssm:ListInstanceAssociations",
      "ssm:PutInventory",
      "ssm:PutComplianceItems",
      "ssm:PutConfigurePackageResult",
      "ssm:UpdateAssociationStatus",
      "ssm:UpdateInstanceAssociationStatus",
      "ssm:UpdateInstanceInformation"
    ]
    Effect   = "Allow"
    Resource = ["*"]
  }
  Statement {
    Action = [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel"
    ]
    Effect   = "Allow"
    Resource = ["*"]
  }
  Statement {
    Action = [
      "ec2messages:AcknowledgeMessage",
      "ec2messages:DeleteMessage",
      "ec2messages:FailMessage",
      "ec2messages:GetEndpoint",
      "ec2messages:GetMessages",
      "ec2messages:SendReply"
    ]
    Effect   = "Allow"
    Resource = ["*"]
  }
}

packer build log:

2024/08/26 00:56:29 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:29 Retryable error: TargetNotConnected: i-011a46c740a76676e is not connected.
2024/08/26 00:56:31 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:31 [DEBUG] TCP connection to SSH ip/port failed: dial tcp [::1]:8973: connect: connection refused
2024/08/26 00:56:36 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:36 [DEBUG] TCP connection to SSH ip/port failed: dial tcp [::1]:8973: connect: connection refused
2024/08/26 00:56:41 packer-plugin-amazon_v1.3.3-dev_x5.0_darwin_arm64 plugin: 2024/08/26 00:56:41 [DEBUG] TCP connection to SSH ip/port failed: dial tcp [::1]:8973: connect: connection refused

The EC2 amazon-ssm-agent log:

status code: 404, request id:
2024-08-25 16:54:21 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
    status code: 400, request id: 906a00a0-9eec-42b7-b385-xxxxxxxxx
2024-08-25 16:54:21 ERROR [CredentialRefresher] Retrieve credentials produced error: no valid credentials could be retrieved for ec2 identity. Default Host Management Err: error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 1234567890
    status code: 400, request id: 906a00a0-9eec-42b7-b385-xxxxxxxxx
2024-08-25 16:54:21 INFO [CredentialRefresher] Sleeping for 27m6s before retrying retrieve credentials