subspace / infra

7 stars 4 forks source link

add ephemeral runners #302

Open DaMandal0rian opened 3 months ago

DaMandal0rian commented 3 months ago

User description

Terraform module Self-Hosted Scalable GitHub Actions runners on AWS based on the official Philips AWS Module. https://github.com/philips-labs/terraform-aws-github-runner under MIT License.

This PR removes some features and modules not required for our infra use case and makes use of more granular modules.

Scale up and down based on GitHub events Scale down to zero when no jobs are running Runners are created on-demand and terminated after use (ephemeral runners) Runners are created on spot instances use custom AMI, define the instance types and subnets to use. OS support: Linux (x64/arm64) and Windows CI workflow for deployment


Type

enhancement


Description


Changes walkthrough

Relevant files
Configuration changes
variables.tf
Add Variables for Configuring AWS, GitHub App, Runners, and Lambdas

github-runners/terraform/autoscaling/variables.tf
  • Added a comprehensive list of variables for configuring AWS region,
    VPC, subnets, tags, GitHub app parameters, runner settings, and more.
  • Introduced variables for configuring ephemeral runners, including
    enabling ephemeral runners, job queued check, and managed runner
    security group.
  • Added variables for configuring lambda functions, including memory
    size, timeout, and S3 bucket details for lambda functions.
  • Included variables for AMI configuration, instance types, and runner
    OS support.
  • +886/-0 
    main.tf
    Configure AWS Resources, Runners, and Lambda Functions     

    github-runners/terraform/autoscaling/main.tf
  • Setup for AWS resources including SQS queues, SSM parameters, webhook
    module, and runners module.
  • Configuration for ephemeral runners and pool management.
  • Defined resources for logging, IAM roles, and lambda functions.
  • +394/-0 
    main.tf
    Setup AWS Launch Template and Security Groups for Runners

    github-runners/terraform/autoscaling/modules/runners/main.tf
  • Defined AWS launch template for runners with configurations for AMI,
    instance type, security groups, and user data.
  • Setup security groups for runners with egress rules.
  • Configuration for IAM roles and policies for runners.
  • +229/-0 
    variables.tf
    Add Variables for Runner Binaries Syncer Lambda Configuration

    github-runners/terraform/autoscaling/modules/runner-binaries-syncer/variables.tf
  • Introduced variables for configuring the runner binaries syncer lambda
    function.
  • Variables include configurations for S3 bucket, lambda function
    details, and logging.
  • +238/-0 
    runner-binaries-syncer.tf
    Configure Runner Binaries Syncer Lambda and Trigger           

    github-runners/terraform/autoscaling/modules/runner-binaries-syncer/runner-binaries-syncer.tf
  • Setup for runner binaries syncer lambda function including IAM role,
    policies, and cloudwatch event rule.
  • Configuration for S3 object trigger to execute the lambda after
    deployment.
  • +191/-0 
    main.tf
    Setup Lambda Function for Runner Pool Management                 

    github-runners/terraform/autoscaling/modules/runners/pool/main.tf
  • Defined AWS lambda function for managing runner pools.
  • Setup IAM roles and policies for the lambda function.
  • Configuration for cloudwatch event rules to trigger the lambda based
    on schedule.
  • +179/-0 
    variables.tf
    Add Variables for Termination Watcher Lambda Configuration

    github-runners/terraform/autoscaling/modules/termination-watcher/variables.tf
  • Introduced variables for configuring the termination watcher lambda
    function.
  • Variables include configurations for lambda function details, logging,
    and environment variables.
  • +65/-0   

    PR-Agent usage: Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    github-actions[bot] commented 3 months ago

    PR Description updated to latest commit (https://github.com/subspace/infra/commit/2656d95a4cf69b300e27d9326e77f80b78f19ac0)

    github-actions[bot] commented 3 months ago

    PR Review

    ⏱️ Estimated effort to review [1-5] 5, due to the extensive number of files and changes involved in this PR, including multiple Terraform modules, Packer configurations, and Lambda function implementations. The complexity and interdependencies between these components significantly increase the review effort required to ensure correctness, security, and adherence to best practices.
    🧪 Relevant tests No
    🔍 Possible issues Possible Security Concern: The PR introduces numerous AWS resources and Lambda functions. Without proper IAM policies and security group configurations, there might be potential security risks, such as overly permissive access or unintended exposure of resources.
    Possible Performance Issue: The use of ephemeral runners and on-demand scaling based on GitHub events can lead to unpredictable costs and performance implications under high load or misconfiguration.
    Dependency Management: The PR includes external dependencies (e.g., GitHub Actions runner binaries, Docker, AWS CLI). Changes in these dependencies could impact the stability and functionality of the setup.
    🔒 Security concerns - IAM Policies: Ensure that IAM roles and policies assigned to Lambda functions and EC2 instances follow the principle of least privilege to minimize security risks. - Public Exposure: Verify that security groups and network access controls are correctly configured to prevent unauthorized access to resources.
    Code feedback:
    relevant filegithub-runners/terraform/autoscaling/modules/runners/main.tf
    suggestion       Consider adding lifecycle hooks to the `aws_launch_template` resource to manage updates and replacements more gracefully. This can help in ensuring that instances are only replaced according to the defined policies, minimizing disruptions. [important]
    relevant lineresource "aws_launch_template" "runner" {

    relevant filegithub-runners/terraform/autoscaling/modules/runner-binaries-syncer/runner-binaries-syncer.tf
    suggestion       Implement error handling and retry mechanisms in the Lambda function responsible for syncing runner binaries. This will improve the robustness of the setup by handling transient errors that may occur during the download or update process. [important]
    relevant lineresource "aws_lambda_function" "syncer" {

    relevant filegithub-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy/github_agent.ubuntu.pkr.hcl
    suggestion       Optimize the Packer build by using a more specific source AMI filter to reduce build times and potential costs. Narrowing down the search to a more recent base AMI can also ensure that the runners are built on a more secure and updated OS. [medium]
    relevant linesource_ami_filter {

    relevant filegithub-runners/terraform/autoscaling/modules/runners/pool/main.tf
    suggestion       Add monitoring and alerting for the Lambda function that adjusts the runner pool size. Monitoring metrics such as invocation errors, duration, and throttles can help in identifying issues early and maintaining the desired state of runner pools. [medium]
    relevant lineresource "aws_lambda_function" "pool" {


    ✨ Review tool usage guide:
    **Overview:** The `review` tool scans the PR code changes, and generates a PR review which includes several types of feedbacks, such as possible PR issues, security threats and relevant test in the PR. More feedbacks can be [added](https://pr-agent-docs.codium.ai/tools/review/#general-configurations) by configuring the tool. The tool can be triggered [automatically](https://pr-agent-docs.codium.ai/usage-guide/automations_and_usage/#github-app-automatic-tools-when-a-new-pr-is-opened) every time a new PR is opened, or can be invoked manually by commenting on any PR. - When commenting, to edit [configurations](https://github.com/Codium-ai/pr-agent/blob/main/pr_agent/settings/configuration.toml#L23) related to the review tool (`pr_reviewer` section), use the following template: ``` /review --pr_reviewer.some_config1=... --pr_reviewer.some_config2=... ``` - With a [configuration file](https://pr-agent-docs.codium.ai/usage-guide/configuration_options/), use the following template: ``` [pr_reviewer] some_config1=... some_config2=... ``` See the review [usage page](https://pr-agent-docs.codium.ai/tools/review/) for a comprehensive guide on using this tool.
    github-actions[bot] commented 3 months ago

    PR Code Suggestions

    CategorySuggestions                                                                                                                                                       
    Enhancement
    Use dynamic blocks for tags to improve code maintainability. ___ **Consider using a dynamic block for tags to improve maintainability and readability. This
    approach allows you to easily add or remove tags without modifying the structure of each
    resource definition.** [github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [25-27]](https://github.com/subspace/infra/pull/302/files#diff-8d5361c04f1c28f11368d39d585dd8e835295a11af3965041d9ed5de301a4319R25-R27) ```diff -tags = { - Project = "subspace-scale-runners" +dynamic "tags" { + for_each = { + Project = "subspace-scale-runners" + # Add more tags here + } + content { + key = tags.key + value = tags.value + } } ```
    Change associate_public_ip_address variable type to boolean for clarity. ___ **For the associate_public_ip_address variable, it's more appropriate to use a boolean type
    instead of a string since the description suggests a true/false value. This change
    improves clarity and enforces correct input values.** [github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [33-37]](https://github.com/subspace/infra/pull/302/files#diff-36f4c64d06ada4a3d8018594aa00af8ddb264389da8c726d97dcfdd31d54be87R33-R37) ```diff variable "associate_public_ip_address" { description = "If using a non-default VPC, there is no public IP address assigned to the EC2 instance. If you specified a public subnet, you probably want to set this to true. Otherwise the EC2 instance won't have access to the internet" - type = string + type = bool default = null } ```
    Use multiple subnets for Lambda functions for higher availability. ___ **To ensure high availability and fault tolerance, consider specifying multiple subnet IDs
    for the Lambda functions by using a list of subnets rather than a single subnet. This
    allows the Lambda functions to be deployed across multiple Availability Zones.** [github-runners/terraform/autoscaling/main.tf [245]](https://github.com/subspace/infra/pull/302/files#diff-69de1c5f1c6562547995b664c547cd9ebd9460c2ad1c756cb4c2f9482bdf73feR245-R245) ```diff -lambda_subnet_ids = var.lambda_subnet_ids +lambda_subnet_ids = var.lambda_subnet_ids_list ```
    Add a condition to ensure s3_bucket, s3_key, and s3_object_version are only set when necessary. ___ **To ensure that the lambda function only uses the S3 bucket when necessary, consider adding
    a condition to check both var.lambda_s3_bucket and var.lambda_s3_key are not null before
    setting s3_bucket, s3_key, and s3_object_version. This will make the configuration more
    robust and avoid potential misconfigurations.** [github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [7-9]](https://github.com/subspace/infra/pull/302/files#diff-1a780d161895dc12d02dbf91b2df5df854fb2ffe2a469dfb7b8d4933313cbeeeR7-R9) ```diff -s3_bucket = var.lambda_s3_bucket != null ? var.lambda_s3_bucket : null +s3_bucket = var.lambda_s3_bucket != null && var.lambda_s3_key != null ? var.lambda_s3_bucket : null s3_key = var.lambda_s3_key != null ? var.lambda_s3_key : null -s3_object_version = var.lambda_s3_object_version != null ? var.lambda_s3_object_version : null +s3_object_version = var.lambda_s3_object_version != null && var.lambda_s3_bucket != null && var.lambda_s3_key != null ? var.lambda_s3_object_version : null ```
    Best practice
    Parameterize instance types for flexibility. ___ **To ensure the scalability and flexibility of your Terraform configuration, consider
    parameterizing the instance_types to allow different instance types for different
    environments or use cases.** [github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [49]](https://github.com/subspace/infra/pull/302/files#diff-8d5361c04f1c28f11368d39d585dd8e835295a11af3965041d9ed5de301a4319R49-R49) ```diff -instance_types = ["m6a.4xlarge", "c6a.4xlarge"] +instance_types = var.instance_types ```
    Avoid hardcoding AWS region for security and flexibility. ___ **To avoid potential security risks, consider removing the hardcoded AWS region and instead
    use a variable. This will also increase the flexibility of your Terraform configuration.** [github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [3]](https://github.com/subspace/infra/pull/302/files#diff-8d5361c04f1c28f11368d39d585dd8e835295a11af3965041d9ed5de301a4319R3-R3) ```diff -aws_region = "us-west-2" +aws_region = var.aws_region ```
    Use a more specific version constraint for the amazon plugin. ___ **It's recommended to use a more specific version constraint for the amazon plugin to avoid
    potential compatibility issues with future versions. Instead of ">= 0.0.2", consider
    specifying a range that caps the major version, such as "~> 0.0.2" which allows patch
    updates but prevents major version changes.** [github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [3-5]](https://github.com/subspace/infra/pull/302/files#diff-36f4c64d06ada4a3d8018594aa00af8ddb264389da8c726d97dcfdd31d54be87R3-R5) ```diff amazon = { - version = ">= 0.0.2" + version = "~> 0.0.2" source = "github.com/hashicorp/amazon" } ```
    Use the null coalescing operator for default values to improve code readability. ___ **For better maintainability and readability, consider using Terraform's null coalescing
    operator instead of ternary operators for default values. This operator simplifies the
    expression and makes the code cleaner.** [github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [2-3]](https://github.com/subspace/infra/pull/302/files#diff-1a780d161895dc12d02dbf91b2df5df854fb2ffe2a469dfb7b8d4933313cbeeeR2-R3) ```diff -lambda_zip = var.lambda_zip == null ? "${path.module}/../../lambdas/functions/ami-housekeeper/ami-housekeeper.zip" : var.lambda_zip -role_path = var.role_path == null ? "/${var.prefix}/" : var.role_path +lambda_zip = var.lambda_zip ?? "${path.module}/../../lambdas/functions/ami-housekeeper/ami-housekeeper.zip" +role_path = var.role_path ?? "/${var.prefix}/" ```
    Maintainability
    Use a local variable for webhook_secret to reduce duplication. ___ **For better maintainability and to avoid duplication, consider using a local variable or a
    module output for common values such as webhook_secret instead of repeating the
    random_id.random.hex in multiple places.** [github-runners/terraform/autoscaling/gh-runners/ephemeral/main.tf [32]](https://github.com/subspace/infra/pull/302/files#diff-8d5361c04f1c28f11368d39d585dd8e835295a11af3965041d9ed5de301a4319R32-R32) ```diff -webhook_secret = random_id.random.hex +webhook_secret = local.webhook_secret ```
    Move variable definitions to a separate file for better maintainability. ___ **To improve the maintainability and readability of the code, consider using a separate
    variables file for defining variables. This helps in separating the configuration from the
    template logic and makes the template cleaner.** [github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [10-12]](https://github.com/subspace/infra/pull/302/files#diff-36f4c64d06ada4a3d8018594aa00af8ddb264389da8c726d97dcfdd31d54be87R10-R12) ```diff -variable "runner_version" { - description = "The version (no v prefix) of the runner software to install https://github.com/actions/runner/releases. The latest release will be fetched from GitHub if not provided." - default = null -} +# Variables should be defined in a separate file, e.g., variables.tf ```
    Use a variable for Terraform required version. ___ **To improve maintainability and avoid hardcoding, consider using a Terraform variable for
    the required_version of Terraform in versions.tf. This allows for easier updates and
    version management.** [github-runners/terraform/autoscaling/main.tf [2]](https://github.com/subspace/infra/pull/302/files#diff-69de1c5f1c6562547995b664c547cd9ebd9460c2ad1c756cb4c2f9482bdf73feR2-R2) ```diff -required_version = ">= 1" +required_version = var.terraform_required_version ```
    Use a separate file for IAM policies and roles to improve maintainability. ___ **To improve the maintainability of your Terraform code, consider using a separate file for
    IAM policies and roles. This approach makes it easier to manage and review IAM policies,
    especially as they grow in complexity.** [github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [57-63]](https://github.com/subspace/infra/pull/302/files#diff-1a780d161895dc12d02dbf91b2df5df854fb2ffe2a469dfb7b8d4933313cbeeeR57-R63) ```diff +# In a separate file, e.g., iam.tf resource "aws_iam_role" "ami_housekeeper" { name = "${var.prefix}-ami-housekeeper-role" assume_role_policy = data.aws_iam_policy_document.lambda_assume_role_policy.json path = local.role_path permissions_boundary = var.role_permissions_boundary tags = var.tags } ```
    Security
    Review security practices when using public IP addresses. ___ **For better security practices, avoid using public IP addresses if not necessary. If
    associate_public_ip_address is set to true, ensure that the instances are secured and only
    necessary ports are exposed.** [github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [104]](https://github.com/subspace/infra/pull/302/files#diff-36f4c64d06ada4a3d8018594aa00af8ddb264389da8c726d97dcfdd31d54be87R104-R104) ```diff -associate_public_ip_address = var.associate_public_ip_address +# Ensure security measures are in place if using public IP addresses ```
    Dynamically fetch AMI names using AWS SSM Parameter Store for enhanced security. ___ **To ensure the EC2 instances are using the latest and most secure AMIs, consider
    dynamically fetching the source_ami_filter name using AWS SSM Parameter Store instead of
    hardcoding the AMI name pattern.** [github-runners/terraform/autoscaling/gh-runners/images/ubuntu-jammy-arm64/github_agent.ubuntu.pkr.hcl [107-115]](https://github.com/subspace/infra/pull/302/files#diff-36f4c64d06ada4a3d8018594aa00af8ddb264389da8c726d97dcfdd31d54be87R107-R115) ```diff source_ami_filter { filters = { - name = "*ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-arm64-server-*" + name = "{{ data.aws_ssm_parameter.ubuntu_ami.value }}" root-device-type = "ebs" virtualization-type = "hvm" } most_recent = true owners = ["099720109477"] } ```
    Restrict SQS queue policy actions to only necessary ones. ___ **Consider using a more restrictive condition for the SQS queue policy to limit the actions
    to only what's necessary, rather than allowing all SQS actions ("sqs:*"). This follows the
    principle of least privilege, enhancing the security of your Terraform configuration.** [github-runners/terraform/autoscaling/main.tf [33-35]](https://github.com/subspace/infra/pull/302/files#diff-69de1c5f1c6562547995b664c547cd9ebd9460c2ad1c756cb4c2f9482bdf73feR33-R35) ```diff actions = [ - "sqs:*" + "sqs:SendMessage", + "sqs:ReceiveMessage", + "sqs:DeleteMessage", + "sqs:GetQueueAttributes" ] ```
    Enable server-side encryption on SQS queues by default. ___ **For better security, consider enabling server-side encryption on the SQS queues by
    default, unless there's a specific reason to disable it. This can be done by setting
    sqs_managed_sse_enabled to true in the SQS resource configurations.** [github-runners/terraform/autoscaling/main.tf [73]](https://github.com/subspace/infra/pull/302/files#diff-69de1c5f1c6562547995b664c547cd9ebd9460c2ad1c756cb4c2f9482bdf73feR73-R73) ```diff -sqs_managed_sse_enabled = var.queue_encryption.sqs_managed_sse_enabled +sqs_managed_sse_enabled = true ```
    Specify specific ARNs in IAM policy resources instead of using a wildcard. ___ **It's recommended to avoid using a broad wildcard ("*") for IAM policy resources when
    possible. Specify the ARNs of the resources that the policy should apply to, enhancing the
    security by following the principle of least privilege.** [github-runners/terraform/autoscaling/main.tf [38-39]](https://github.com/subspace/infra/pull/302/files#diff-69de1c5f1c6562547995b664c547cd9ebd9460c2ad1c756cb4c2f9482bdf73feR38-R39) ```diff resources = [ - "*" + "arn:aws:sqs:region:account-id:queue-name" ] ```
    Specify a more restrictive assume_role_policy for enhanced security. ___ **To enhance security, consider specifying a more restrictive assume_role_policy for the
    aws_iam_role resource. The current policy allows any Lambda function to assume this role.
    You should restrict it to only the Lambda functions that need it.** [github-runners/terraform/autoscaling/modules/ami-housekeeper/main.tf [67-73]](https://github.com/subspace/infra/pull/302/files#diff-1a780d161895dc12d02dbf91b2df5df854fb2ffe2a469dfb7b8d4933313cbeeeR67-R73) ```diff statement { actions = ["sts:AssumeRole"] principals { type = "Service" identifiers = ["lambda.amazonaws.com"] } + condition { + StringEquals = { + "aws:SourceArn": aws_lambda_function.ami_housekeeper.arn + } + } } ```

    ✨ Improve tool usage guide:
    **Overview:** The `improve` tool scans the PR code changes, and automatically generates suggestions for improving the PR code. The tool can be triggered [automatically](https://pr-agent-docs.codium.ai/usage-guide/automations_and_usage/#github-app-automatic-tools-when-a-new-pr-is-opened) every time a new PR is opened, or can be invoked manually by commenting on a PR. - When commenting, to edit [configurations](https://github.com/Codium-ai/pr-agent/blob/main/pr_agent/settings/configuration.toml#L78) related to the improve tool (`pr_code_suggestions` section), use the following template: ``` /improve --pr_code_suggestions.some_config1=... --pr_code_suggestions.some_config2=... ``` - With a [configuration file](https://pr-agent-docs.codium.ai/usage-guide/configuration_options/), use the following template: ``` [pr_code_suggestions] some_config1=... some_config2=... ``` See the improve [usage page](https://pr-agent-docs.codium.ai/tools/improve/) for a comprehensive guide on using this tool.