ministryofjustice / modernisation-platform

A place for the core work of the Modernisation Platform • This repository is defined and managed in Terraform
https://user-guide.modernisation-platform.service.justice.gov.uk
MIT License

SPIKE: Patching ECS/EKS nodes #2413

Closed davidkelliott closed 3 months ago

davidkelliott commented 1 year ago

User Story

As a modernisation platform engineer
I want customers to use the most recent AMIs with their clusters
So that they are using up-to-date software

User Type(s)

• Analytical Platform users
• Data Platform users
• Performance Monitoring
• Other potential platform customers on MP

Value

Where ECS or EKS use EC2 instances, we need to ensure that they are using the latest recommended versions. We will start with investigating how we find out the latest versions and make users aware of this, then how we make these upgrades at a platform level if needed.

Questions / Assumptions / Hypothesis

Has this already been covered with the new ECS module raised after this issue was created? If so, is it just a question of migrating legacy users across?

Proposal

This story is about finding out where customers are not making use of up-to-date AMI images for ECS/EKS - for example, where they're hard coding the AMI rather than retrieving the latest version with a data call. It's a bit more free-form than that because this is a spike, but that's my interpretation.

Definition of done

Reference

How to write good user stories

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 90 days with no activity.

dms1981 commented 4 months ago

Following some discussion, we think this story is around the use of up-to-date AMI images for ECS/EKS containers.

richgreen-moj commented 3 months ago

I've gone through the code in Modernisation Platform Environments and created a spreadsheet documenting the use of EKS/ECS, noting where hard-coded AMI values are being used.

ECS

EKS
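A manual review like this could be partly automated. The following is a minimal sketch, not the method used for the spreadsheet: it assumes hard-coded AMIs appear as literal `ami-...` strings in `*.tf` files, and the function names and the checkout path passed to `scan_repo` are hypothetical.

```python
import re
from pathlib import Path

# AMI IDs are "ami-" followed by 8 (older format) or 17 hexadecimal characters.
AMI_ID = re.compile(r"\bami-(?:[0-9a-f]{17}|[0-9a-f]{8})\b")


def find_hardcoded_amis(text: str) -> list[tuple[int, str]]:
    """Return (line_number, ami_id) pairs for AMI ID literals found in the text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in AMI_ID.finditer(line):
            hits.append((lineno, match.group(0)))
    return hits


def scan_repo(root: str) -> dict[str, list[tuple[int, str]]]:
    """Scan every *.tf file under root and report files containing AMI literals."""
    results = {}
    for path in Path(root).rglob("*.tf"):
        hits = find_hardcoded_amis(path.read_text(encoding="utf-8"))
        if hits:
            results[str(path)] = hits
    return results
```

A literal match only flags candidates; AMI IDs passed in through variables or tfvars files would need a separate check.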

richgreen-moj commented 3 months ago

Here's a blog post with some template code for automating the update of EC2 instances in an auto scaling group that hosts ECS services: https://aws.amazon.com/blogs/industries/automate-patching-by-replacing-amazon-ecs-container-instances/

Essentially, it looks up the latest version of the ECS-optimised AMI for your desired platform and then updates the launch template with the new value. Nodes are drained and taken offline one by one to avoid downtime.

richgreen-moj commented 3 months ago

Retrieving latest AMIs:

ECS

The ECS TF module uses a data call to retrieve the latest ECS-optimised AMI image by querying the Systems Manager Parameter Store API. https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/examples/ec2-autoscaling/main.tf#L162C3-L165

The retrieved value is then used to set the image ID for the ECS auto scaling group https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/examples/ec2-autoscaling/main.tf#L296

Members could make use of this module, or build the same lookup into their own code, rather than hard-coding AMI IDs.

Or via SSM parameter store: aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2023/recommended --region eu-west-2
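The same lookup can be done from a script. This is a rough sketch assuming boto3 and AWS credentials are available; the helper names are hypothetical. The `recommended` parameter holds a JSON document rather than a bare AMI ID, so the value has to be parsed to extract `image_id`.

```python
import json

# Parameter path taken from the CLI command above.
ECS_RECOMMENDED = "/aws/service/ecs/optimized-ami/amazon-linux-2023/recommended"


def image_id_from_recommended(value: str) -> str:
    """Extract the AMI ID from the JSON document stored in the parameter."""
    return json.loads(value)["image_id"]


def latest_ecs_ami(region: str = "eu-west-2") -> str:
    """Look up the recommended ECS-optimised AMI ID in the given region."""
    import boto3  # imported lazily so the parsing helper works without boto3

    ssm = boto3.client("ssm", region_name=region)
    response = ssm.get_parameter(Name=ECS_RECOMMENDED)
    return image_id_from_recommended(response["Parameter"]["Value"])
```

Keeping the parsing separate from the API call means it can be checked against a sample value without touching AWS.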

EKS

The EKS TF Module can be used with a data call to get, for instance, the latest bottlerocket EKS-optimised image: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/examples/eks_managed_node_group/main.tf#L527-L535

Or via SSM parameter store: aws ssm get-parameter --name /aws/service/bottlerocket/aws-k8s-1.30/x86_64/latest/image_id --region eu-west-2 --query "Parameter.Value" --output text

richgreen-moj commented 3 months ago

Based on my findings on the usage of ECS and EKS across the MP, here is a list of options that members could consider to ensure their infrastructure is patched with the latest AMIs:

Options

  1. Use a Fargate (serverless) approach so that instance patching is managed by AWS. (Use the MP module for this.)
  2. Use a Terraform data call to retrieve the latest ECS/EKS-optimised AMI image by querying the Systems Manager Parameter Store API (e.g. https://github.com/terraform-aws-modules/terraform-aws-ecs/blob/master/examples/ec2-autoscaling/main.tf#L162C3-L165)
  3. Reconsider whether workloads would be appropriate for Cloud Platform

    • Pros:
      • Easier to maintain for application owners (CP manages patching)
      • Lower costs?
    • Cons:
      • There may still be valid reasons why these workloads need to be hosted in MP
  4. Make users aware of the latest AMIs as they are released via an updates channel in Slack?

My Recommendation:

Raise a ticket to explore whether options 1/2/3 would be suitable for all of the applications I've identified that are running ECS/EKS with pinned AMI IDs in their code...

richgreen-moj commented 3 months ago

@sukeshreddyg suggested that we could write a Lambda function that scans the AMIs in use by clusters in member accounts and compares them with the latest versions, so that we can alert the MP team when they are out of date. I will draft a story to explore this further.

richgreen-moj commented 3 months ago

Stories to write: