ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License

Configure worker node Image Garbage Collection threshold #5670

Open sj-williams opened 2 months ago

sj-williams commented 2 months ago

Background

We have been responding to a higher frequency of high priority root volume capacity alarms in recent months by increasing the worker node root volume size.

Many user applications have large container images, and as our cluster node count increases, so does the likelihood of nodes holding a large amount of cached images.

This might be addressed by increasing our node recycle frequency; alternatively, we could (and should) look into tuning our node image garbage collection thresholds. By default, image GC kicks in at 85% volume usage, so given the size of some of the larger container images, we may be getting too close to 100% before cleanup can occur.

Approach

Test editing our worker node kubelet config to set the thresholds to lower values. Guidance here:

https://repost.aws/knowledge-center/eks-worker-nodes-image-cache
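
Following that guidance, a minimal sketch of one option: passing the image GC thresholds as kubelet flags via the EKS bootstrap script (the cluster name and values here are illustrative, not a decided approach):

    # Sketch only: pass image GC thresholds as kubelet flags when the node bootstraps.
    /etc/eks/bootstrap.sh <cluster-name> \
      --kubelet-extra-args '--image-gc-high-threshold=75 --image-gc-low-threshold=70'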

Which part of the user docs does this impact

Communicate changes

Questions / Assumptions

Definition of done

Reference

How to write good user stories

kyphutruong commented 1 month ago

At the moment the GC thresholds are at the defaults, which are the following:

    "imageGCHighThresholdPercent": 85,
    "imageGCLowThresholdPercent": 80,

In the kubelet config on our nodes these values are not specified, meaning they take on the default values above.

We need to inject values into the config to set new thresholds.
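
A minimal sketch of what that injection could look like, assuming the standard EKS AL2 kubelet config path (the exact mechanism in our user data may differ):

    #!/bin/bash
    # Sketch only: set the image GC thresholds in the node's kubelet config
    # before the kubelet starts. Path assumes the standard EKS AL2 AMI layout.
    KUBELET_CONFIG=/etc/kubernetes/kubelet/kubelet-config.json
    jq '.imageGCHighThresholdPercent = 75 | .imageGCLowThresholdPercent = 70' \
      "$KUBELET_CONFIG" > /tmp/kubelet-config.json \
      && mv /tmp/kubelet-config.json "$KUBELET_CONFIG"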

kyphutruong commented 1 month ago

Working branch here

Have run the branch against a test cluster and can see that the kubelet configuration gets updated with the following GC thresholds:

    "imageGCHighThresholdPercent": 75,
    "imageGCLowThresholdPercent": 70,

The implementation sets the thresholds by injecting values into the pre_bootstrap_user_data script, which updates the launch templates. This causes all the nodes to recycle.
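
For reference, one way to confirm the live values on a recycled node is to query the kubelet's running configuration through the API server node proxy (the node name is a placeholder):

    kubectl get --raw "/api/v1/nodes/<node-name>/proxy/configz" \
      | jq '.kubeletconfig | {imageGCHighThresholdPercent, imageGCLowThresholdPercent}'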

Parking the rollout until we have more of the team back from leave.