ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform

Concourse cleanup job: for errored jobs #5672

Open jaskaransarkaria opened 5 months ago

jaskaransarkaria commented 5 months ago

Background

Errored jobs seem to stick around and slow down nodes being drained. Write a small bash script to delete jobs in errored states across the cluster (nightly or weekly), e.g.:

│ prisoner-content-hub-production                prisoner-content-hub-backend-db-backup-cronjob-28617360-vk65q    ●  0/1   Completed                          0   0    0      0      0 │
│ pathfinder-preprod                             notifications-28617270-xql4w                                     ●  0/1   ContainerStatusUnknown             0   0    0      0      0 │
│ pathfinder-preprod                             push-extract-28617165-vznn7                                      ●  0/1   ContainerStatusUnknown             0   0    0      0      0 │
│ track-a-move-prod                              track-a-move-prod-report-28559645-dfvl2                          ●  0/1   CreateContainerConfigError         0   0    0      0      0 │
│ court-probation-dev                            analytics-data-extractor-28617180-x7hrm                          ●  0/1   Error                              0   0    0      0      0 │
│ court-probation-preprod                        analytics-data-extractor-28617180-htjq9                          ●  0/1   Error                              0   0    0      0      0 │
│ hmpps-incentives-dev                           queue-housekeeping-cronjob-28617460-lnf2q                        ●  0/1   Error                              0   0    0      0      0 │
│ hmpps-incentives-preprod                       queue-housekeeping-cronjob-28617440-m5zpw                        ●  0/1   Error                              0   0    0      0      0 │
│ pathfinder-dev                                 moj-data-platform-extractor-28617240-6cdvw                       ●  0/1   Error                              0   0    0      0      0 │
│ pathfinder-dev                                 moj-data-platform-extractor-28617240-6jmpj                       ●  0/1   Error                              0   0    0      0      0 │
│ pathfinder-dev                                 moj-data-platform-extractor-28617240-g4wqb                       ●  0/1   Error                              0   0    0      0      0 │
│ pathfinder-dev                                 moj-data-platform-extractor-28617240-pwkb8                       ●  0/1   Error                              0   0    0      0      0 │
│ pathfinder-dev                                 push-extract-28617165-z9wsh                                      ●  0/1   Error                              0   0    0      0      0 │

Definition of done

Reference

How to write good user stories

jaskaransarkaria commented 5 months ago

https://github.com/kubernetes-sigs/descheduler/tree/master?tab=readme-ov-file#removefailedpods
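
If we went the descheduler route, a minimal policy enabling that strategy might look like the sketch below (v1alpha1 policy format; the grace period and init-container setting are illustrative assumptions, not agreed values):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveFailedPods":
    enabled: true
    params:
      failedPods:
        # Only evict failed pods older than an hour, so users can still
        # inspect recent failures (threshold is an assumption).
        minPodLifetimeSeconds: 3600
        includingInitContainers: true
```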

poornima-krishnasamy commented 5 months ago

This job deletes all successful pods: https://github.com/ministryofjustice/cloud-platform-environments/blob/main/bin/delete_completed_jobs.rb. So can we add the failed jobs as well?
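
For context, in plain kubectl terms (a sketch, independent of the Ruby script's internals) the difference is just the pod phase in the field selector:

```bash
# Pods from jobs that finished successfully (what the current script removes)
kubectl get pods -A --field-selector=status.phase=Succeeded --no-headers

# Pods from jobs that failed (the proposed addition)
kubectl get pods -A --field-selector=status.phase=Failed --no-headers
```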

jaskaransarkaria commented 5 months ago

We should probably convert this ☝🏽 script to bash or Go too.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 3 months ago

This issue was closed because it has been stalled for 7 days with no activity.

jackstockley89 commented 2 weeks ago

Can we create a Gatekeeper rule that requires the following parameters to be set on CronJobs, so that finished jobs are cleared?

spec:
  schedule: "*/1 * * * *"
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
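
A hypothetical ConstraintTemplate along those lines is sketched below. The name, kind, and the cap-based check are all assumptions; checking against a maximum value rather than field presence sidesteps the question of whether API-server defaulting has already populated the fields by the time Gatekeeper sees the object:

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8scronjobhistorylimits  # hypothetical name
spec:
  crd:
    spec:
      names:
        kind: K8sCronJobHistoryLimits
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8scronjobhistorylimits

        violation[{"msg": msg}] {
          input.review.object.kind == "CronJob"
          # Default of 999 also flags CronJobs that leave the field unset.
          limit := object.get(input.review.object.spec, "successfulJobsHistoryLimit", 999)
          limit > 0
          msg := sprintf("successfulJobsHistoryLimit must be 0, got %v", [limit])
        }

        violation[{"msg": msg}] {
          input.review.object.kind == "CronJob"
          limit := object.get(input.review.object.spec, "failedJobsHistoryLimit", 999)
          limit > 0
          msg := sprintf("failedJobsHistoryLimit must be 0, got %v", [limit])
        }
```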
jaskaransarkaria commented 2 days ago

From the duplicate ticket:

These pods hog IPs and prevent nodes from being drained; clean them up to help keep the cluster in a good state.

https://mojdt.slack.com/archives/C514ETYJX/p1724835696695569

Approach

Create a new maintenance job that runs nightly and cleans up each of our clusters. The code below might be useful; you can swap parallel for xargs if preferred:

kubectl get pods --field-selector=status.phase=Failed -A --no-headers | awk '{print $2 " -n " $1}' | parallel -j1 --will-cite kubectl delete pod "{= uq =}"
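
An equivalent using GNU xargs (an untested sketch) would be:

```bash
kubectl get pods --field-selector=status.phase=Failed -A --no-headers \
  | awk '{print $2 " -n " $1}' \
  | xargs -r -L1 kubectl delete pod
```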

We should consider:

- treating errored and completed jobs differently (we need to make sure we aren't blasting genuinely errored jobs, so users have time to fix the errors)
- treating prod and non-prod differently (see the sketch after this list)
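
A minimal sketch incorporating both points, assuming GNU date is available and that production namespaces end in -prod or -production (both assumptions, not agreed conventions):

```bash
#!/usr/bin/env bash
# Sketch of a nightly cleanup job -- the namespace pattern, grace period,
# and scheduling are assumptions to be agreed, not a final design.
set -euo pipefail

# Assumption: production namespaces end in "-prod" or "-production".
PROD_PATTERN='-(prod|production)$'
# Leave failed pods alone for 24h so users can debug genuine errors.
GRACE_SECONDS=$((24 * 60 * 60))
now=$(date +%s)

# Completed pods can go immediately; they only waste IPs.
kubectl get pods -A --no-headers --field-selector=status.phase=Succeeded |
  awk '{print $2 " -n " $1}' |
  xargs -r -L1 kubectl delete pod --wait=false

# Failed pods: skip prod, and skip anything younger than the grace period.
kubectl get pods -A --no-headers --field-selector=status.phase=Failed \
  -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,START:.status.startTime' |
while read -r ns name start; do
  [[ "$ns" =~ $PROD_PATTERN ]] && continue
  # GNU date; pods with no startTime are treated as new and skipped.
  start_epoch=$(date -d "$start" +%s 2>/dev/null || echo "$now")
  (( now - start_epoch < GRACE_SECONDS )) && continue
  echo "deleting failed pod $name in $ns"
  kubectl delete pod "$name" -n "$ns" --wait=false
done
```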