User Guide Entry: Pod Resource General Guidelines

sj-williams commented 4 months ago

Background

We need to have a clear and prominent entry in the User Guide which outlines some general considerations and guidelines for workload requests / limits.

This is so that we have a reference in place to help ensure we don't invite resource-intensive, non-scalable monolithic workloads.

There are also related things we should put in place or investigate to get a good picture of the current state of resource configuration in the cluster. Things to think about:

- Monitoring / alerting for the general state of resource requests / limits across the cluster
- Can we implement Gatekeeper policies for admission control of excessive resource configurations, with some permissive controls that can be put in place for 'exceptions' if / when needed?
- Best practice / K8s reference docs for requests and limits guidance that we can publish in the user guide
- Think about comms for any changes and potential disruption
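
As a starting point for the first item, a rough snapshot of the configured requests and limits can be pulled with kubectl's custom-columns output. This is a minimal sketch, not an agreed approach: the function name `pod_resource_config` and the column headings are made up for illustration.

```shell
# List each pod's configured CPU/memory requests and limits across all namespaces.
# Pods with several containers show one comma-separated value per container;
# <none> means the field is not set.
# Hypothetical helper name; run it as: pod_resource_config
pod_resource_config() {
  kubectl get pods --all-namespaces \
    -o custom-columns='NAMESPACE:.metadata.namespace,POD:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
}
```

Pods showing `<none>` in a limit column, or values far larger than their neighbours, are candidates for a closer look.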

Proposed user journey

Approach

Which part of the user docs does this impact

Communicate changes

[ ] post for #cloud-platform-update
[ ] Weeknotes item
[ ] Show the Thing/P&A All Hands/User CoP
[ ] Announcements channel

Questions / Assumptions

Definition of done

[ ] readme has been updated
[ ] user docs have been updated
[ ] another team member has reviewed
[ ] smoke tests are green
[ ] prepare demo for the team

Reference

How to write good user stories

timckt commented 3 months ago

Completed the first draft of the user guide.

Next action items:

- Create Prometheus alert rules to notify when a pod's resource usage exceeds its limit
- Study and implement Gatekeeper policies for admission control of excessive resource configurations, with some permissive controls that can be put in place for 'exceptions' if / when needed
  - Reference: https://open-policy-agent.github.io/gatekeeper-library/website/validation/containerlimits/

timckt commented 3 months ago

We can use the commands below to get a snapshot of current pod usage across the cluster, sorted in descending order.

# Sort by CPU usage
kubectl top pods --all-namespaces --no-headers | sort -k3 -nr

# Sort by Memory usage
kubectl top pods --all-namespaces --no-headers | sort -k4 -nr

The monitoring namespace consumed the most memory.

Among user namespaces, data-platform-app-prison-network-app-prod consumed the most memory.

monitoring                                                       prometheus-prometheus-operator-kube-p-prometheus-0                3015m        173512Mi        
monitoring                                                       prometheus-prometheus-operator-kube-p-prometheus-2                8412m        172600Mi        
monitoring                                                       prometheus-prometheus-operator-kube-p-prometheus-1                2978m        170763Mi        
data-platform-app-prison-network-app-prod                        data-platform-app-prison-network-app-prod-54bf644b-gnjnd          9m           7660Mi          
ingress-controllers                                              nginx-ingress-default-controller-5f98cd7f5-ltdlq                  804m         6700Mi          
ingress-controllers                                              nginx-ingress-default-controller-5f98cd7f5-n2gkj                  1148m        6658Mi          
ingress-controllers                                              nginx-ingress-default-controller-5f98cd7f5-nwzv2                  1174m        6273Mi          
ingress-controllers                                              nginx-ingress-default-controller-5f98cd7f5-vdq24                  1128m        6236Mi          
ingress-controllers                                              nginx-ingress-default-controller-5f98cd7f5-52fdx                  1211m        6134Mi          
ingress-controllers                                              nginx-ingress-default-controller-5f98cd7f5-4nkjb                  891m         6058Mi
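
The per-namespace comparison can also be computed directly instead of being read off the pod list. The sketch below sums the memory column by namespace. It assumes every value is reported in Mi, as in the output above, and `mem_by_namespace` is a hypothetical helper name.

```shell
# Sum the memory column (field 4, e.g. "7660Mi") per namespace (field 1)
# and print "<namespace> <total Mi>" in descending order of total memory.
# Assumes all values use the Mi suffix; other units would be mis-summed.
mem_by_namespace() {
  awk '{ sub(/Mi$/, "", $4); total[$1] += $4 }
       END { for (ns in total) print ns, total[ns] }' \
    | sort -k2,2nr
}

# Usage (requires cluster access):
#   kubectl top pods --all-namespaces --no-headers | mem_by_namespace | head
```

Like `kubectl top` itself, this is a point-in-time snapshot, so it is best treated as a quick check and not a substitute for the Prometheus-based monitoring.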
