CPU pressure - Shared resources guide

tumido commented 3 years ago

Create a guide on how to be respectful and mindful of shared resources, and on what CPU request and CPU limit mean. We're in a constant state of CPU pressure, even though utilization is never above 20%.

Explain to users what a CPU request means and what a CPU limit means.
Explain to users that setting a CPU request above 1 brings no benefit if their application is not multi-core.
Create a dashboard where they can check their Kubeflow pipeline/JupyterHub pod's actual usage compared to its CPU request.
Create an alert when the CPU request is way above the actual usage.
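
To make the dashboard/alert idea above concrete, here is a minimal sketch of the comparison it would perform. The `PodCpu` type, the pod names, the sample numbers and the `ratio_threshold` default are all hypothetical, chosen only for illustration; a real alert would read these values from the monitoring stack rather than from a hard-coded list.

```python
from dataclasses import dataclass


@dataclass
class PodCpu:
    name: str
    request_cores: float  # CPU request: what the scheduler reserves for the pod
    usage_cores: float    # observed average CPU usage (hypothetical sample data)


def over_requested(pods, ratio_threshold=4.0):
    """Return (name, ratio) for pods whose request is >= ratio_threshold x usage."""
    flagged = []
    for pod in pods:
        # Avoid division by zero for completely idle pods.
        usage = max(pod.usage_cores, 0.001)
        ratio = pod.request_cores / usage
        if ratio >= ratio_threshold:
            flagged.append((pod.name, round(ratio, 1)))
    return flagged


# Hypothetical pods: a mostly idle notebook and a busy pipeline step.
pods = [
    PodCpu("jupyterhub-nb-alice", request_cores=4.0, usage_cores=0.3),
    PodCpu("kfp-train-step", request_cores=2.0, usage_cores=1.6),
]

print(over_requested(pods))
```

The notebook pod reserves 4 cores while using 0.3, so it is flagged; the pipeline step is not. This is also how a cluster can be "full" from the scheduler's point of view while utilization stays low: the scheduler only looks at requests, not at usage.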

tumido commented 3 years ago

Based on investigation:

Dashboard: https://grafana-route-opf-monitoring.apps.zero.massopen.cloud/d/GGgK4Q9Mk/tcoufal-testing-total-cpu-usage-vs-request?orgId=1
CPU request is much higher than any workload we run
CPU tiers are too generous in JH
Why does openshift-storage consume so much?
Resource requests of Completed and Failed pods are not actually considered by the scheduler, even though deleting them appears to unstick the cluster (further investigation needed)
There are conflicting metrics:
- kube_pod_container_resource_limits_cpu_cores shows the Pod's CPU request as defined in the resource
- kube_pod_resource_request shows the Pod's effective resource request, as considered by the scheduler
Prioritize https://github.com/operate-first/SRE/issues/229
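
For readers wondering why the declared and effective requests can differ, below is a simplified sketch of how an effective CPU request might be derived from a pod spec. It assumes the commonly documented rule (the larger of the summed app-container requests and the biggest init-container request, plus pod overhead) and treats Succeeded/Failed pods as requesting nothing. The function name and arguments are hypothetical, and the real scheduler logic has more cases than this.

```python
def effective_cpu_request(containers, init_containers=(), overhead=0.0, phase="Running"):
    """Approximate the CPU request (in cores) the scheduler accounts for.

    containers / init_containers: per-container CPU requests in cores.
    Simplified assumption: terminal pods no longer hold any request.
    """
    if phase in ("Succeeded", "Failed"):
        return 0.0
    # App containers run together, init containers run one at a time,
    # so the pod needs the larger of the two figures.
    app_total = sum(containers)
    init_peak = max(init_containers, default=0.0)
    return max(app_total, init_peak) + overhead


# Declared app containers sum to 0.75 cores, but an init container needs 1 core.
print(effective_cpu_request([0.5, 0.25], init_containers=[1.0]))
# A completed pod keeps its spec, but contributes nothing here.
print(effective_cpu_request([2.0], phase="Succeeded"))
```

Under these assumptions a per-container metric and a scheduler-side metric can legitimately disagree for the same pod, which may explain part of the conflict noted above.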

New user-focused dashboards (convert to declarative and provide upstream):

oindrillac commented 3 years ago

@tumido I get a 403 permission denied on these dashboards when I try to log in through moc-sso. Can you please provide access?

HumairAK commented 3 years ago

Currently only people with get access to opf-monitoring can access Grafana (i.e. people in the operate-first OCP group), due to the change here. We should expand this to additional users (subject to ongoing discussions regarding user policies, of course).

tumido commented 3 years ago

@HumairAK can we extend that to the data-science group then? :slightly_smiling_face:

oindrillac commented 3 years ago

Thanks! If these are the user-oriented dashboards which @tumido presented, we (data science users) would very much benefit from having access to these 🙂

tumido commented 3 years ago

Yes, these are the user-oriented dashboards (the two linked at the end). We'll convert them to permanent dashboards soon; I still have some fiddling to do. Right now they are just prototypes, stored in the Grafana runtime.

sesheta commented 3 years ago

Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

HumairAK commented 3 years ago

/remove-lifecycle stale