ministryofjustice / cloud-platform

Documentation on the MoJ cloud platform
MIT License
Recreate RShiny app scenario in test cluster to investigate CPU Critical #5251

Open sj-williams opened 9 months ago

sj-williams commented 9 months ago

Background

Find out in what scenarios a pod can increase underlying node CPU usage.

The RShiny problems were tracked down to a liveness probe regularly hitting an endpoint that opened a new session, which was then never closed. Can we recreate this scenario with an app?

What we want to recreate here is a node's CPU becoming critical and breaking workloads on the node, and then k8s services failing (like metrics server, calico).
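One way the scenario above could be recreated is with a small test app whose probe endpoint opens a session that is never closed. The following is a minimal sketch only, not the actual RShiny behaviour: `LeakySessionStore`, `make_handler`, `simulate_probes`, `serve`, the port, and the assumption that each open session costs a small amount of periodic CPU are all hypothetical choices made for illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class LeakySessionStore:
    """Hypothetical session registry that opens sessions and never closes them."""

    def __init__(self, work_interval_s=0.1):
        self.work_interval_s = work_interval_s
        self._lock = threading.Lock()
        self._sessions = []
        self._stop = threading.Event()

    def open_session(self):
        # Assumption: every session keeps a worker alive that does periodic work.
        worker = threading.Thread(target=self._session_loop, daemon=True)
        with self._lock:
            self._sessions.append(worker)
        worker.start()

    def _session_loop(self):
        # Runs until shutdown() is called; nothing in the app ever ends a session.
        while not self._stop.is_set():
            sum(i * i for i in range(10_000))  # small CPU burst per tick
            self._stop.wait(self.work_interval_s)

    def count(self):
        with self._lock:
            return len(self._sessions)

    def shutdown(self):
        # Only for tests and clean-up; the leaky app itself never calls this.
        self._stop.set()


def make_handler(store):
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Every request, including each liveness probe, opens a new session.
            store.open_session()
            body = f"sessions open: {store.count()}\n".encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            pass  # keep probe traffic out of the logs

    return Handler


def simulate_probes(store, n):
    """Open n sessions, as n liveness probe hits would, and return the total."""
    for _ in range(n):
        store.open_session()
    return store.count()


def serve(store, port=8080):
    """Serve the leaky endpoint until the process is killed."""
    server = ThreadingHTTPServer(("0.0.0.0", port), make_handler(store))
    server.serve_forever()
```

Calling `serve(LeakySessionStore())` in a container and pointing the pod's liveness probe at `/` should make the session count, and with it the CPU use, grow for as long as the pod lives. Whether that is enough to push a node to CPU critical will depend on the probe period and on whether the pod has CPU limits set, so those would need to be varied in the test cluster.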

DOD

Link to notes for Rshiny app issues: https://docs.google.com/document/d/1qAxCYFzDQta00l4v3IZ1CUyjOECuWAXDh3oAqNF0UtA/edit

Proposed user journey

Approach

Which part of the user docs does this impact

Communicate changes

Questions / Assumptions

Definition of done

tom-j-smith commented 8 months ago

Further diagnosis of the R Shiny app issue, provided by Glenn Christmas, is here: https://github.com/ministryofjustice/data-platform-support/issues/429

sj-williams commented 6 months ago

For reference, the follow-up monitoring work is tracked in this legacy ticket: https://github.com/ministryofjustice/cloud-platform/issues/4538