Open choldgraf opened 6 years ago
https://victorops.com/blog/minimum-viable-runbook-part-one also has valuable information. I think these are called 'runbooks' in sysadmin lingo, and there's a lot of literature out there on them.
On Mon, Feb 12, 2018 at 11:16 PM, Carol Willing notifications@github.com wrote:
From the official Kubernetes docs:
- How to debug an application, such as mybinder https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/
- How to debug the Kubernetes cluster https://kubernetes.io/docs/tasks/debug-application-cluster/debug-cluster/
-- Yuvi Panda T http://yuvi.in/blog
https://github.com/jupyterhub/mybinder.org-deploy/pull/400 has some info on the various things that can go wrong during a deployment, with pointers on where to start.
However, almost none of our recent outages have been related to deployments, so we still need more runbooks.
I think it'd be helpful if the SRE guide had a page that had a general structure like:
What do people think about this?
cc @yuvipanda @willingc also maybe @betatim as he's helped put out a few binder fires in his day :-)