jupyterhub / mybinder.org-deploy

Deployment config files for mybinder.org
https://mybinder-sre.readthedocs.io/en/latest/index.html
BSD 3-Clause "New" or "Revised" License

Create a "so binder isn't operating properly, what do you do now?" guide #361

Open choldgraf opened 6 years ago

choldgraf commented 6 years ago

I think it'd be helpful if the SRE guide had a page with a general structure like:

# So Binder seems to be broken, what now?

## Where to look for more information
### Where to look on Grafana
useful graphs etc

### Things to try
building a repo, noting where it breaks down.

## Common commands to debug
* e.g., `kubectl logs`

and link out to the kubernetes debugging section of z2jh
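
As a rough illustration of what this section could list, here is a minimal sketch that only builds the `kubectl` command lines and does not run them. The `prod` namespace, the pod name, and the helper function names are assumptions made for the example, not taken from the actual deployment; `--tail` and `--previous` are standard `kubectl logs` flags.

```python
import shlex


def kubectl_get_pods_cmd(namespace="prod"):
    """Build the argv for listing pods (namespace name is an assumption)."""
    return ["kubectl", "--namespace", namespace, "get", "pods"]


def kubectl_logs_cmd(pod, namespace="prod", tail=100, previous=False):
    """Build the argv for reading the last `tail` log lines of a pod.

    previous=True adds --previous, which asks for the logs of the
    previous (e.g. crashed) container instance instead of the current one.
    """
    cmd = ["kubectl", "--namespace", namespace, "logs", pod, f"--tail={tail}"]
    if previous:
        cmd.append("--previous")
    return cmd


if __name__ == "__main__":
    # "hub-example-pod" is a hypothetical pod name; take a real one from
    # the output of the `get pods` command.
    print(shlex.join(kubectl_get_pods_cmd()))
    print(shlex.join(kubectl_logs_cmd("hub-example-pod", previous=True)))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; the printed lines can be pasted into a shell that has cluster access.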

## Things you can try to fix things
* e.g., `kubectl delete` the `hub` pod

here put things that will generally not *make things worse*, but that might solve the problem.
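
A possible shape for the "delete the `hub` pod" example, as a sketch under assumptions: the `prod` namespace and the `component=hub` label selector are guesses at how the hub pod is labelled in this deployment and should be checked before use. The helper defaults to a dry run that only returns the command string, so nothing is deleted unless `dry_run=False` is passed explicitly.

```python
import shlex
import subprocess


def delete_hub_pod_cmd(namespace="prod", selector="component=hub"):
    """Build the argv that deletes pods matching a label selector.

    Both defaults are assumptions for illustration. Deleting the pod is
    expected to make its Deployment start a fresh replacement.
    """
    return ["kubectl", "--namespace", namespace,
            "delete", "pod", "--selector", selector]


def run(cmd, dry_run=True):
    """Return the command as a shell string; execute it only if dry_run is False."""
    printable = shlex.join(cmd)
    if not dry_run:
        # check=True raises CalledProcessError if kubectl exits non-zero.
        subprocess.run(cmd, check=True)
    return printable


if __name__ == "__main__":
    # Dry run: prints the command without touching the cluster.
    print(run(delete_hub_pod_cmd()))
```

Defaulting to a dry run fits the "will generally not make things worse" intent of this section: the operator sees exactly what would be executed before opting in.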

## When to escalate the problem
Information on when to bring this up to the team as a whole, and only after a few common fixes have been tried. We can't have the whole Binder team go into red-alert any time there's an outage.

## What to do if the problem still isn't resolved?
Instructions for e.g., creating an issue that points out the problem and things you've tried to fix it thus far, so that the dev team can get to it when they have the time.

## What to do after the problem is resolved?
Instructions on incident reports etc.

What do people think about this?

cc @yuvipanda @willingc also maybe @betatim as he's helped put out a few binder fires in his day :-)

willingc commented 6 years ago

From the official Kubernetes docs:

yuvipanda commented 6 years ago

https://victorops.com/blog/minimum-viable-runbook-part-one also has valuable information. I think these are called 'Runbooks' in sysadmin lingo, and there's lots of literature out there on them.

yuvipanda commented 6 years ago

https://github.com/jupyterhub/mybinder.org-deploy/pull/400 has some info on the various things that can go wrong during a deployment, along with pointers on where to start.

However, almost none of our recent outages have been related to deployments, so we still need more runbooks.