utoronto-2i2c / jupyterhub-deploy

Demo JupyterHub deployment for University of Toronto
BSD 3-Clause "New" or "Revised" License

2021-02-23 - University of Toronto Hub Partial Outage #87

Open yuvipanda opened 3 years ago

yuvipanda commented 3 years ago

This is a blameless postmortem

Summary

A YAML nesting misconfiguration in the Zero to JupyterHub upgrade (#82) a week prior prevented the user image from being pulled onto nodes where it didn't already exist. Two nodes had been active since before the upgrade, so the image was already present on those. When there were enough users to fill those nodes, a third node was automatically provisioned - but the user image could not be pulled there due to the YAML nesting misconfiguration! This caused server starts to fail once there were just enough users on the hub to fill the two 'working' nodes.

The YAML nesting misconfiguration was fixed, and a subsequent login related issue from a misdeploy was also fixed.

Timeline

All times in IST (+0530)

2021-02-23 08:14 AM

Reports of students not being able to log in start coming in from instructors on the Jupyter Community of Practice room on Microsoft Teams.

08:30 AM

Notice is posted on the UToronto system status page, letting users know service is degraded.

06:11 PM

2i2c engineers notice the alarm in the Microsoft Teams chat, and investigation starts. However, new server starts are working at this point, so a deeper investigation is delayed.

06:49 PM

New server starts are reported broken again.

08:20 PM

Deeper investigation starts again. Since docker image pulls were being denied, this breaking change in z2jh seemed relevant. Looking at the changelog and our current set of config, it looked like there was a YAML nesting error. As a result, all nodes that existed at the time of the migration had working authentication, but new nodes did not. The first 3 nodes' worth of users (~200ish) were able to start properly, but any more triggered a new node, which couldn't pull the image due to this issue.

The z2jh PR suggests moving imagePullSecret to the top level. However, we deploy the JupyterHub chart as a helm dependency, so it would have to be nested under a jupyterhub key. This was missed during review and testing of the upgrade PR, since the effect was delayed.
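
To make the nesting issue concrete, here is a minimal sketch in Python, using plain dictionaries in place of the parsed YAML values. Only `imagePullSecret` and the `jupyterhub` parent key come from the description above; the `KNOWN_CHART_KEYS` set, the helper name `misplaced_subchart_keys`, the image name, and the contents of `imagePullSecret` are assumptions made for illustration, not the chart's real schema or our actual config.

```python
# Illustrative subset of top-level keys of the z2jh chart (an assumption,
# not the full chart schema).
KNOWN_CHART_KEYS = {"imagePullSecret", "hub", "proxy", "singleuser"}


def misplaced_subchart_keys(values, subchart="jupyterhub"):
    """Return top-level keys that look like they belong under the subchart key.

    When a chart is pulled in as a helm dependency, its values must be nested
    under the dependency's name; a chart key left at the top level of the
    parent values is not passed down to the dependency.
    """
    return sorted(
        key for key in values if key != subchart and key in KNOWN_CHART_KEYS
    )


# Layout that follows the changelog literally: key at the top level.
broken = {
    "imagePullSecret": {"create": True},  # hypothetical contents
    "jupyterhub": {"singleuser": {"image": {"name": "example/user-image"}}},
}

# Layout needed when z2jh is a dependency: key nested under `jupyterhub`.
fixed = {
    "jupyterhub": {
        "imagePullSecret": {"create": True},  # hypothetical contents
        "singleuser": {"image": {"name": "example/user-image"}},
    },
}

print(misplaced_subchart_keys(broken))
print(misplaced_subchart_keys(fixed))
```

A check along these lines could run against the config before a deploy, so that a misplaced key is flagged at review time rather than when a new node first tries to pull the image.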

So, yay YAML nesting issues? This was fixed by this PR, and was deployed manually with a local hubploy deploy for expediency.

08:47 PM

Server starts work again, including on new nodes! Yay!

09:23 PM

Reports that new logins are unsuccessful - on clicking login, users are redirected to the same page. Users who are already logged in can start servers.

Upon more investigation, it turns out that the 'expedient' local deploy with hubploy was using a development version of hubploy that was trying to fix this bug, and wasn't fully functional. This left the hub using a combination of the old and new z2jh versions, leading to this strange error.

10:17 PM

All fixed now.

11:19 PM

UofT system status page was updated to mark the incident as resolved.

Things to improve

  1. There should be an automated health checker that informs us of server start failures. This reduces reliance on a human chain of reporting that comes up to 2i2c.
  2. Streamline escalation communication channels - the notification from Microsoft Teams was missed for several hours, but perhaps something from PagerDuty coupled with (1) would not have been missed as easily.
  3. We need a documented process for posting updates on UofT System Status. Avi was very helpful posting statuses this time, but we should document how to quickly communicate to our users during an outage.
  4. Hubploy should provide more useful diagnostic status messages as it goes along, to make issues like this easier to spot.
  5. We should try to catch YAML configuration errors - perhaps by testing them against the schema of the z2jh helm chart?
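
As a rough sketch of what the health checker in (1) might look like, the Python below polls a test server start until it succeeds, fails, or times out, and calls an alert hook on failure. Everything here is hypothetical: `poll_status` stands in for whatever asks the hub about a test user's server, `alert` stands in for the paging integration, and the status strings and timeouts are assumptions, not part of any existing tool.

```python
import time


def wait_for_server(poll_status, timeout_seconds=300.0, poll_seconds=5.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll until the test server is ready, has failed, or the timeout passes.

    poll_status is a hypothetical zero-argument callable returning one of
    "pending", "ready" or "failed". Returns an (ok, detail) tuple.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = poll_status()
        if status == "ready":
            return True, "server started"
        if status == "failed":
            return False, "spawn failed"
        if clock() >= deadline:
            return False, "timed out waiting for server"
        sleep(poll_seconds)


def run_health_check(poll_status, alert, **kwargs):
    """Run one check and call alert(message) if the server did not start."""
    ok, detail = wait_for_server(poll_status, **kwargs)
    if not ok:
        alert(f"JupyterHub server start check failed: {detail}")
    return ok
```

Passing `clock` and `sleep` in as parameters keeps the check testable without real waiting. For a check like this to have caught this incident, the test server would need to be able to land on a freshly provisioned node, since starts on existing nodes kept working.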

Action items

  1. Figure out how to get messages posted on to systemstatus.utoronto.ca
  2. Work on automated alerts delivered via PagerDuty
  3. Use this incident to inform the Zero to JupyterHub upgrade process
  4. Fix https://github.com/yuvipanda/hubploy/issues/109
  5. Discuss expected SLA from 2i2c for communications on Microsoft Teams
  6. Discuss expected SLA from 2i2c for incident response

yuvipanda commented 3 years ago

I hope to spend a day gathering input from other 2i2c folks, and fleshing out the action items some more.

choldgraf commented 3 years ago

Thanks @yuvipanda for this helpful write-up.

A few thoughts:

Suggestions: