openculinary / infrastructure

This repository documents the steps required to set up a fresh RecipeRadar environment
GNU Affero General Public License v3.0

Documentation: add an incident report for the 2024-08-10 site outage #42

jayaddison closed this issue 2 months ago

jayaddison commented 2 months ago

Is your feature request related to a problem? Please describe.

The RecipeRadar service was unavailable from 2024-08-10 until recovery was completed on 2024-08-12. We should add a writeup of the timeline, the identified cause(s) of the problem, and suggested adjustments to prevent it from happening again.

Describe the solution you'd like

An incident report in the docs/incidents directory of this repository.

Describe alternatives you've considered

N/A

Additional context

N/A

jayaddison commented 2 months ago

Approximate timeline; details to be confirmed where possible:

Various things that went wrong:

What went well:

In terms of root cause: it seems that the fact that the deployed CRI-O container runtime in production was out-of-sync with the AppArmor rules from Ubuntu 24.04 is what ultimately resulted in the problem occurring.

We should encourage and make time for clearing up errors that appear in production -- generally we want the system logs and journals to be fairly clean and minimal, so that problems are easier to identify when they occur. This shouldn't involve egregious filtering of logs; instead it should generally involve fixing and reconfiguring our code and components so that fewer unnecessary info/warning/error level messages are produced.
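As a starting point for that cleanup, here is a rough sketch of how the noisiest log sources could be identified. It assumes the hosts use the systemd journal and that entries are exported with `journalctl -o json` (one JSON object per line); the function name `noisiest_sources` and the default threshold are illustrative choices, not anything that exists in this repository.

```python
import json
from collections import Counter

# Syslog severity levels as stored in the journal's PRIORITY field
# (lower number means more severe).
PRIORITY_NAMES = {
    0: "emerg", 1: "alert", 2: "crit", 3: "err",
    4: "warning", 5: "notice", 6: "info", 7: "debug",
}


def noisiest_sources(lines, max_priority=4, top=10):
    """Count journal entries of severity max_priority or worse, per source.

    `lines` is any iterable of strings, each expected to hold one JSON
    object as produced by `journalctl -o json`. Lines that do not parse
    are skipped rather than treated as errors.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(entry, dict):
            continue
        try:
            # Entries without a PRIORITY field are assumed to be info-level.
            priority = int(entry.get("PRIORITY", 6))
        except (TypeError, ValueError):
            continue
        if priority > max_priority:
            continue
        # Fall back to the command name when no syslog identifier is set.
        source = entry.get("SYSLOG_IDENTIFIER") or entry.get("_COMM") or "unknown"
        level = PRIORITY_NAMES.get(priority, str(priority))
        counts[(str(source), level)] += 1
    return counts.most_common(top)
```

One possible way to use it: pipe the output of `journalctl --since yesterday -o json` into a small script that calls `noisiest_sources(sys.stdin)` and prints the result. The sources at the top of the list are the candidates for fixing or reconfiguring first, rather than filtering.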

jayaddison commented 2 months ago

> Late July / early August 2024: production hosting upgraded to use Ubuntu 24.04

> In terms of root cause: it seems that the fact that the deployed CRI-O container runtime in production was out-of-sync with the AppArmor rules from Ubuntu 24.04 is what ultimately resulted in the problem occurring.

This seems incorrect: the Ubuntu 24.04 upgrade occurred on 2024-08-12 as part of the opportunistic updates during outage recovery.

So it may be the case that an earlier update to the AppArmor rules caused the breakage. This may still be a dependency-version-skew problem (rules and runtime out-of-sync), but the way it was introduced seems less clear.

jayaddison commented 2 months ago

Resolved by #43.