Closed by jayaddison 2 months ago
Approximate timeline; details to be confirmed where possible:
Various things that went wrong:
What went well:
In terms of root cause: it seems that the fact that the deployed CRI-O container runtime in production was out-of-sync with the AppArmor rules from Ubuntu 24.04 is what ultimately resulted in the problem occurring.
We should encourage and make time for clearing up errors that appear in production -- generally we want the system logs and journals to be fairly clean and minimal, so that problems are easier to identify when they occur. This shouldn't involve egregious filtering of logs; instead it should generally involve fixing and reconfiguring our code and components so that fewer unnecessary info/warning/error level messages are produced.
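As a starting point for that cleanup, it may help to see which sources produce the most warning-or-worse messages. The sketch below is only an illustration, not existing RecipeRadar tooling: it assumes the journal is exported with `journalctl -o json` (one JSON record per line), and the function name `noisy_sources` and the script filename are hypothetical.

```python
import json
import sys
from collections import Counter

# journald priorities run from 0 (emerg) to 7 (debug); 4 is "warning".
WARNING = 4


def noisy_sources(lines, max_priority=WARNING):
    """Count journal records at `max_priority` or more severe, per source.

    `lines` is an iterable of strings, each expected to hold one JSON
    record as emitted by `journalctl -o json`. Returns (source, count)
    pairs, most frequent first.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON record
        if not isinstance(entry, dict):
            continue
        try:
            # PRIORITY is exported as a string such as "3"; default to info.
            priority = int(entry.get("PRIORITY", 6))
        except (TypeError, ValueError):
            continue  # unexpected shape (e.g. a repeated field); ignore it
        if priority <= max_priority:
            source = (
                entry.get("_SYSTEMD_UNIT")
                or entry.get("SYSLOG_IDENTIFIER")
                or "unknown"
            )
            counts[str(source)] += 1
    return counts.most_common()


if __name__ == "__main__":
    # Hypothetical usage: journalctl -b -o json | python3 noisy_sources.py
    for source, count in noisy_sources(sys.stdin):
        print(f"{count:6d}  {source}")
```

The output is a ranked list of sources to investigate and fix or reconfigure; it deliberately does no filtering of the logs themselves.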
Late July / early August 2024: production hosting upgraded to use Ubuntu 24.04
In terms of root cause: it seems that the fact that the deployed CRI-O container runtime in production was out-of-sync with the AppArmor rules from Ubuntu 24.04 is what ultimately resulted in the problem occurring.
This seems incorrect: the Ubuntu 24.04 upgrade occurred on 2024-08-12 as part of the opportunistic updates during outage recovery.
So it may be the case that an earlier update to the AppArmor rules caused the breakage. This may still be a dependency-version-skew problem (rules and runtime out-of-sync), but the way it was introduced seems less clear.
Resolved by #43.
Is your feature request related to a problem? Please describe.
The RecipeRadar service was unavailable for a significant duration of time from 2024-08-10 until recovery was completed on 2024-08-12. We should add a writeup of the timeline, identified problem cause(s), and suggested adjustments to prevent this happening again.
Describe the solution you'd like
An incident report in the docs/incidents directory of this repository.
Describe alternatives you've considered
N/A
Additional context
N/A