Impact: Measurements dropped from 63,123,792 in October 2023 to 48,336,322 in November 2023.
Detection: @agrabeli noticed this while pulling together the November monthly report
Timeline (EST)
3:48 AM: Maria notices the measurement drop while preparing the monthly report
4:16 AM: Simone starts investigating using Explorer
Time?: Jessie comes online and starts looking at the timing of releases
Time?: Jessie and Simone pull in Federico for help
Around 9:30 AM: Norbel is pulled in to help
Around 1:22 PM: the source of the problem is discovered
1:40 PM: patch to fix the issue
What would have helped us catch this bug earlier?
There is no debate that this is a bug: it was working before and not working as intended afterwards
We may ultimately want to change how the system works, but the change was not intentional, so we can still consider it a bug
Code review did not spot this change; the review could have been more thorough
Could automated testing have caught this? Current tests don't consider this specific scenario
We need to improve the tests we have so far and write more tests to capture more scenarios
More emphasis on making probes understandable by developers, not just users, so it is easier for us to debug when there are issues
What would have helped us catch the drop in measurements earlier? (It is not ideal that it was only caught during the monthly report process.)
Daily or weekly team measurement reviews to go through notebooks
Need a solution for a weekly measurement review in Federico’s absence
Scheduled releases + monitoring as a team - so people know who is doing the release, who is doing the monitoring, etc. Whoever is doing the release and monitoring knows what they are responsible for
There is currently no explicit guideline on who does monitoring
Release owner - someone who follows the release and pokes other people if there are issues and people aren’t responding
We had a notebook that monitored measurements and sent an alarm - we confirmed that this alert was broken. Grafana was updated and the syntax used in the notebook no longer works - we can fix this specific notebook, but we may need to review the alerts in other notebooks too.
As part of fixing this issue with notebooks, we could send alerts to Alertmanager and Slack and keep an alert history, so we can see past alarms and which ones are currently active (a sketch of what this could look like appears after this list)
This alarm is experimental, so there isn't an official procedure for deciding which alerts are considered working and which are not
Prepare Jupyter notebooks for future use: a set of templates that people know they can reuse
Define severities and do emergency response training
This could also make it clearer whether an alert needs action or not
Also add a definition of tiers in the backend depending on severity
On-call rota?
Do more regular post-mortems to understand the root causes
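A minimal sketch of what the Alertmanager-backed notebook check mentioned above could look like. The data source, the Alertmanager URL, the thresholds, and the alert labels are illustrative assumptions rather than the current setup; Alertmanager's v2 API accepts a JSON list of alerts and takes care of routing (e.g. to Slack) and of keeping the alert history.

```python
# Sketch of a notebook cell that checks for a drop in daily measurement
# counts and forwards an alert to Alertmanager, which can route it to
# Slack and keep the alert history. Data source, URL and thresholds are
# illustrative assumptions.
from datetime import datetime, timezone

import pandas as pd
import requests

ALERTMANAGER_URL = "http://alertmanager.example.internal:9093/api/v2/alerts"  # hypothetical host


def check_measurement_drop(daily_counts: pd.Series, threshold: float = 0.2) -> None:
    """Alert if the median of the last 7 days is more than `threshold` below the prior 28-day median."""
    recent = daily_counts.tail(7).median()
    baseline = daily_counts.tail(35).head(28).median()
    if baseline == 0:
        return
    drop = 1 - recent / baseline
    if drop <= threshold:
        return  # no significant drop
    alert = {
        "labels": {
            "alertname": "MeasurementCountDrop",
            "severity": "warning",  # severity tiers still to be defined
        },
        "annotations": {
            "summary": f"Daily measurements dropped {drop:.0%} below the 28-day baseline",
            "recent_median": f"{recent:.0f}",
            "baseline_median": f"{baseline:.0f}",
        },
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    # Alertmanager's v2 API accepts a JSON list of alerts; it keeps the
    # history and handles routing (e.g. to a Slack receiver).
    requests.post(ALERTMANAGER_URL, json=[alert], timeout=10).raise_for_status()
```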
What might have helped make the investigation process easier?
Emergency response training
Depending on the event, there could be a call leader/incident leader who brings people in
More transparency about what the app is doing in the background - it was not easy to get. We were slower because it was difficult to gather information about what the app was doing
Why is it hard to know what is going on?
Not displaying results from background runs
Add more backend metrics and more alarms, for example (a sketch of these checks appears at the end of this section):
Alert on median test duration
Alert on median number of URLs in each report
Alert on percentage of probes on battery vs. charging
Alert on percentage of probes on WiFi vs. LTE
We don't have a way to tell the app to use a test backend
We had to ask Federico to deploy a patch in production to help us
This would have taken longer for someone with less backend experience
It would be nice to be able to tell the app "use the test backend"
[wishful thinking ahead] Doing fault injection on the testbed backend and test probes to create failures, see how they behave, and check whether alerts are triggered
There is no easy way to integration-test master in probe-cli with master in probe-android, so it's a bit annoying that we only discover issues after a probe-cli release [a more general comment than this bug, but we still have this fundamental issue]
Look for new / upcoming monitoring tools that use AI/ML to perform outlier/anomaly detection automatically
Doing smaller / faster releases? <- yes!
We’re trying!
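As a rough illustration of the backend metrics and alarms listed above, here is a minimal Python sketch that computes those four metrics from a DataFrame of recent reports and flags deviations from a baseline. The column names, the baseline source, and the 30% tolerance are assumptions for illustration, not the actual backend schema.

```python
# Sketch of the backend metrics proposed above, computed from a pandas
# DataFrame of recent reports. Column names (test_runtime, url_count,
# on_battery, network_type), the baseline, and the 30% tolerance are
# illustrative assumptions.
import pandas as pd


def report_health_metrics(reports: pd.DataFrame) -> dict:
    """Compute the per-report health metrics we would like to alert on."""
    return {
        "median_test_duration_s": reports["test_runtime"].median(),
        "median_urls_per_report": reports["url_count"].median(),
        "pct_probes_on_battery": reports["on_battery"].mean() * 100,
        "pct_probes_on_wifi": (reports["network_type"] == "wifi").mean() * 100,
    }


def find_anomalies(current: dict, baseline: dict, tolerance: float = 0.3) -> list[str]:
    """Flag metrics that deviate more than `tolerance` (30%) from their baseline value."""
    anomalies = []
    for name, value in current.items():
        expected = baseline.get(name)
        if expected and abs(value - expected) / expected > tolerance:
            anomalies.append(f"{name}: {value:.1f} (expected ~{expected:.1f})")
    return anomalies
```

Each flagged anomaly could then be forwarded to Alertmanager in the same way as the measurement-drop check sketched earlier.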
Next Steps
Jessie Bonisteel to discuss with Arturo how to handle measurement reviewing in Federico’s absence
Simone Basso to figure out with Norbel Ambanumben how to have better cross-integration testing between the engine and the Android app
Simone Basso and Norbel Ambanumben to make a wish list of ways to make the app easier to debug once installed, so that we can do a better job at QA
Federico Ceratto: TODO fix specifically for the long-term predictor alert that didn't fire
TODO: add this to the backend docs
TODO: define severities and tiers with Simone Basso