Impact: Measurements dropped from 63,123,792 in October 2023 to 48,336,322 in November 2023.
Detection: @agrabeli noticed this while pulling together the November monthly report
Timeline (EST)
3:48 AM: Maria notices the measurement drop while preparing the monthly report
4:16 AM: Simone starts investigating using Explorer
Time?: Jessie comes online and starts looking at the timing of releases
Time?: Jessie and Simone pull in Federico for help
Around 9:30 AM: Norbel is pulled in to help
Around 1:22 PM: the source of the problem is discovered
1:40 PM: patch to fix the issue
What would have helped us catch this bug earlier?
There is no debate that this is a bug: it was working before and not working as intended afterwards
We may ultimately want to change how the system works, but the change was not intentional, so we can still consider it a bug
Code review did not spot this change; the review could have been more thorough
Could automated testing have caught this? Current tests don't consider this specific scenario
We need to improve the tests we have so far and write more tests to capture more scenarios
More emphasis on making probes understandable by developers, not just users, so it is easier for us to debug when there are issues
What would have helped us catch the drop in measurements earlier? (It is not ideal that it was only caught during the monthly report process.)
Daily or weekly team measurement reviews to go through notebooks
Need a solution for a weekly measurement review in Federico’s absence
Scheduled releases + monitoring as a team - so people know who is doing the release, who is doing the monitoring, etc. Whoever is doing the release and monitoring knows what they are responsible for
There is currently no explicit guideline on who does monitoring
Release owner - someone who follows the release and pokes other people if there are issues and people aren’t responding
We had a notebook that monitored measurements and sent an alarm - we confirmed that this alert was broken. Grafana was updated and the syntax used in the notebook no longer works - we can fix this specific notebook, but we may need to review the alerts in other notebooks too.
As part of fixing this issue with notebooks, we could send alerts to Alertmanager and Slack and keep an alert history, so we can see past alarms and which ones are currently active (a sketch of what this could look like appears after this list)
This alarm is experimental, so there isn't an official procedure for deciding which alerts are considered working and which are not
Prepare Jupyter notebooks for future use: a set of templates that people know they can reuse
Define severities and do emergency response training
This could also make it clearer whether an alert needs action or not
Also add a definition of tiers in the backend depending on severity
On-call rota?
Do more regular post-mortems to understand the root causes
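A minimal sketch of what the Alertmanager-backed notebook check mentioned above could look like. The data source, the Alertmanager URL, the thresholds, and the alert labels are illustrative assumptions rather than the current setup; Alertmanager's v2 API accepts a JSON list of alerts and takes care of routing (e.g. to Slack) and of keeping the alert history.

```python
# Sketch of a notebook cell that checks for a drop in daily measurement
# counts and forwards an alert to Alertmanager, which can route it to
# Slack and keep the alert history. Data source, URL and thresholds are
# illustrative assumptions.
from datetime import datetime, timezone

import pandas as pd
import requests

ALERTMANAGER_URL = "http://alertmanager.example.internal:9093/api/v2/alerts"  # hypothetical host


def check_measurement_drop(daily_counts: pd.Series, threshold: float = 0.2) -> None:
    """Alert if the median of the last 7 days is more than `threshold` below the prior 28-day median."""
    recent = daily_counts.tail(7).median()
    baseline = daily_counts.tail(35).head(28).median()
    if baseline == 0:
        return
    drop = 1 - recent / baseline
    if drop <= threshold:
        return  # no significant drop
    alert = {
        "labels": {
            "alertname": "MeasurementCountDrop",
            "severity": "warning",  # severity tiers still to be defined
        },
        "annotations": {
            "summary": f"Daily measurements dropped {drop:.0%} below the 28-day baseline",
            "recent_median": f"{recent:.0f}",
            "baseline_median": f"{baseline:.0f}",
        },
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }
    # Alertmanager's v2 API accepts a JSON list of alerts; it keeps the
    # history and handles routing (e.g. to a Slack receiver).
    requests.post(ALERTMANAGER_URL, json=[alert], timeout=10).raise_for_status()
```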
What might have helped make the investigation process easier?
Emergency response training
Depending on the event, there could be a call leader/incident leader who brings people in
More transparency about what the app is doing in the background - it was not easy to get. We were slower because it was difficult to gather information about what the app was doing
Why is it hard to know what is going on?
Not displaying results from background runs
Add more backend metrics and more alarms, for example (a sketch of these checks appears at the end of this section):
Alert on median test duration
Alert on median number of URLs in each report
Alert on percentage of probes on battery vs. charging
Alert on percentage of probes on WiFi vs. LTE
We don't have a way to tell the app to use a test backend
We had to ask Federico to deploy a patch in production to help us
This would have taken longer for someone with less backend experience
It would be nice to be able to tell the app "use the test backend"
[wishful thinking ahead] Doing fault injection on the testbed backend and test probes to create failures, see how they behave, and check whether alerts are triggered
There is no easy way to integration-test master in probe-cli with master in probe-android, so it's a bit annoying that we only discover issues after a probe-cli release [a more general comment than this bug, but we still have this fundamental issue]
Look for new / upcoming monitoring tools that use AI/ML to perform outlier/anomaly detection automatically
Doing smaller / faster releases? <- yes!
We’re trying!
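As a rough illustration of the backend metrics and alarms listed above, here is a minimal Python sketch that computes those four metrics from a DataFrame of recent reports and flags deviations from a baseline. The column names, the baseline source, and the 30% tolerance are assumptions for illustration, not the actual backend schema.

```python
# Sketch of the backend metrics proposed above, computed from a pandas
# DataFrame of recent reports. Column names (test_runtime, url_count,
# on_battery, network_type), the baseline, and the 30% tolerance are
# illustrative assumptions.
import pandas as pd


def report_health_metrics(reports: pd.DataFrame) -> dict:
    """Compute the per-report health metrics we would like to alert on."""
    return {
        "median_test_duration_s": reports["test_runtime"].median(),
        "median_urls_per_report": reports["url_count"].median(),
        "pct_probes_on_battery": reports["on_battery"].mean() * 100,
        "pct_probes_on_wifi": (reports["network_type"] == "wifi").mean() * 100,
    }


def find_anomalies(current: dict, baseline: dict, tolerance: float = 0.3) -> list[str]:
    """Flag metrics that deviate more than `tolerance` (30%) from their baseline value."""
    anomalies = []
    for name, value in current.items():
        expected = baseline.get(name)
        if expected and abs(value - expected) / expected > tolerance:
            anomalies.append(f"{name}: {value:.1f} (expected ~{expected:.1f})")
    return anomalies
```

Each flagged anomaly could then be forwarded to Alertmanager in the same way as the measurement-drop check sketched earlier.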
Next Steps
Jessie Bonisteel to discuss with Arturo how to handle measurement reviewing in Federico’s absence
Simone Basso to figure out with Norbel Ambanumben how to have better cross-integration testing between the engine and the Android app
Simone Basso and Norbel Ambanumben to make a wish list of ways to make the app easier to debug once installed, so that we can do a better job at QA
Federico Ceratto: TODO fix specifically for the long-term predictor alert that didn't fire
TODO: add this to the backend docs
TODO: define severities and tiers with Simone Basso