ooni / sysadmin

System administration tools
https://ooni.org

b.echo.th.ooni.io possibly down for 8 hours #244

Open darkk opened 5 years ago

darkk commented 5 years ago

Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)

Detection: CPUHigh alert with expected 8h delay

Timeline UTC:

- 17 Nov 07:30 CPU spikes to 100%; that's an accept() vs. EMFILE busy loop
- 17 Nov 15:34 CPUHigh alert firing
- 17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
- 17 Nov 16:14 @darkk logs into the VM, looks at oonib, reboots the VM
- 17 Nov 16:20 everything recovers to normal
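For context on the accept() vs. EMFILE busy loop: when a process runs out of file descriptors, accept() fails with EMFILE while the listening socket stays readable, so a naive loop retries immediately and pins a CPU. The sketch below is a generic illustration of backing off instead of spinning; it is not oonib's actual code, and `accept_with_backoff`, `handle`, `max_iterations`, `delay` and `sleep` are hypothetical names chosen for the example.

```python
import errno
import time


def accept_with_backoff(listener, handle, max_iterations=None,
                        delay=0.1, sleep=time.sleep):
    """Accept connections, pausing briefly when the fd limit is hit.

    listener: any object with a socket-like accept() method.
    handle:   callback invoked as handle(conn, addr) for each connection.
    Returns the number of times the loop had to back off.
    """
    iterations = 0
    backoffs = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            conn, addr = listener.accept()
        except OSError as exc:
            # EMFILE: per-process fd limit; ENFILE: system-wide fd limit.
            if exc.errno in (errno.EMFILE, errno.ENFILE):
                backoffs += 1
                sleep(delay)  # yield the CPU instead of retrying at once
                continue
            raise  # any other error is not handled by this sketch
        handle(conn, addr)
    return backoffs
```

Backing off only caps CPU usage; it does not free descriptors, so whatever leaks or holds them would still need to be found separately.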

What went well:

What went wrong:

What is still unclear:

What could be done to prevent relapse and decrease impact:

darkk commented 5 years ago

Relapse. Timeline UTC:

- 14 Feb 22:50 CPU spikes to 100%
- 15 Feb 08:15 everything recovers

bassosimone commented 5 years ago

Relapse. Timeline UTC:

- 2019-05-03T17:29:30Z CPU spikes
- 2019-05-04T01:31:00Z alert fires
- 2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
- 2019-05-04T09:28:00Z @darkk suggests searching for issues in this repo
- 2019-05-04T10:14:00Z issue has been found; incident still ongoing
- 2019-05-04T10:18:00Z @bassosimone reboots the machine; top is happier
- 2019-05-04T10:22:00Z alerts are resolved
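As a quick check on this timeline, the detection delay and total incident duration can be computed from the timestamps above. This is a minimal sketch that assumes the CPU spike marks the start of the incident and the alert resolution marks its end; `parse_utc` is a helper name introduced only for this example.

```python
from datetime import datetime, timezone


def parse_utc(ts):
    """Parse a timestamp such as 2019-05-03T17:29:30Z as an aware UTC datetime."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


spike = parse_utc("2019-05-03T17:29:30Z")     # CPU spikes
alert = parse_utc("2019-05-04T01:31:00Z")     # alert fires
resolved = parse_utc("2019-05-04T10:22:00Z")  # alerts are resolved

detection_delay = alert - spike
total_duration = resolved - spike

print(detection_delay)  # 8:01:30
print(total_duration)   # 16:52:30
```

The roughly 8h detection delay matches the "expected 8h delay" of the CPUHigh alert noted in the first report.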