Open darkk opened 5 years ago
Relapse. Timeline UTC:
14 Feb 22:50 CPU spikes to 100%
15 Feb 08:15 everything recovers
Relapse. Timeline UTC:
2019-05-03T17:29:30Z CPU spikes
2019-05-04T01:31:00Z alert fires
2019-05-04T07:38:00Z @bassosimone notices and asks for guidance
2019-05-04T09:28:00Z @darkk suggests searching for issues in this repo
2019-05-04T10:14:00Z issue has been found; incident still ongoing
2019-05-04T10:18:00Z @bassosimone reboots the machine; `top` is happier
2019-05-04T10:22:00Z alerts are resolved
Impact: 8h 50m downtime of b.echo.th.ooni.io test helper (?)
Detection: CPUHigh alert with expected 8h delay
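As a quick sanity check on the figures above, the intervals can be recomputed from the timeline timestamps. This is only an arithmetic sketch; the variable names are illustrative and nothing here comes from the monitoring setup itself.

```python
from datetime import datetime, timezone


def parse(ts: str) -> datetime:
    """Parse a UTC timestamp in the format used in the timeline."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Timestamps copied from the timeline above.
spike = parse("2019-05-03T17:29:30Z")
alert = parse("2019-05-04T01:31:00Z")
resolved = parse("2019-05-04T10:22:00Z")

detection_delay = alert - spike      # CPU spike until the alert fires
alert_to_resolved = resolved - alert  # alert firing until alerts resolve
spike_to_resolved = resolved - spike  # CPU spike until alerts resolve

print("detection delay:  ", detection_delay)
print("alert to resolved:", alert_to_resolved)
print("spike to resolved:", spike_to_resolved)
```

Read this way, the detection delay is consistent with the expected 8h, and the 8h 50m impact figure is close to the alert-to-resolution interval rather than the longer spike-to-resolution interval. That is only a reading of the timeline, not something the report states.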
Timeline UTC:
17 Nov 07:30 CPU spikes to 100%, that's an `accept()` vs. EMFILE busy loop
17 Nov 15:34 `CPUHigh` alert firing
17 Nov 15:48 @darkk logs into the VM, confirms 100% CPU
17 Nov 16:14 @darkk logs into the VM, looks at `oonib`, reboots the VM
17 Nov 16:20 everything recovers to normal
What went well:
What went wrong:
`status` was reporting nothing for the init script, `reboot` was an "easy" way to restart the service
What is still unclear:
What could be done to prevent relapse and decrease impact:
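For context on the `accept()` vs. EMFILE busy loop: when a process runs out of file descriptors, `accept()` fails with EMFILE but the pending connection stays in the listen queue, so the listening socket remains readable and an event loop retries immediately, burning CPU. One mitigation sometimes used for this failure mode is to pause briefly before retrying. The sketch below uses only the Python standard library; it is not taken from this issue and not how `oonib` is implemented, and `accept_with_backoff`, `handle`, `max_accepts`, `pause`, and `sleep` are hypothetical names chosen for illustration.

```python
import errno
import time

# errno values that mean "no descriptors or buffers available right now".
RESOURCE_ERRNOS = {errno.EMFILE, errno.ENFILE, errno.ENOBUFS, errno.ENOMEM}


def accept_with_backoff(listener, handle, max_accepts, pause=0.1, sleep=time.sleep):
    """Accept up to max_accepts connections from a listening socket.

    On descriptor exhaustion, sleep for `pause` seconds before retrying
    instead of calling accept() again right away (the busy loop).
    """
    accepted = 0
    while accepted < max_accepts:
        try:
            conn, addr = listener.accept()
        except OSError as exc:
            if exc.errno in RESOURCE_ERRNOS:
                # Back off so the retry does not spin at 100% CPU.
                sleep(pause)
                continue
            raise  # Unrelated errors are not swallowed.
        accepted += 1
        handle(conn, addr)
    return accepted
```

A backoff like this only caps CPU usage while descriptors are exhausted. It does not address whatever is leaking or holding the descriptors in the first place, which would still need to be tracked down separately.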