ooni / sysadmin

System administration tools
https://ooni.org

Incident: blocked pipeline on 2019-12-10 #406

Closed FedericoCeratto closed 4 years ago

FedericoCeratto commented 4 years ago

Timeline (times in UTC+1):

federico 13:38
oh https://mon.ooni.nu/prometheus/graph?g0.range_input=12h&g0.expr=8%20*%20node_filesystem_avail%20-%207%20*%20(node_filesystem_avail%20OFFSET%201d)%20%3C%200&g0.tab=0 false alarm?

hellais 16:05
Probably related to an increased rate of msmts
@federico it actually is true: https://mon.ooni.nu/prometheus/graph?g0.range_input=1d&g0.expr=(8%20*%20node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%2Cmountpoint%3D%22%2F%22%7D-%207%20*%20(node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%7D%20offset%201d))&g0.tab=0
Hum, according to: https://mon.ooni.nu/grafana/d/AE8tFfxWk/collectors-disk-activity?orgId=1&from=now-2d&to=now
The file count on mia-ps2 is going up
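For readers unfamiliar with the alert: the expression in the first Prometheus link above URL-decodes to `8 * node_filesystem_avail - 7 * (node_filesystem_avail OFFSET 1d) < 0`. One way to read it is as a linear extrapolation: take the change in free space over the last day and project it 7 days ahead; the alert fires if the projected free space is negative. The Python sketch below only restates that arithmetic. The function name and the `horizon_days` parameter are illustrative assumptions, not part of the monitoring configuration.

```python
def disk_predicted_full(avail_now: float, avail_1d_ago: float, horizon_days: int = 7) -> bool:
    """Return True if free space is projected to drop below zero within horizon_days.

    Assumes the last day's change in free space continues linearly.
    With horizon_days=7 this is the same inequality as
    8 * avail_now - 7 * avail_1d_ago < 0.
    """
    daily_change = avail_now - avail_1d_ago          # negative when the disk is filling
    projected = avail_now + horizon_days * daily_change
    return projected < 0
```

For example, with 87 units free now and 100 units free a day earlier, the projection is 8 * 87 - 7 * 100 = -4, so the check fires; with 88 units free now it does not.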

18:33 hellais runs ./play deploy-pipeline.yml on top of clean 06b8e01 (master), which implements the fix: https://github.com/ooni/sysadmin/commit/ba0392d55973b60edf27849bc9a4489b3a867ccc

The root cause turned out to be that the newly deployed ams-ps2 collector did not have all the required directories created. As a result, when the daily cronjob that renames the files (/srv/collector/bin/daily-tasks.sh) found an empty file, it was not able to move it to the correct destination.
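The commit linked above is the actual fix. As a general illustration of this failure mode, a move step can be made tolerant of a missing destination directory by creating it first. The following is a minimal Python sketch under that assumption; it is not the content of daily-tasks.sh, and the function name `move_into` and the paths in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def move_into(src: Path, dest_dir: Path) -> Path:
    """Move src into dest_dir, creating dest_dir (and its parents) if missing."""
    # exist_ok=True makes this a no-op when the directory is already there.
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / src.name
    shutil.move(str(src), str(target))
    return target
```

A call such as `move_into(Path("incoming/report.json"), Path("archive/2019-12-10"))` (hypothetical paths) would succeed on a freshly provisioned host where `archive/` does not exist yet, instead of failing on the move.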

What went well:

What went wrong:

What we should do to prevent relapse:

hellais commented 4 years ago

I updated the above comment with a timeline and next steps.

I am closing this incident issue as we have documented the next steps as issues.