Incident: blocked pipeline on 2019-12-10

Timeline (times in UTC+1):

13:38 federico: oh https://mon.ooni.nu/prometheus/graph?g0.range_input=12h&g0.expr=8%20*%20node_filesystem_avail%20-%207%20*%20(node_filesystem_avail%20OFFSET%201d)%20%3C%200&g0.tab=0 false alarm?

16:05 hellais:
Probably related to an increased rate of msmts
@federico it actually is true: https://mon.ooni.nu/prometheus/graph?g0.range_input=1d&g0.expr=(8%20*%20node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%2Cmountpoint%3D%22%2F%22%7D-%207%20*%20(node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%7D%20offset%201d))&g0.tab=0
Hum, according to: https://mon.ooni.nu/grafana/d/AE8tFfxWk/collectors-disk-activity?orgId=1&from=now-2d&to=now the file count on mia-ps2 is going up

18:33 hellais runs ./play deploy-pipeline.yml on top of clean 06b8e01 (master), which implements the fix: https://github.com/ooni/sysadmin/commit/ba0392d55973b60edf27849bc9a4489b3a867ccc
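
The alert expression in the first timeline link, 8 * node_filesystem_avail - 7 * (node_filesystem_avail offset 1d) < 0, reads as a linear extrapolation: it fires when, at the rate of consumption seen over the last day, the available space would run out in under 7 days. A minimal Python sketch of that arithmetic follows. The function names are illustrative only and are not part of the monitoring setup.

```python
from typing import Optional


def alert_fires(avail_now: float, avail_1d_ago: float) -> bool:
    """Mirror of the alert: 8 * avail - 7 * (avail offset 1d) < 0."""
    return 8 * avail_now - 7 * avail_1d_ago < 0


def days_until_full(avail_now: float, avail_1d_ago: float) -> Optional[float]:
    """Days left if the last day's consumption rate continues.

    Returns None when available space did not shrink over the last day.
    """
    daily_use = avail_1d_ago - avail_now
    if daily_use <= 0:
        return None
    return avail_now / daily_use


# Example: 70 units available yesterday, 60 today -> 10 units used per day.
print(alert_fires(60.0, 70.0), days_until_full(60.0, 70.0))
```

Rearranging 8a - 7b < 0 gives a < 7(b - a), i.e. the remaining space a is less than seven days of the daily consumption b - a, which is why the two functions agree.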

The root cause turned out to be that the newly deployed ams-ps2 collector did not have all the required directories created. As a result, when the daily cronjob that renames the files (/srv/collector/bin/daily-tasks.sh) found an empty file, it was unable to move it to the correct destination.
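
One way to guard against this class of failure is to have the move step create its destination directory before moving anything. The sketch below illustrates that idea in Python. It is not the content of daily-tasks.sh or of the fix commit, and the function name and paths are hypothetical.

```python
import shutil
from pathlib import Path


def archive_report(report: Path, dest_dir: Path) -> Path:
    """Move a report file into dest_dir, creating dest_dir if it is missing."""
    # parents=True creates intermediate directories; exist_ok=True makes
    # the call a no-op when the directory is already there.
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / report.name
    shutil.move(str(report), str(target))
    return target
```

With this shape, a freshly provisioned host with no directory tree behaves the same as an established one, and empty files are moved like any other file.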

What went well:

Alerting was useful and allowed us to spot the issue

What went wrong:

The incident had actually been going on since the 3rd of December (when this collector was deployed), and we did not notice it until the disk space problem surfaced.

What we should do to prevent relapse:

Add monitoring on the number of report files on the collector to ensure the rsync is working (https://github.com/ooni/sysadmin/issues/408)
Get rid of the sketchy daily cronjob and upload measurements elsewhere as soon as possible (https://github.com/ooni/pipeline/issues/268)
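
The first item could be sketched roughly as follows: count the files under the report directory and write the result in the Prometheus text exposition format, for example to a file picked up by node_exporter's textfile collector. This is an assumption about how such a check might look, not what issue 408 specifies. The metric name collector_report_files and the paths passed in are hypothetical.

```python
from pathlib import Path


def count_report_files(report_dir: Path) -> int:
    """Count regular files anywhere under report_dir."""
    return sum(1 for p in report_dir.rglob("*") if p.is_file())


def render_metric(count: int, name: str = "collector_report_files") -> str:
    """Render a single gauge in the Prometheus text exposition format."""
    return (
        f"# HELP {name} Number of report files present on the collector.\n"
        f"# TYPE {name} gauge\n"
        f"{name} {count}\n"
    )


def write_metric(report_dir: Path, out_file: Path) -> None:
    """Write the gauge to out_file via a temporary file and a rename,
    so a reader never sees a half-written file."""
    tmp = out_file.with_suffix(".tmp")
    tmp.write_text(render_metric(count_report_files(report_dir)))
    tmp.replace(out_file)
```

An alert on this gauge growing over several days would have flagged the stalled renaming well before disk space became a concern.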
