The root cause was that the newly deployed ams-ps2 collector did not have all the required directories created. As a result, when the daily cronjob that renames the files (/srv/collector/bin/daily-tasks.sh) found an empty file, it could not move it to the correct destination.
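To illustrate the failure mode, here is a minimal sketch of a move step that creates the destination directory before moving a file into it. It is not the code of daily-tasks.sh or of the actual fix; the function name `move_into` and the example paths are hypothetical and exist only for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def move_into(src: Path, dest_dir: Path) -> Path:
    """Move src into dest_dir, creating dest_dir first if it is missing."""
    # Without this call, the move fails when dest_dir was never created,
    # which is the situation described for the freshly deployed collector.
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / src.name
    shutil.move(str(src), str(target))
    return target


if __name__ == "__main__":
    # Hypothetical layout in a throwaway directory, not the collector's real one.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        empty_file = root / "incoming" / "report.json"
        empty_file.parent.mkdir()
        empty_file.touch()
        moved = move_into(empty_file, root / "archive" / "empty")
        print("moved to", moved)
```

Creating the directory at move time makes the job tolerant of a host that was provisioned without the full directory tree, at the cost of hiding a provisioning gap that might be worth alerting on separately.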
What went well:
Alerting was useful and allowed us to spot the issue
What went wrong:
The incident had actually been going on since the 3rd of December (when this collector was deployed), and we did not notice it until the disk space problem surfaced
Timeline (times in UTC+1):
federico 13:38 oh https://mon.ooni.nu/prometheus/graph?g0.range_input=12h&g0.expr=8%20*%20node_filesystem_avail%20-%207%20*%20(node_filesystem_avail%20OFFSET%201d)%20%3C%200&g0.tab=0 false alarm?
hellais 16:05 Probably related to an increased rate of msmts
@federico it actually is true: https://mon.ooni.nu/prometheus/graph?g0.range_input=1d&g0.expr=(8%20*%20node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%2Cmountpoint%3D%22%2F%22%7D-%207%20*%20(node_filesystem_avail%7Binstance%3D%22mia-ps2.ooni.nu%3A9100%22%7D%20offset%201d))&g0.tab=0
Hum, according to: https://mon.ooni.nu/grafana/d/AE8tFfxWk/collectors-disk-activity?orgId=1&from=now-2d&to=now
The file count on mia-ps2 is going up
18:33 hellais runs ./play deploy-pipeline.yml on top of clean 06b8e01 (master)
Which implements the fix: https://github.com/ooni/sysadmin/commit/ba0392d55973b60edf27849bc9a4489b3a867ccc
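For reference, the Prometheus expression linked in the timeline decodes to `8 * node_filesystem_avail - 7 * (node_filesystem_avail offset 1d) < 0`. Algebraically this equals `avail_now + 7 * (avail_now - avail_1d_ago) < 0`, which can be read as a linear extrapolation of the last day's change seven days ahead. The sketch below reproduces that arithmetic; the function name and the sample numbers are made up for illustration, and the seven-day reading is an interpretation of the expression rather than something stated in the alert rule.

```python
def disk_fill_alert(avail_now: float, avail_1d_ago: float, horizon_days: int = 7) -> bool:
    """Return True if available space, extrapolated linearly, drops below zero."""
    # Change in available bytes over the last day (negative when the disk fills up).
    daily_change = avail_now - avail_1d_ago
    # For horizon_days = 7 this equals 8 * avail_now - 7 * avail_1d_ago.
    projected = avail_now + horizon_days * daily_change
    return projected < 0


# Illustrative values in GiB, not measurements from the incident.
losing_two_per_day = disk_fill_alert(avail_now=10, avail_1d_ago=12)
losing_one_per_day = disk_fill_alert(avail_now=10, avail_1d_ago=11)
print(losing_two_per_day, losing_one_per_day)
```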
What we should do to prevent relapse: