pulibrary / ops-catchall

Operations Catch All

CheckMK: lae-staging2 disk issues #110

Closed by acozine 1 month ago

acozine commented 1 month ago

We get regular alerts about disk usage on this machine. A quick look shows that the syslog is enormous and the current weekly rotation is not keeping up with it:

pulsys@lae-staging2:~$ sudo du -h --max-depth=1 /var/log
4.1G    /var/log/journal
148K    /var/log/apt
72K     /var/log/unattended-upgrades
4.0K    /var/log/dist-upgrade
1.1M    /var/log/installer
4.0K    /var/log/landscape
4.0K    /var/log/private
56K     /var/log/redis
105M    /var/log/nginx
43G     /var/log

and

pulsys@lae-staging2:/var/log$ ls -lah syslog*
-rw-r----- 1 syslog adm  20G Oct  3 22:00 syslog
-rw-r----- 1 syslog adm  18G Sep 29 00:00 syslog.1
-rw-r----- 1 syslog adm 507K Sep 22 00:00 syslog.2.gz
-rw-r----- 1 syslog adm 462K Sep 15 00:00 syslog.3.gz
-rw-r----- 1 syslog adm 456K Sep  8 00:00 syslog.4.gz

We should see if we can figure out why the syslog is so chatty, and also rotate the file more frequently.
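One way to start on the "why is it so chatty" question is to count syslog lines per program tag. The following is a minimal sketch, not something that was run for this issue: the function name `top_syslog_tags` is made up for illustration, and it assumes the traditional rsyslog line layout (`Mon DD HH:MM:SS host tag[pid]: message`), so the field number would need adjusting if the machine logs ISO timestamps.

```shell
# Count syslog lines per program tag, noisiest first.
# Assumption: traditional rsyslog layout, so the tag is the 5th whitespace-separated field.
# Usage (hypothetical): top_syslog_tags /path/to/syslog-sample 10
top_syslog_tags() {
  file="$1"
  limit="${2:-10}"   # number of tags to show, default 10
  awk '{
         tag = $5
         sub(/\[[0-9]+\]:?$/, "", tag)   # strip a trailing "[pid]:" if present
         sub(/:$/, "", tag)              # strip a bare trailing ":"
         count[tag]++
       }
       END { for (t in count) print count[t], t }' "$file" |
    sort -rn | head -n "$limit"
}
```

With a 20G file it is probably faster to run this against a sample, for example the output of `sudo tail -n 200000 /var/log/syslog` saved to a scratch file, than against the whole log.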

acozine commented 1 month ago

As a stopgap, I manually edited /etc/logrotate.d/rsyslog and changed the frequency from weekly to daily. We should still fix this in a way that persists if we rebuild the machines, and figure out the root cause.
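For reference, the manual edit can be written as a repeatable command. This is a sketch under assumptions: `set_daily_rotation` is a hypothetical helper name, GNU sed is assumed, and the stanza is assumed to contain `weekly` on a line by itself. It does not make the change persistent; that would still need to go into whatever provisions the machine.

```shell
# Switch a logrotate config from weekly to daily rotation, in place.
# Only lines that consist solely of the "weekly" directive are changed;
# indentation is preserved. No backup copy is left next to the config,
# since a stray duplicate in /etc/logrotate.d could itself be read by logrotate.
# Usage (hypothetical; needs root to write to the file):
#   set_daily_rotation /etc/logrotate.d/rsyslog
#   sudo logrotate --debug /etc/logrotate.d/rsyslog   # dry run, changes nothing
set_daily_rotation() {
  conf="$1"
  sed -i -E 's/^([[:space:]]*)weekly[[:space:]]*$/\1daily/' "$conf"
}
```

The `--debug` run only reports what logrotate would do with the new setting, so it is a low-risk way to check the edited file before the next scheduled rotation.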

acozine commented 1 month ago

@tpendragon deleted the LAE staging queue, which must have been created in the past with parameters different from those currently set up in the client. This stopped the flow of errors.