google / stenographer

Stenographer is a packet capture solution which aims to quickly spool all packets to disk, then provide simple, fast access to subsets of those packets. Discussion/announcements at stenographer@googlegroups.com
Apache License 2.0

cleanup deletes too many files #189

Closed xoh closed 6 years ago

xoh commented 6 years ago

When the disk becomes full (i.e. free space drops below DiskFreePercentage), too many files get deleted. The issue is tedious to reproduce, as the disk has to fill up first. Monitoring used disk space irregularly, I've noticed it drops from 95% to below 80%, sometimes even lower. As far as I understand, only the oldest packet file should be deleted per cleanup (file sizes on my setup range from 20 MB to 350 MB), to preserve the largest possible capture timeframe.
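To illustrate the behavior I'd expect: a cleanup pass should delete the oldest file, recheck free space, and stop as soon as it's back above the threshold. A minimal simulation of that loop (the helper and file sizes here are hypothetical, not stenographer's actual code):

```go
package main

import "fmt"

// pruneOldest simulates the cleanup behavior I'd expect: walk the
// files oldest-first, delete one at a time, and stop as soon as free
// space is back above minFreePct. All sizes are in bytes; the file
// list is hypothetical.
func pruneOldest(capacity, used, minFreePct uint64, oldestFirst []uint64) int {
	deleted := 0
	for _, size := range oldestFirst {
		free := capacity - used
		if free*100/capacity >= minFreePct {
			break // enough free space again; stop deleting
		}
		used -= size
		deleted++
	}
	return deleted
}

func main() {
	const gib = uint64(1) << 30
	// 50 GiB disk at 94% used, 10% minimum free, ~300 MiB files.
	files := make([]uint64, 20)
	for i := range files {
		files[i] = 300 << 20
	}
	fmt.Println(pruneOldest(50*gib, 47*gib, 10, files)) // prints 7
}
```

With that logic, each cleanup should only ever free a few hundred MB, not the multi-GB drops I'm seeing below.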

I tried to trigger debug outputs like https://github.com/google/stenographer/blob/6f293479207d3f2ffd51424854cac8a149124dc5/thread/thread.go#L204 or https://github.com/google/stenographer/blob/6f293479207d3f2ffd51424854cac8a149124dc5/thread/thread.go#L229 but nothing appears in syslog/journalctl. Setting higher verbosity either fails for me:

Setting "Flags": ["-v=10"] in /etc/stenographer/config triggers /usr/bin/stenotype: invalid option -- '=' when stenographer starts

or results in way more log messages, but none matching those of cleanUpOnLowDiskSpace() or pruneOldestThreadFiles() in stenographer/thread/thread.go:

Setting "Flags": ["-vvvvvvvvvv"] in /etc/stenographer/config leads to tons of logs, mostly "BlockDone", "New block" and "BlockReset"

Almost all logs are from .cc files, a few from index.h, and some log stenoread requests. Not a single log line in journalctl or /var/log/syslog corresponds to the log calls in stenographer/thread/thread.go.
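For reference, the shape of the config I'm using, assuming the standard layout from the README (paths and the threshold value here are illustrative); only the repeated -v form is accepted by stenotype:

```json
{
  "Threads": [
    {
      "PacketsDirectory": "/mnt/steno_drive/packets",
      "IndexDirectory": "/mnt/steno_drive/index",
      "DiskFreePercentage": 10
    }
  ],
  "Flags": ["-vvvvvvvvvv"]
}
```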

I've now set up some quick-and-dirty monitoring to show the issue:

Timestamp                      Filesystem      Size  Used Avail Use% Mounted on
Wed May 23 11:27:31 CEST 2018: /dev/sdb1        50G   44G  3,0G  94% /mnt/steno_drive
Wed May 23 11:27:46 CEST 2018: /dev/sdb1        50G   43G  3,6G  93% /mnt/steno_drive

Wed May 23 11:32:31 CEST 2018: /dev/sdb1        50G   44G  3,0G  94% /mnt/steno_drive
Wed May 23 11:32:46 CEST 2018: /dev/sdb1        50G   44G  3,5G  93% /mnt/steno_drive

Wed May 23 11:35:31 CEST 2018: /dev/sdb1        50G   44G  3,0G  94% /mnt/steno_drive
Wed May 23 11:35:46 CEST 2018: /dev/sdb1        50G   36G   11G  78% /mnt/steno_drive

Wed May 23 11:36:01 CEST 2018: /dev/sdb1        50G   36G   11G  78% /mnt/steno_drive
Wed May 23 11:36:16 CEST 2018: /dev/sdb1        50G   36G   11G  77% /mnt/steno_drive

Wed May 23 12:28:46 CEST 2018: /dev/sdb1        50G   44G  3,0G  94% /mnt/steno_drive
Wed May 23 12:29:01 CEST 2018: /dev/sdb1        50G   43G  3,8G  92% /mnt/steno_drive

Wed May 23 12:37:01 CEST 2018: /dev/sdb1        50G   44G  3,0G  94% /mnt/steno_drive
Wed May 23 12:37:16 CEST 2018: /dev/sdb1        50G   23G   24G  50% /mnt/steno_drive

Wed May 23 14:28:31 CEST 2018: /dev/sdb1        50G   44G  3,0G  94% /mnt/steno_drive
Wed May 23 14:28:46 CEST 2018: /dev/sdb1        50G   30G   18G  64% /mnt/steno_drive

The third, sixth, and seventh cleanups above delete far too many files.
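For scale: assuming an average file size of ~200 MiB (a hypothetical figure within the 20-350 MB range mentioned above), a quick back-of-the-envelope count of files removed per cleanup:

```go
package main

import "fmt"

// filesFreed estimates how many packet files a cleanup removed,
// given the space it freed and an assumed average file size.
func filesFreed(freedBytes, avgFileSize uint64) uint64 {
	return freedBytes / avgFileSize
}

func main() {
	const gib = uint64(1) << 30
	const avg = 200 << 20 // assumed ~200 MiB average file size (hypothetical)
	// Third cleanup: used dropped 44G -> 36G, i.e. ~8 GiB freed.
	fmt.Println(filesFreed(8*gib, avg)) // roughly 40 files
	// Sixth cleanup: 44G -> 23G, ~21 GiB freed.
	fmt.Println(filesFreed(21*gib, avg)) // roughly 107 files
}
```

So a single cleanup pass is removing dozens of files, not one.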

gconnell commented 6 years ago

Out of curiosity, was https://github.com/google/stenographer/commit/6f293479207d3f2ffd51424854cac8a149124dc5 compiled in when you saw this behavior? It was designed to fix this exact issue.

xoh commented 6 years ago

Actually, no. I've now compiled the latest version; I'll get back after some cleanups have run.

xoh commented 6 years ago

That was indeed the issue. No cleanups larger than 2% anymore. Thanks!

gconnell commented 6 years ago

Awesome, thanks for checking back in and closing things out! And thanks to fming188 for pointing out the root cause in #187 :)