vergoh / vnstat

vnStat - a network traffic monitor for Linux and BSD
GNU General Public License v2.0

do not exit with non-zero exit code when using `--alert` flag and `Failed to open database "/var/lib/vnstat/vnstat.db" in read-only mode.` throws #259

Open n0099 opened 2 months ago

n0099 commented 2 months ago

I'm using this plain bash script as a fuse for a monthly data limit:

#!/bin/bash
#set -x
set -e

if ! vnstat --alert 2 3 m rx 500 GiB -i eth0; then
    iptables -A OUTPUT -p udp --sport 443 -j DROP
    iptables -A OUTPUT -p tcp --sport 443 -j DROP
    iptables -A OUTPUT -p tcp --sport 80 -j DROP
    exit 1
fi

but when the system is under high load, vnstat occasionally runs into https://github.com/vergoh/vnstat/issues/129

Apr 13 08:10:00 azure systemd[1]: Starting n0099-vnstat-alert.service...
Apr 13 08:10:05 azure vnstat-alert.sh[113965]: Error: Failed to get info value for "dbversion" from database (5): database is locked
Apr 13 08:10:05 azure vnstat-alert.sh[113965]: Error: Failed to open database "/var/lib/vnstat/vnstat.db" in read-only mode.
Apr 13 08:10:05 azure systemd[1]: n0099-vnstat-alert.service: Main process exited, code=exited, status=1/FAILURE
Apr 13 08:10:05 azure systemd[1]: n0099-vnstat-alert.service: Failed with result 'exit-code'.
Apr 13 08:10:05 azure systemd[1]: Failed to start n0099-vnstat-alert.service.

and it then exits with a non-zero exit code, which causes a false-positive alert.
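Until the exit status itself can tell the two cases apart, one possible workaround is to look at the captured output as well. The sketch below is only an illustration: `classify_alert` is a hypothetical helper, and it assumes that real failures always print a line starting with `Error:` (as in the journal excerpt above) while an exceeded limit never does.

```shell
#!/bin/bash
# classify_alert STATUS OUTPUT
# Prints "ok", "error" or "limit" for one vnstat --alert run.
# Assumption: real failures print a line starting with "Error:".
classify_alert() {
    if [ "$1" -eq 0 ]; then
        echo ok
    elif printf '%s\n' "$2" | grep -q '^Error:'; then
        echo error      # e.g. "database is locked", not a traffic alert
    else
        echo limit      # non-zero exit without an error message
    fi
}

# Intended use (commented out so the sketch has no side effects):
#   output=$(vnstat --alert 2 3 m rx 500 GiB -i eth0 2>&1) && status=0 || status=$?
#   if [ "$(classify_alert "$status" "$output")" = limit ]; then
#       iptables -A OUTPUT -p tcp --sport 443 -j DROP
#   fi
```

With this, a locked database would be reported as `error` and the iptables rules would only be added for `limit`.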

vergoh commented 2 months ago

Have you tested if setting DatabaseWriteAheadLogging to 1 (and then restarting the daemon) solves the situation?

n0099 commented 2 months ago

I'll try this. Also, I'm using zfs on /var/lib; is this issue similar to https://github.com/atuinsh/atuin/issues/952?

vergoh commented 2 months ago

I don't have any practical experience with using zfs myself. A quick search for sqlite and zfs suggests that enabling WAL (which is what DatabaseWriteAheadLogging does) can help. You should also check the daemon logs, as it should produce a warning whenever a database write takes longer than 4 seconds. The frequency of those warnings before and after the DatabaseWriteAheadLogging configuration change should give some indication of whether that helps or not. The read timeout is configured at compile time with DBREADTIMEOUTSECS to 5 seconds.
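To make the before/after comparison of those warnings less manual, the journal lines could be summarised with a small filter. This is a rough sketch: `slow_write_summary` is a hypothetical helper name, and it assumes the warning text has the form "Writing cached data to database took N seconds." with the duration as the second-to-last word.

```shell
#!/bin/bash
# Read journal lines on standard input and print how many slow-write
# warnings were seen and the longest reported duration in seconds.
slow_write_summary() {
    awk '
        /Writing cached data to database took/ {
            secs = $(NF - 1) + 0        # word before "seconds." is the duration
            count++
            if (secs > max) max = secs
        }
        END { printf "count=%d max=%.1f\n", count, max }
    '
}

# Intended use (commented out; needs journal access):
#   journalctl -u vnstat -g took --since "2 days ago" | slow_write_summary
```

Running it once for a period before the configuration change and once for a period after gives two count/max pairs to compare.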

n0099 commented 2 months ago

$ sudo journalctl -u vnstat -g took
Apr 11 18:30:16 azure vnstatd[2169]: Warning: Writing cached data to database took 39.1 seconds.
Apr 11 18:30:16 azure vnstatd[2169]: Warning: Writing cached data to database took 7.5 seconds.
-- Boot c954ea177ebd4c759c98ef0e1fa04a87 --
Apr 13 03:20:11 azure vnstatd[1889]: Warning: Writing cached data to database took 4.3 seconds.
-- Boot c9c52571a5c7441f95967b2ca23cda41 --
Apr 13 08:10:06 azure vnstatd[1941]: Warning: Writing cached data to database took 6.3 seconds.
Apr 13 12:45:06 azure vnstatd[1941]: Warning: Writing cached data to database took 5.2 seconds.

vergoh commented 2 months ago

Let me know if that setting helped with the read situation or not, as I'm not exactly sure whether having WAL enabled also avoids the read-only error while slow writes are being done at the same time. With ZFS, the DatabaseSynchronous setting may also be worth investigating. Part of the slow-write issue with sqlite could be due to sqlite trying to ensure that the writes have completed while ZFS is also doing that internally at the same time, resulting in unnecessarily duplicated checks.
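Before changing either setting, it may help to confirm what the configuration file currently says. The following is a minimal sketch with several assumptions: `conf_value` is a hypothetical helper, the file is assumed to use "Keyword value" lines, lines starting with `#` or `;` are assumed to be comments, and `/etc/vnstat.conf` in the usage comment is only a guess at the file's location.

```shell
#!/bin/bash
# conf_value FILE KEYWORD
# Print the value of the first uncommented "KEYWORD value" line,
# or "unset" if the keyword is absent or only present in a comment.
conf_value() {
    awk -v key="$2" '
        /^[ \t]*[#;]/ { next }                    # skip comment lines
        $1 == key     { print $2; found = 1; exit }
        END           { if (!found) print "unset" }
    ' "$1"
}

# Intended use (commented out; the path is an assumption):
#   conf_value /etc/vnstat.conf DatabaseWriteAheadLogging
#   conf_value /etc/vnstat.conf DatabaseSynchronous
```

A result of `unset` would mean the daemon is running with whatever default it was built with for that keyword.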

As for making it easier to detect the source of the exit status that's usually evaluated when --alert is used, I'll see whether adding exit options 4 and 5, which would match the current 2 and 3 but use exit status 2 (instead of the 1 that all the other errors also use), would be the ideal solution, or whether some sort of --actual-errors-do-not-exit-1 parameter would be better.
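For illustration, a calling script could branch on the result roughly as follows if the first idea were adopted. This is purely hypothetical: exit options 4 and 5 are only a proposal in this thread, and `action_for_status` and its return words are made-up names for the sketch.

```shell
#!/bin/bash
# action_for_status STATUS
# Map an exit status to an action under the PROPOSED scheme (not confirmed
# as implemented): 0 = limit not reached, 2 = limit reached, and any other
# status = a real error such as a locked database.
action_for_status() {
    case "$1" in
        0) echo none  ;;   # nothing to do
        2) echo block ;;   # limit reached: apply the firewall rules
        *) echo retry ;;   # real error: do not treat it as an alert
    esac
}

# Intended use with the hypothetical exit option 5 (commented out):
#   vnstat --alert 2 5 m rx 500 GiB -i eth0 && status=0 || status=$?
#   [ "$(action_for_status "$status")" = block ] && iptables -A OUTPUT -p tcp --sport 443 -j DROP
```

The point of the separate status is that the `retry` branch no longer overlaps with the `block` branch, which is what causes the false positive today.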