oetiker / znapzend

zfs backup with remote capabilities and mbuffer integration.
www.znapzend.org
GNU General Public License v3.0

Very hard to monitor znapzend in production #419

Closed: saurabhnanda closed this issue 3 years ago

saurabhnanda commented 5 years ago

I'm building a Grafana dashboard to make sure that the following tasks are running as per schedule:

All znapzend logs are being pushed to a DB and I'm looking for the following patterns to ascertain whether these tasks were successful or not:

However, even if the send/receive fails, the following line is still emitted to the logs: "send/receive worker for $DATASET done". This makes detecting errors very difficult.

The only way to detect errors via logs is to look for the following pattern:

But this is not ideal, because it cannot catch the case where znapzend doesn't run at all.

This, coupled with https://github.com/oetiker/znapzend/issues/367, makes monitoring znapzend very difficult in production.

matveevandrey commented 5 years ago

I would suggest monitoring the destination side with a simple bash script that lists all ZFS datasets and finds the latest snapshot of each. Then compare its age with the maximum age allowed and warn if needed.

This kind of solution has more advantages than just relying on what the sender side says (lies?).
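A minimal sketch of the destination-side check described above, written in Python rather than bash so the parsing can be tested separately from the `zfs` call. It assumes the destination host has the `zfs` CLI and that `zfs list -H -p -t snapshot -o name,creation` prints tab-separated snapshot names with creation times in epoch seconds. The function names and the two-hour threshold are illustrative assumptions, not part of znapzend.

```python
import subprocess
import time


def latest_snapshot_times(zfs_list_output):
    """Map each dataset to the creation time (epoch seconds) of its newest snapshot."""
    latest = {}
    for line in zfs_list_output.splitlines():
        if not line.strip():
            continue
        # Each line looks like "pool/dataset@snapname<TAB>1700000000".
        name, creation = line.split("\t")
        dataset = name.partition("@")[0]
        created = int(creation)
        if created > latest.get(dataset, -1):
            latest[dataset] = created
    return latest


def stale_datasets(latest, max_age_seconds, now=None):
    """Return {dataset: age_in_seconds} for datasets whose newest snapshot is too old."""
    now = time.time() if now is None else now
    return {
        dataset: int(now - created)
        for dataset, created in latest.items()
        if now - created > max_age_seconds
    }


def read_snapshots():
    """Run zfs list on this host. -H: no header, tab-separated; -p: raw numeric values."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-p", "-t", "snapshot", "-o", "name,creation"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


def main(max_age_seconds=2 * 60 * 60):  # assumed threshold: two hours
    stale = stale_datasets(latest_snapshot_times(read_snapshots()), max_age_seconds)
    for dataset, age in sorted(stale.items()):
        print(f"WARNING: newest snapshot of {dataset} is {age} seconds old")
    # Non-zero exit status lets cron or a monitoring agent alert on staleness.
    return 1 if stale else 0
```

Calling `main()` from cron or a monitoring agent on the destination would flag a stalled backup even when znapzend never ran, since the check does not depend on sender logs. One limitation of this sketch: a dataset with no snapshots at all does not appear in the output, so it would need a separate check.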

matveevandrey commented 5 years ago

If I'm not mistaken, I've seen some ZFS Zabbix templates with discovery and triggers that analyze dataset size. I'm pretty sure such a template could easily be extended to store the time (secs/mins/hours/days) passed since the latest snapshot for each dataset.

gabviv73 commented 4 years ago

I've modified an existing Zabbix template to monitor snapshot age: GabrieleV/zabbix-zfs-on-linux

psy0rz commented 3 years ago

ZFS-autobackup does consistent, strict error reporting, and its exit codes are reliable. Check it out at: https://github.com/psy0rz/zfs_autobackup

There is also an example of how to monitor it with Zabbix.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.