Fix possible incorrect behaviour of backup_and_restore.sh with --delete-days parameter

selivan commented 2 years ago

Summary

backup_and_restore.sh has a --delete-days parameter whose implementation is very straightforward:

find ${BACKUP_LOCATION}/mailcow-* -maxdepth 0 -mmin +$((${1}*60*24)) -exec rm -rvf {} \;

The script also runs the rotation without checking whether the backup succeeded.

This opens the door to two very bad scenarios:

Something goes wrong, and every backup taken from then on is broken. For example, mariabackup starts failing. Rotation keeps working, so after N days the user has N corrupted backups and zero good ones.
The server goes offline for longer than the --delete-days value used in the cron job. Once it is back online, the next run deletes every backup except the one just taken.

The two scenarios can combine: the server goes offline for N+M days, then comes back online, but now docker stops working because /var/lib/docker failed to mount. The backup volume, however, mounted correctly, so the cron job creates a broken backup and deletes all the good ones.
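The second scenario can be reproduced in a scratch directory. This is only a sketch: the function name delete_older_than_days and the directory names are made up for the demo, and the find call is the one quoted above, wrapped so that it can be pointed at a temporary location.

```shell
#!/usr/bin/env bash
# Demo wrapper around the find invocation quoted in the issue.
# $1 = backup location, $2 = value that would be passed to --delete-days.
# (delete_older_than_days is a hypothetical name, not part of the real script.)
delete_older_than_days() {
  local BACKUP_LOCATION="$1"
  find "${BACKUP_LOCATION}"/mailcow-* -maxdepth 0 -mmin +$(( $2 * 60 * 24 )) -exec rm -rf {} \;
}

demo_dir=$(mktemp -d)

# Two "good" backups taken before the server went offline.
mkdir "${demo_dir}/mailcow-old-1" "${demo_dir}/mailcow-old-2"
touch -t 202001010000 "${demo_dir}/mailcow-old-1" "${demo_dir}/mailcow-old-2"

# The backup taken right after coming back online (possibly broken).
mkdir "${demo_dir}/mailcow-new"

# Rotation runs regardless of whether mailcow-new is usable.
delete_older_than_days "${demo_dir}" 3

ls "${demo_dir}"
```

After the run, only mailcow-new is left, whether or not it contains a usable backup.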

I suggest:

Add a check that the backup succeeded, and run the rotation only if it did
Replace the --delete-days parameter with a safer approach such as --number-backups-to-keep
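A minimal sketch of how the two suggestions could fit together. Everything here is an assumption for illustration: prune_old_backups and backup_then_prune are hypothetical helpers, not existing parts of backup_and_restore.sh, and the backup directory names are assumed to embed a sortable timestamp so that lexical order matches age.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the newest $2 directories matching $1/mailcow-*
# and delete the rest. Assumes names like mailcow-2024-01-31-12-00-00,
# so the sorted glob result lists the oldest backup first.
prune_old_backups() {
  local location="$1" keep="$2"
  local -a dirs=()
  local d i
  for d in "${location}"/mailcow-*/; do
    [ -d "$d" ] && dirs+=("${d%/}")
  done
  # Number of backups beyond the limit; zero or negative means nothing to do.
  local excess=$(( ${#dirs[@]} - keep ))
  for (( i = 0; i < excess; i++ )); do
    rm -rf -- "${dirs[i]}"
  done
}

# Hypothetical wrapper: run the backup command given after the first two
# arguments, and rotate only if that command exits with status 0.
backup_then_prune() {
  local location="$1" keep="$2"
  shift 2
  if "$@"; then
    prune_old_backups "$location" "$keep"
  else
    echo "backup failed, skipping rotation" >&2
    return 1
  fi
}
```

With a count-based limit, a server that was offline for a long time still keeps its last good backups, and a failed run never removes anything.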

Motivation

Users will be less likely to find themselves without a correct backup.

Also, if the backup script provides an exit code indicating success or failure, it can be integrated with monitoring.
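As a rough illustration of the monitoring side, a cron wrapper could record the exit code where a monitoring agent can read it. This is a sketch under assumptions: run_and_report and the status file format are invented for the example, and it is only useful once the backup script actually exits non-zero on failure.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the command given after the first argument,
# write "<unix timestamp> <exit code>" to the status file, and pass the
# exit code on to the caller (cron, systemd timer, monitoring check, ...).
run_and_report() {
  local status_file="$1"
  shift
  local rc=0
  "$@" || rc=$?
  printf '%s %s\n' "$(date +%s)" "$rc" > "$status_file"
  return "$rc"
}

# Possible usage (path and arguments depend on the installation):
#   run_and_report /var/tmp/mailcow-backup.status \
#     ./helper-scripts/backup_and_restore.sh backup all
```

A monitoring check can then alert when the recorded code is non-zero or when the timestamp is older than the expected backup interval.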

Additional context

When I was just starting out in IT, I learned the hard way that backups should be rotated only after checking that the newly taken backup is correct. Let's spare everybody else that experience :)

iskrant commented 1 year ago

+1

dodedodo commented 1 year ago

A discussion about using existing backup software such as borgbackup or restic might be appropriate here. This is basic functionality that has been implemented at least a dozen times. Why reinvent the wheel?
