
backup2l - A Low-Maintenance Backup/Restore Tool
https://gkiefer.github.io/backup2l
GNU General Public License v2.0

Checksum compare to find changed files? #6

Closed Apollon77 closed 3 years ago

Apollon77 commented 7 years ago

hey,

I use your backup script on some machines, one of which runs InfluxDB, and I want to back up the database dumps. When InfluxDB creates a dump, it generates one file per "shard" (internally, InfluxDB partitions its data into chunks covering e.g. 7 days each). All of these files get a fresh timestamp on every dump, even though the files for older, unchanged shards contain exactly the same data as before. Here it would be great if the file content (i.e. a checksum) were used to detect changes instead of the timestamp.

Is there any chance to get this feature?

gkiefer commented 7 years ago

If someone out there is willing to implement this feature properly and submit a patch, I will be happy to integrate it.

Some comments/guides on this (including the reasons why I personally did not miss the feature for more than 15 years - but, of course, opinions are always subjective...):

  1. The feature would need to be optional (see 2.).

  2. Calculating checksums of every file on every backup run has a very serious performance impact. Caching of checksums can partially help and is probably mandatory, but it would need to be implemented very carefully (considering unplanned interruptions, user-provided drivers, etc.).

  3. With hierarchically incremental backups, the storage wasted on redundant copies is limited to O(log N), where N is the number of backups. This mechanism alleviates the waste of storage capacity (though not the number of write operations, which may be a problem if the backup device is an SSD).
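To make point 2 concrete, here is a minimal sketch (not backup2l code; the function name and manifest format are made up) of change detection against a cached checksum manifest. Note that it still hashes every file on each run, which is exactly the performance cost described above; a real implementation would additionally skip files whose size/mtime are unchanged and only rehash the rest.

```shell
#!/bin/sh
# Sketch only: report files that are new or whose content changed since
# the last run, using a cached "checksum  path" manifest.
changed_files() {
    data=$1; cache=$2
    touch "$cache"
    new=$(mktemp); old=$(mktemp)
    # Build a fresh manifest, sorted so comm(1) can compare it.
    ( cd "$data" && find . -type f -exec md5sum {} + ) | sort > "$new"
    sort "$cache" > "$old"
    # Lines unique to the new manifest correspond to new/modified files;
    # strip the leading checksum, leaving only the path.
    comm -13 "$old" "$new" | sed 's/^[0-9a-f]*  //'
    mv "$new" "$cache"
    rm -f "$old"
}
```

On the first run everything is reported as changed (the cache is empty); afterwards only files whose content actually differs show up, regardless of their timestamps.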

boppy commented 5 years ago

I am not a big fan of storing hashes for each file. Performance and storage questions arise...

Back-of-the-envelope example: the data storage on my primary server holds about 600,000 files. With a sha256sum (64 hex digits) plus the file path (assume 36 characters per path for easier calculation) per file, that's 600,000 * 100 = 60,000,000 bytes, around 60 MB... and all of this data has to be held available while comparing.

The other way around would be to name the files "better". It's like log rotation: by default, logrotate only appends an incrementing number to the files (my.log -> my.log.1 -> my.log.2.gz), but you can configure it to append dates instead (my.log -> my.log.20181224 -> my.log.20181223.gz), so the files can be backed up easily without comparing file hashes...
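For reference, the date-stamped naming mentioned above is what logrotate's `dateext` directive does; a minimal sketch (the log path and rotation policy are illustrative):

```
/var/log/my.log {
    daily
    rotate 7
    dateext              # name rotated files my.log-20181224 instead of my.log.1
    dateformat -%Y%m%d   # optional; this is the default date format
    compress
    delaycompress        # keep the most recent rotated file uncompressed
}
```

With `dateext`, an unchanged rotated file keeps a stable name across runs, so a date-based incremental backup will not pick it up again.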

What about doing the same for your DB backup? You could put something like this into the PRE_BACKUP hook:

oldpath=$(pwd)
cd /path/to/my/db/backup   # adjust to your dump directory
# Strip the timestamp prefix so unchanged dumps keep a stable name
rename 's/^\d+T\d+Z(.*)$/backup$1/' 2*
cd "$oldpath"

Just try it. Navigate to your backup directory and run

rename -n 's/^\d+T\d+Z(.*)$/backup$1/' 2*

The -n option only prints what would be renamed, without actually renaming anything, so you can check whether it works out for you. I checked it against the file listing provided on this page and it works flawlessly.

gkiefer commented 3 years ago

Closing this for now, since the benefit of such an option is questionable and no further comments/contributions have come up in the past two years ...

Thanks @boppy for the help and explanation!