hydrusnetwork / hydrus

A personal booru-style media tagger that can import files and tags from your hard drive and popular websites. Content can be shared with other users via user-run servers.
http://hydrusnetwork.github.io/hydrus/

Incremental backup for databases #1182

Open ehawman opened 2 years ago

ehawman commented 2 years ago

I know you hate backup stuff, but hear me out as I think this is worth your time.

Given the nature of Hydrus, users are going to have a massive DB file (along with ancillary, also-large DBs). In my case that means ~60 GB of mostly-duplicate data being copied over on every single backup. Backing up to my remote target takes the better part of an hour, during which I can't use Hydrus, which has led me to delay (and forget) regular backups.

If we could back up the DB incrementally, that would shave off almost all of that time, from ~1h down to potentially a handful of seconds. Naturally there is no native SQLite3 tool to accomplish this, but it looks like someone has started on this here:
nokibsarkar / sqlite3-incremental-backup based on this StackOverflow thread

Now I could create a batch/PWSH script to clunk through this process, but I think it should be natively baked into Hydrus for two reasons:

  1. The easier backing up is, the more likely people are to do it, and the fewer gut-wrenching DBsasters people will have.
  2. Third-party tools require the DB files to be free during this process. This means Hydrus needs to shut down, and shut down cleanly and quickly, which can be easier said than done given the number of balls Hydrus juggles sometimes. A built-in incremental backup system can lock the DB, do its thing, and resume without any shutdowns, hangs, or other headaches (or at least, gracefully manage them should they arise).

Thank you for your time reading this and for creating the definitive image manager.

Zweibach commented 2 years ago

This means Hydrus needs to shut down, and shut down cleanly and quickly, which can be easier said than done given the number of balls Hydrus juggles sometimes.

https://hydrusnetwork.github.io/hydrus/developer_api.html#manage_database_lock_on
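
That lock endpoint can be scripted; here is a minimal sketch in Python (the default Client API port 45869 and the standard access-key header are assumed, and you need your own API key):

```python
import urllib.request

# Default Client API address; change it if you run the API on another port.
HYDRUS_API = "http://127.0.0.1:45869"

def build_lock_request(lock_on: bool, access_key: str) -> urllib.request.Request:
    """Build a POST to the database lock endpoint (lock_on or lock_off)."""
    endpoint = "/manage_database/lock_on" if lock_on else "/manage_database/lock_off"
    return urllib.request.Request(
        HYDRUS_API + endpoint,
        method="POST",
        headers={"Hydrus-Client-API-Access-Key": access_key},
    )

# To actually send it: urllib.request.urlopen(build_lock_request(True, my_key))
```

With the lock on, the client pauses database work so the `.db` files are safe to copy; lock_off resumes normal operation.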

hydrusnetwork commented 2 years ago

Thank you for this suggestion. My general feeling on backups is that I am not going to be able to write a backup system that is better than third-party, so if I do integrate anything into the program from now on, I want it to be basically completely plug and play right out of the box.

I've had a couple of conversations recently about different ways to do incremental backup. SQLite has an internal API for it, but as far as I know you can only really conveniently talk to it using C, whereas we are talking to SQLite using python, through a layer that can't see the backup API. This module you are pointing to uses a different but neat custom solution to do incremental backup, although I am not sure how it performs when the file it is hashing blocks of sits on a remote filesystem, and there may be other technical issues. I don't think it is available on pip either, and it requires NodeJS, which we don't currently run, so there would be additional work to integrate it into hydrus.

Since you are keen on this idea, can I ask you to test this yourself? Try figuring out a script and doing some incremental backups and let me know how it goes. If everything is great and this thing appears in barebones python on pip, I'd love to integrate it, but if that isn't possible, I'd still be interested in knowing how it goes. If you like it, maybe you could write a little guide and an example script, and I can integrate that into the help for other people in your situation?

hydrusnetwork commented 2 years ago

I just remembered, because another user was talking about it--you can compress a hydrus database using 7z very efficiently, and usually fairly quickly, which may be an easy alternative when doing a simple copy-backup to a remote location.

ehawman commented 2 years ago

@hydrusnetwork I can't get to this anytime soon, but I can throw it on my todo list for sure. I have never attempted a project like this so I can't promise anything, but I'm definitely willing to try.

As for 7z, that makes a lot of sense. You could have a backup_hydrus.ps1 that does something like this (lock endpoints per the Client API docs; the access key and paths are placeholders):

```powershell
$headers = @{ 'Hydrus-Client-API-Access-Key' = 'YOUR_ACCESS_KEY' }
$api = 'http://127.0.0.1:45869'

# Pause database work so the .db files are safe to copy.
Invoke-RestMethod -Method Post -Uri "$api/manage_database/lock_on" -Headers $headers
try {
    Set-Location \path\to\hydrus\dbs\
    7z a dbs.7z *.db
    & \path\to\this_backup_ignores_dbs.ffs_batch
}
catch {
    Write-Error "backup failed: $_"
}
finally {
    # Always release the lock, even if the backup failed.
    Invoke-RestMethod -Method Post -Uri "$api/manage_database/lock_off" -Headers $headers
}
```

Also @Zweibach thanks for the suggestion!

hydrusnetwork commented 2 years ago

@ehawman yeah, a guy on discord was saying that if you 7z all four database files into one archive, you can get it down to about 1/15th of the original size (since they share so much data with each other). I know you can get about 1/5th by doing each file on its own, so I think I'll do some testing and write this up for the backup help anyway.

ehawman commented 2 years ago

@hydrusnetwork just did a test compressing w/ 7z

7z a dbs.7z ./*.db

Input: 60.53 GB
Archive: 16.91 GB
Runtime: 44m 21s

This is on a SATA SSD. An aging AMD Ryzen 7 1800X could be a factor, but probably not a big one.

That's a little more than a quarter of the original size. Personally I'm not hurting for disk space, and I'd want this process to take as little time as possible. Maybe I could play with different settings, but I think incremental backup is going to have to save the day on time.

hydrusnetwork commented 1 year ago

It turns out the python DB API actually supports this now:

https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.backup

So incremental backup is back on the table. I have made a job for it and will think about it for a bit, and then, when I have a 'medium size job' week to spare, I'll give it a go. The function they have added seems to be everything we need to make a silent background backup.
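
For anyone wanting to try it ahead of time, the call is short; a minimal sketch (the `pages` value is illustrative, and note this copies the whole database in small steps while only briefly holding the source lock between steps--it is "incremental" in that sense, not a delta of changed blocks):

```python
import sqlite3

def backup_db(src_path: str, dst_path: str, pages: int = 1024) -> None:
    """Copy src to dst a chunk of `pages` pages at a time, so the
    source database is only briefly locked between chunks."""
    src = sqlite3.connect(src_path)
    dst = sqlite3.connect(dst_path)
    try:
        # progress is called after each chunk; handy for a status bar.
        src.backup(dst, pages=pages,
                   progress=lambda status, remaining, total: None)
    finally:
        dst.close()
        src.close()
```

`sqlite3.Connection.backup` landed in Python 3.7, which is presumably why it was missed earlier; the source can stay live and serving other connections while the copy runs.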