jazzband / django-dbbackup

Management commands to help backup and restore your project database and media files
BSD 3-Clause "New" or "Revised" License

Worth considering merging or feature-sharing? #121

Closed andybak closed 9 years ago

andybak commented 9 years ago

https://github.com/django-backup/django-backup

pkkid commented 9 years ago

Can you explain more how that might work? Our projects ultimately do the exact same thing but our code seems very different.

andybak commented 9 years ago

I hadn't thought about it terribly hard to be honest but it seems a shame to have two apps that are so similar.

I suppose we could aim for feature parity and then deprecate one. I'd need to look at your project a bit closer and you'd need to see if there was anything you didn't want to implement that we do.

At the moment - I just wanted to start a dialogue with you. It might be a terrible idea on reflection!

ZuluPro commented 9 years ago

Hello @andybak. Before starting to work on django-dbbackup, I surveyed the state of the art of Django backup apps. Here's an article about it: https://anthony-monthe.me/weblog/2015/09/11/backup-django/

What do you think about it?

Do not hesitate to ask questions about dbbackup. I'm taking a closer look at yours. Do I have to look at your fork?

andybak commented 9 years ago

Thanks @ZuluPro - very informative.

For the record - django-backup (the new version) supports rsync for media backups. And it does incremental linked rsync sets to save space.

benjaoming commented 9 years ago

For the record - django-backup (the new version) supports rsync for media backups. And it does incremental linked rsync sets to save space.

That doesn't sound like a difficult feature to implement. A good feature, nonetheless.

I think the codebase of django-dbbackup is more mature in terms of architecture and tests.

One of the features, implemented by @ZuluPro , was a StorageBackend class, which should make it easier in the future to add new types of storage engines, such as invoking rsync.
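To make the storage idea concrete, here is a minimal sketch of what such an abstraction could look like. It is not django-dbbackup's actual StorageBackend class: the names `BaseStorage`, `DirectoryStorage`, `write_file`, `list_backups` and `delete_file` are hypothetical and chosen only for illustration.

```python
import os
import shutil
from abc import ABC, abstractmethod


class BaseStorage(ABC):
    """Hypothetical interface: every storage engine offers the same three operations."""

    @abstractmethod
    def write_file(self, local_path, name):
        """Store the file at local_path under the given backup name."""

    @abstractmethod
    def list_backups(self):
        """Return the names of all stored backups."""

    @abstractmethod
    def delete_file(self, name):
        """Remove one stored backup."""


class DirectoryStorage(BaseStorage):
    """Simplest possible engine: backups live in a local directory."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def write_file(self, local_path, name):
        shutil.copy2(local_path, os.path.join(self.root, name))

    def list_backups(self):
        return sorted(os.listdir(self.root))

    def delete_file(self, name):
        os.remove(os.path.join(self.root, name))
```

Under this assumption, an rsync-based or cloud-based engine would be another subclass, and the backup command would only ever talk to the three methods above.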

ZuluPro commented 9 years ago

For me, rsync is off topic. It seems obvious to use rsync when talking about backups, but rsync brings no benefit to dbbackup's tasks: files are uploaded and downloaded one by one, and there's no need to compute a delta between local and remote.

I closed this issue: https://github.com/django-dbbackup/django-dbbackup/issues/86

andybak commented 9 years ago

rsync made a huge difference to the media backups for django-backup.

  1. Incremental backup meant the time for a nightly backup was much shorter and the amount of data transferred was less.
  2. Hard-linking unchanged files enabled a huge saving in disk usage (essentially 'Time Machine' style backups). See https://blog.interlinked.org/tutorials/rsync_time_machine.html for a summary of this technique.

benjaoming commented 9 years ago

There are two different modes of backup:

1) Snapshot the entire media catalogue and archive it.
2) Sync the latest version.

Those are different. Using rsync for 2) is pretty easy from the command line and doesn't need a Django/Python layer to complicate it, IMO.

andybak commented 9 years ago

I'm a trifle confused.

I'm talking about using rsync to snapshot the entire catalogue. Using it improves the speed and storage usage hugely.

I've got a site that has several GB of media. With rsync I can have hourly snapshots, and each one uses only a fraction of the space a full backup would, due to rsync's use of links for unchanged files.

Without rsync it would be too slow and take up too much space to have hourly backups.
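The space saving comes from linking: each snapshot is a complete directory tree, but files that did not change since the previous snapshot share their data with it. rsync offers this through its `--link-dest` option. The sketch below is a pure-Python illustration of the same idea, not what django-backup or rsync actually runs; the function name `snapshot` and its parameters are made up for the example.

```python
import filecmp
import os
import shutil


def snapshot(source, dest, previous=None):
    """Copy `source` into `dest`, hard-linking files unchanged since `previous`."""
    for dirpath, _dirnames, filenames in os.walk(source):
        rel = os.path.relpath(dirpath, source)
        target_dir = os.path.normpath(os.path.join(dest, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                # Unchanged: share the data with the previous snapshot.
                os.link(old, dst)
            else:
                # New or changed: store a fresh copy.
                shutil.copy2(src, dst)
```

Every snapshot directory can be browsed or restored on its own, and deleting an old snapshot only frees the data that no newer snapshot still links to. Note that this sketch compares full file contents, whereas rsync by default decides based on size and modification time.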

pkkid commented 9 years ago

@andybak Can you explain the process of using rsync to create snapshots a bit more? I always thought rsync just created a mirror of the data on another machine. If using rsync, would it be possible to get the state of the media directory at some point in the past, or would you only be able to retrieve the current state?

edit: perhaps this is the reading we should do: http://www.mikerubel.org/computers/rsync_snapshots/ -- It makes sense to me: it hard-links the unchanged files (the linking you mentioned above).

benjaoming commented 9 years ago

@andybak yes but it's still in the 2) department :) -- what are the reasons why you need a Python/Django layer for this? Is it significantly more convenient? Are you storing snapshot logs in a Django environment? I'm trying to think of reasons not to just write a crontab entry :)

andybak commented 9 years ago

I'm trying to think of reasons not to just write a crontab entry :)

Doesn't the same argument apply to db backups? For me "backing up a website" is a task that consists of:

  1. Defining a schedule
  2. Deciding how many historical backups to retain
  3. A destination location
  4. Actually doing the backups
  5. Optional but very handy - the ability to restore backups

The substance of a backup has to be both the db and media files - it's not a backup at all without both of these. My database isn't much use if uploaded files are missing so I want both backed up. And it makes sense to manage them from the same app so that destinations and schedules are handled the same way.

benjaoming commented 9 years ago

@andybak I agree, but there's a subtle, sensitive problem about overlapping with rsync, because it has a really complicated CLI that I'd find it hard to put in django-dbbackup for the sake of file copying.

It's worth discussing :)

django-dbbackup has its name from being centered on the database, making it possible to easily reuse the database configuration for having a single, native interface for creating backups of the database without worrying about which engine it is.
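As a rough illustration of "a single interface regardless of engine", the sketch below maps the `ENGINE` value of a Django `DATABASES` entry to a dump command. It is not django-dbbackup's real code: `DUMP_COMMANDS` and `dump_command` are hypothetical names, and a real implementation would also need to handle passwords, empty host or port values, and restore.

```python
# Hypothetical mapping from Django database engines to dump command builders.
DUMP_COMMANDS = {
    "django.db.backends.postgresql": lambda db: [
        "pg_dump", "--host", db["HOST"], "--port", str(db["PORT"]),
        "--username", db["USER"], db["NAME"],
    ],
    "django.db.backends.mysql": lambda db: [
        "mysqldump", "--host=" + db["HOST"], "--port=" + str(db["PORT"]),
        "--user=" + db["USER"], db["NAME"],
    ],
    "django.db.backends.sqlite3": lambda db: ["sqlite3", db["NAME"], ".dump"],
}


def dump_command(db_settings):
    """Return the argv list that would dump the database described by db_settings."""
    engine = db_settings["ENGINE"]
    try:
        build = DUMP_COMMANDS[engine]
    except KeyError:
        raise ValueError("No dump command known for engine %r" % engine)
    return build(db_settings)
```

The caller never names the engine: it passes the project's existing database settings and gets back a command suitable for something like `subprocess.run`.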

If this were to translate to the media directories, we should investigate how we can do the same thing for various cloud storage engines.

benjaoming commented 9 years ago

On the subject of scheduling backups, I don't think we should handle what crontab does to perfection.

pkkid commented 9 years ago

django-dbbackup has its name from being centered on the database, making it possible to easily reuse the database configuration for having a single, native interface for creating backups of the database without worrying about which engine it is.

This should be our introduction and mission statement. Very clearly stated.

benjaoming commented 9 years ago

Glad we are in tune :D

andybak commented 9 years ago

I don't think we should handle what crontab does to perfection.

The advantage is to combine scheduling and retention info in the same setting. For example:

BACKUP_DATABASE_COPIES = {
    'hourly': 24,
    'daily': 7,
    'weekly': 4,
    'monthly': 12,
}

would be a typical configuration for django-backup. It means "I want you to keep the 24 most recent hourly backups. After that keep 1 for each day for the preceding week. For backups older than a week just keep 1 per week" etc...
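One way such a setting could be interpreted is sketched below. This is an assumption about the general technique, not django-backup's actual implementation: backups are grouped into hour, day, ISO-week and month buckets, and the newest backup of each of the N most recent buckets is kept. The names `BUCKET_KEYS` and `backups_to_keep` are invented for the example.

```python
from datetime import datetime, timedelta

# How each period groups timestamps into buckets.
BUCKET_KEYS = {
    "hourly": lambda t: (t.year, t.month, t.day, t.hour),
    "daily": lambda t: (t.year, t.month, t.day),
    "weekly": lambda t: tuple(t.isocalendar()[:2]),  # (ISO year, ISO week)
    "monthly": lambda t: (t.year, t.month),
}


def backups_to_keep(timestamps, copies):
    """Return the set of backup timestamps that the retention policy keeps."""
    keep = set()
    newest_first = sorted(timestamps, reverse=True)
    for period, count in copies.items():
        key = BUCKET_KEYS[period]
        seen = []
        for stamp in newest_first:
            bucket = key(stamp)
            if bucket in seen:
                continue  # a newer backup already represents this bucket
            if len(seen) == count:
                break  # enough buckets kept for this period
            seen.append(bucket)
            keep.add(stamp)
    return keep


# Example: three days of hourly backups.
start = datetime(2015, 9, 1)
stamps = [start + timedelta(hours=h) for h in range(72)]
kept = backups_to_keep(stamps, {"hourly": 24, "daily": 7})
```

Anything not in the returned set would be a candidate for deletion.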

(we currently don't control the time each backup occurs but this is a feature that's been requested)

Does django-dbbackup handle retention?

I agree, but there's a subtle, sensitive problem about overlapping with rsync, because it has a really complicated CLI that I'd find it hard to put in django-dbbackup for the sake of file copying.

Which brings us back to my original suggestion. This is all stuff already handled by django-backup and fairly battle-tested. If you want to borrow it then it's fairly simple code.

benjaoming commented 9 years ago

@andybak Not sure how the implementation would work out, @ZuluPro is working on a new base class for DbCommand which I think is going to handle it.

I don't think we should handle scheduling at all. Look at this syntax:

https://en.wikipedia.org/wiki/Cron#CRON_expression

For retention, it's possible to use various script-based approaches; there are many off-the-shelf backup script solutions for this.

Main use case call stack:

crontab -> backup.sh -> manage.py dbbackup

backup.sh can handle retention

I've seen various cases of trying to write python-based schedulers. I don't really like any of them, I still prefer crontab calling django management commands :) Celery is also another option... but kind of bloated and insane to set up if you wanna be up and running in 10 minutes.

andybak commented 9 years ago

Again I think we're talking at cross-purposes. Django-backup isn't responsible for running itself on a schedule. You just ensure the management command is run frequently enough (once per hour for a daily backup etc) and it handles everything else.

And handling retention is surely a core feature for a backup app? How many apps do I need to handle backups! One preferably...

benjaoming commented 9 years ago

@andybak running the app "frequently enough" ? This would be a good example of where the architecture starts breaking. So you want crontab to run it once per hour, but not to manage the schedule?

Regarding retention, then yes, no problem IMO handling it elsewhere. Looks like this:

DB_BACKUP_ROOT="$BACKUP_ROOT/db-$(date +%Y-%m-%d)"
DB_BACKUP_ROOT_MONTHLY="$BACKUP_ROOT/db-monthly"
mkdir -p "$DB_BACKUP_ROOT"

# Remove backup folders older than 10 days (but keep the monthly copy)
find "$BACKUP_ROOT" -maxdepth 1 -type d -name 'db-*' ! -name 'db-monthly' -mtime +10 -exec rm -rf {} +

# Put your own logic here, like calling dbbackup
# ...

if [[ "$1" == "--monthly" ]]
then
    cp -R "$DB_BACKUP_ROOT" "$DB_BACKUP_ROOT_MONTHLY"
fi

In your crontab, you can have a cron entry that calls backup.sh with --monthly.

andybak commented 9 years ago

This would be a good example of where the architecture starts breaking.

I don't follow. Strikes me as a good way to separate the duties of a backup app from a cron/task runner. "If you want daily backups ensure you're running it more than once a day" The backup app then has a clear task of checking when it last ran and deciding if another backup is needed.
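The check itself can be very small. A minimal sketch, assuming the app records the time of its last successful backup somewhere (the names `INTERVALS` and `backup_is_due` are hypothetical, not taken from django-backup):

```python
from datetime import datetime, timedelta

# Minimum time between two backups for each period.
INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
}


def backup_is_due(last_backup, now, period):
    """Return True if no backup exists yet or the last one is older than the period."""
    if last_backup is None:
        return True
    return now - last_backup >= INTERVALS[period]
```

The cron job only has to call the command often enough; a run that finds nothing due simply exits.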

Regarding retention, then yes, no problem IMO handling it elsewhere. Looks like this:

The ideal number of bash scripts in my toolchain is zero. ;-)

ZuluPro commented 9 years ago

The ideal number of bash scripts in my toolchain is zero. ;-)

As in the functional tests, the primary API of dbbackup is django.core.management.execute_from_command_line.

See https://github.com/django-dbbackup/django-dbbackup/blob/master/dbbackup/tests/functional/test_commands.py

After that, you can use it through manage.py, Celery, or whatever.

I don't follow. Strikes me as a good way to separate the duties of a backup app from a cron/task runner. "If you want daily backups ensure you're running it more than once a day" The backup app then has a clear task of checking when it last ran and deciding if another backup is needed.

Dbbackup is designed as a third-party Django app for backing up and restoring a Django project. I think its nice trick is how quickly you can get started. I think scheduling won't be one of its tasks and will be delegated to developers.

Regarding retention

I think it is the only scheduling aspect everybody wants to have, and we leave open the possibility to disable it.

benjaoming commented 9 years ago

The ideal number of bash scripts in my toolchain is zero. ;-)

So conceptually, it would look like this... loosely translated... but I mean it's the same concept.

import os
import shutil
import sys
from datetime import datetime

# Usage: backup.py BACKUP_ROOT [--monthly]
backup_root = sys.argv[1]
now = datetime.now()
DB_BACKUP_ROOT = os.path.join(backup_root, "db-{:%Y-%m-%d}".format(now))
DB_BACKUP_ROOT_MONTHLY = os.path.join(backup_root, "db-monthly")

os.makedirs(DB_BACKUP_ROOT, exist_ok=True)

# Remove backup folders older than 10 days (but keep the monthly copy)
for node in os.listdir(backup_root):
    fullpath = os.path.join(backup_root, node)
    if node.startswith("db-") and node != "db-monthly" and os.path.isdir(fullpath):
        modified = datetime.fromtimestamp(os.path.getmtime(fullpath))
        if (now - modified).days > 10:
            shutil.rmtree(fullpath)

# Put your own logic here, like calling dbbackup
# ...

if len(sys.argv) > 2 and sys.argv[2] == "--monthly":
    # dirs_exist_ok requires Python 3.8+
    shutil.copytree(DB_BACKUP_ROOT, DB_BACKUP_ROOT_MONTHLY, dirs_exist_ok=True)

andybak commented 9 years ago

That seems to be just "remove backups older than x"

benjaoming commented 9 years ago

@andybak the last line of code makes a snapshot of the latest and copies it to monthly backups. But this is just an example of expressing your custom scheduling/retention needs in a Python script.

I think it's about drawing a clear line in the sand to the user and say:

Look, if you want to just quickly dump backups of your database and media files (i.e. all data of a project), then here are the tools. If you want to schedule it in a specific way and have retention on various levels, you're better off writing a script and using tools like rsync and crontab.

Like, what would you do if the user wants monthly data stored on server A and other backups on server B? What if they wanna only backup depending on file size? What if they want their monthlies as complete snapshots and weeklies as incremental? What if they wanna exclude all *.cache files? What if they don't want backups on Sundays? What if a once-per-hour job takes over one hour? What if you have one single mission-critical database that you need backed up every 10 minutes?

We're not here to solve all that, because other tools do so much better :)

But I love to see another django backup tool and other approaches to this! Backup is a broad area so I'm sure there's room for lots of POVs :)

andybak commented 9 years ago

This has been a useful discussion then and it's made it clear there's still a place for django-backup as it has different aims.