jeffaco / duplicacy-util

Utility to schedule and run duplicacy backup via the command line
Apache License 2.0

Backing up to multiple storages: an issue with one storage results in no backups to subsequent storages #30

Closed: Ossssip closed this issue 5 years ago

Ossssip commented 6 years ago

Hi, I have configured duplicacy-util to back up to multiple storages:

storage:
    -   name: sftp_storage_1
        threads: 1
    -   name: sftp_storage_2
        threads: 1
    -   name: cloud_provider
        threads: 1

Recently I discovered that if the first storage is not accessible (offline), duplicacy-util catches error code 100 from duplicacy and stops without running backups to the other storages. Is this the intended behaviour? Backing up to multiple destinations is meant to improve reliability (if for some reason one destination fails, another will still have a snapshot for a given day), but it does not work like that with duplicacy-util.

jeffaco commented 6 years ago

@Ossssip Yes, this is the intended behavior, although I'm open to discussion here.

While I routinely back up to multiple cloud providers, I don't want random (different) backups in the cloud. I want two consistent (and valid) backups in the cloud, kept in sync with duplicacy copy. That way, if I need to change a cloud provider, I don't lose historical backups, and everything is consistent.

In this case, my expectation is that if a backup fails, you get an error notification (via email), so you know to take a look. And then, assuming you resolve that, then the next backup will be fine.

Now, that said, there are a few things in play here:

  1. I would like to finish the checkpoint code, so in this case the backup would resume where it left off. Say, for example, that sftp_storage_1 worked but sftp_storage_2 failed. I think "proper" behavior would be to resume at sftp_storage_2 the next time duplicacy-util ran, let the copy operation make everything consistent, and off you go. Check-pointing is close to done, but things are reliable enough for me that I never got around to finishing it. It was de-prioritized since I rarely need it.

  2. In your case, checkpoints wouldn't help you, as the backups would still stop.

  3. Duplicacy itself is not always resilient to intermittent failures, so I'm thinking carefully about implementing a "retry" mechanism (via configuration file changes) to allow a limited (settable) number of retries on failure. If duplicacy itself retried errors properly, that wouldn't be necessary, but across all the supported storages that sometimes seems to be a tall order. This wouldn't have helped you either, though, since the storage was offline until further action was taken.

If the backup resumed on the next storage, then a failed storage would not cause a problem. BUT strange behavior would come up when the failed storage came back online, and the backups wouldn't be 100% consistent (although, arguably, maybe consistent enough). If this is okay, then perhaps a new configuration flag, say resume-on-error, could be added to change this behavior.
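
For illustration, a rough sketch of where such a resume-on-error flag might sit, reusing the storage list format from the original report; the flag is only a proposal at this point, so its name and placement are not an existing duplicacy-util option:

resume-on-error: true    # proposed flag (not an existing option): continue with the next storage after a failure

storage:
    -   name: sftp_storage_1
        threads: 1
    -   name: sftp_storage_2
        threads: 1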

Again, I'm open to discussion here. What are your thoughts, keeping the desire for consistent backups in mind?

Ossssip commented 6 years ago

Having a consistent set of backups (same revisions) on different storages seems to be the most straightforward and logical approach. However, in my opinion it is not the best way to get a fully unattended backup solution, given the current options provided by duplicacy. The shortcomings are:

  1. Running a backup to one remote storage and then copying it from that first storage to the other(s) is (a) slow and causes network overhead (re-downloading chunks from the first location and uploading them to the others), and (b) completely dependent on the reliability of the first storage: if it is offline, no snapshot will be created for that duplicacy run.
  2. Running backups to multiple remote storages one after another does not guarantee that you will get the same snapshots on these storages (especially if one of the storages is offline for a given duplicacy run).

I believe that backing up to multiple storages is a better strategy than backing up to one and copying to the others: it is faster and it has higher redundancy. But it is unrealistic to keep a given snapshot consistent across all the storages in the long run. Instead of trying to solve the (in)consistency issue, one can avoid it completely! The trick is to use a different snapshot id per storage in the duplicacy preferences file. It looks like this:

[
    {
        "name": "sftp_storage_1",
        "id": "backup__sftp_storage_1",
        …
    },
    {
        "name": "sftp_storage_2",
        "id": "backup__sftp_storage_2",
        …
    },
    {
        "name": "cloud_provider",
        "id": "backup__cloud_provider",
        …
    }
]

What happens during such a backup? If all storages are available, duplicacy runs the backups and all three storages receive the same set of file chunks (or a very similar set, in case some files changed in between, which is unlikely since the backups are fast), plus a few unique snapshot chunks for each storage. If one of the storages is offline, it will not get that day's backup, but the other two storages will. At any point in time I have an almost consistent set of file chunks across all storages and unique snapshot chunks on each storage. At some point (monthly) I cross-copy everything between these three storages. As a result, I have:

Now, coming back to the behavior of duplicacy-util: as you mentioned, in my case I do not need checkpoints either. So far I have solved the issue for myself by having separate duplicacy-util configs for each storage (sketched below), and I am pretty happy with that solution, as I can see immediately from the email notification header which storage has trouble.
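
As a sketch of that workaround (the file names are just examples), each duplicacy-util config file lists a single storage, so a failure of one storage can never block the backups to the others:

# backup_sftp_storage_1.yml
storage:
    -   name: sftp_storage_1
        threads: 1

# backup_cloud_provider.yml
storage:
    -   name: cloud_provider
        threads: 1

Each config is then scheduled on its own and produces its own e-mail notification.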

I am not sure that keeping backups consistent should be handled through sophisticated checkpoint logic on the duplicacy-util side. Maybe it is more reasonable to raise it with @gilbertchen and request additional functionality in duplicacy itself, such as options to sync storages and to back up to multiple storages at once? What do you think?

jeffaco commented 6 years ago

Some of this gets into philosophical differences, and I'd really rather avoid those since there's no "right" or "wrong" when it comes down to that; it just comes down to preferences.

In your example, most of that premise doesn't hold. If you look at the duplicacy documentation, it specifically talks about how to set things up so that you do NOT have to copy chunks down from one cloud service to put them on another. You can do that, but as you point out, it's not efficient.

Instead, I copy (from local disks) to cloud service one. Then I copy (from local disks) to cloud service two. These copies happen in the middle of the night, when the shares are stable and aren't changing. But even if they did change, I then do a copy from cloud service one to cloud service two. In that case, you ONLY get the changed chunks that aren't already in cloud service two. So this does, indeed, give you two identical backups (if the files were changing, you'd STILL get identical backups, but more chunks might be copied during the duplicacy copy operation). And because so little changes, I generally see only three or four chunks actually copied.
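
One way this could look in the configuration format shown earlier, assuming the local repository is backed up to each cloud storage and the two are then synchronized with a copy (the copy section, storage names, and thread counts are illustrative assumptions, not a statement of duplicacy-util's exact schema):

# Back up the local repository to each cloud service directly.
storage:
    -   name: cloud_service_1
        threads: 4
    -   name: cloud_service_2
        threads: 4

# Then copy from cloud service one to cloud service two; only chunks missing
# from cloud service two are transferred, which keeps the two backups identical.
copy:
    -   from: cloud_service_1
        to: cloud_service_2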

Now, checkpoints (or what I call checkpoints) would guarantee that this is maintained, even if a cloud service is down. But you may not get the resiliency you want, since an entire backup interval could be skipped until things even out (if checkpoints were implemented and working). Even then, when the dust settled, you'd end up with otherwise identical backups with identical chunks.

Doing a complete copy from one cloud service to another can be very expensive indeed, as you're downloading a potentially large number of chunks and uploading them to the other. Keep in mind that some of my backup sets run around 2.4TB of data. That is WAY too big to download from one cloud provider and upload to another (in my opinion, anyway).

In my case, my cloud providers either have been or are a combination of AWS, Google Enterprise Cloud, Google Drive (not the enterprise service), Wasabi, Backblaze B2, and Azure. All of those are highly reliable, highly available services in my experience, so I haven't run into your sort of problem. Google Drive is kind of crummy performance-wise with certain operations, but it's effectively free for me, so I deal with it.

It would be very nice if duplicacy itself could back up directly to multiple storages in a single run. But I'm guessing (without running it by Gilbert) that it wouldn't be well received: he has a LOT of outstanding issues already, and I'm not sure he'd want to bite that one off. But I think I will mention it to him and see how it's received; it would definitely be nice.

So, bringing this back to duplicacy-util and how it should behave:

  1. I wrote it to be flexible. Even if it does what I want, it may not do what you want. That's what the configuration file is for; we can have different behaviors.

  2. Independent of this, duplicacy isn't always careful about handling intermittent errors. Thus, I am contemplating a retry option on a per-cloud-provider basis to retry certain types of errors. For example, I'm contemplating that retry: 5 would retry up to 5 times. Since duplicacy will de-dup anyway, it won't upload the same data multiple times. But that's a general feature and wouldn't help in your case (when a cloud provider is down).

  3. To handle your case, perhaps a backup-wide option (in a backup YAML file) like continueOnError: true might continue even if errors occur. So in this case, one of the backups would work, but a backup to a failing cloud provider (or a copy from/to a failing cloud provider) would fail. This would give you what you want at the cost of loss of consistency between backups. In your case, you don’t care, but I certainly would given my 2.4TB backup sets.

  4. As I consider this, checkpointing may not be what some people want. So perhaps, when I do that, I'll make it controllable with a backup-wide option like checkpoint: false (to disable it). Having checkpoints enabled together with continueOnError would be nonsensical, I think, so I probably wouldn't allow that specific combination of settings.

Would a continueOnError setting in the backup YAML file give you what you want?
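
To make that concrete, here is a hedged sketch of how these proposed options might look in a backup YAML file; none of these keys exist yet, and their names and placement are only the proposals above:

continueOnError: true        # proposed backup-wide option: keep going with the remaining storages if one fails
checkpoint: false            # proposed backup-wide option: disable checkpointing

storage:
    -   name: sftp_storage_1
        threads: 1
        retry: 5             # proposed per-storage option: retry intermittent errors up to 5 times
    -   name: cloud_provider
        threads: 1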

Ossssip commented 6 years ago

I agree that there is no single perfect and universal rule; different cases need different approaches. Having a local backup and copying it to reliable cloud storages is probably one of the best, but I do not have local backups on most of my machines, and my remote storages are not so reliable, which is why I use such a complicated/awkward backup scheme. I believe continueOnError would be perfect for me, as would an option to retry.

jeffaco commented 6 years ago

I don't really trust local backups. They are good for immediate restores, sure. But so are trash bins/deleted items (depending on the OS) if you inadvertently deleted something. If something happens to my computer, or to the location where the computer is, a local backup is of no value.

My complete backup scheme is multi-pronged:

  1. Key data is kept on a RAID volume. So the loss of a single disk (or even of two disks) would not result in any data loss for me. But this is resiliency, NOT a backup.

  2. I back up to a local external encrypted disk that is swapped with a counterpart in a safe deposit box weekly (the data in the safe deposit box is one week old at worst). This gives me relatively immediate access to my data, assuming my home is intact. It is external, so if something happens to the computer itself, the data is still intact. And if I need to restore 2.4TB of data, I'm not doing that over an internet connection (much faster to get most of my data back).

  3. Automated encrypted backups to multiple cloud providers that I deem reliable (listed in prior posting here). Unreliable cloud providers mean that my data is unreliable, and unreliable backup data makes the backups of minimal value unless you're lucky. I don't want my backups to rely on luck.

  4. As mentioned above, I do use two reliable backup providers, and they are kept in sync as described in Backing up to multiple storages. This has a number of key advantages for me: in-sync backups (so I don't struggle at restore time figuring out what is where), and migrating to a new backup provider doesn't mean dumping all my history. That was the case in the past, and it sucked. Being able to get something back from six months ago is of real value to me, even if I changed cloud providers two months ago.

But this is me, and while I think good backup hygiene is important, I certainly don't want to be preachy. You're absolutely right in saying that there is no one perfect and universal rule. Backups have to be right for me and my lifestyle. Otherwise, over time, I get lazy and backups don't happen. And that's the worst thing of all: no recent backups when I need them.

You also have to take other considerations into account: how important is data loss to you? If data loss isn't a big deal, then backups are probably not worthwhile at all. If some data loss is tolerable, then perhaps a local backup is good enough (although automation is important to make sure things happen when they should).

Anyway, this is digressing. I'll look at continueOnError along with a retry option. It would be nice if retry weren't necessary (duplicacy should be resilient to errors that can be retried). Gilbert is working on that, but duplicacy today is not as resilient as I'd like.

jeffaco commented 5 years ago

I don't think I'll do retry, as I just heard from Gilbert on this post:

I did look at it before releasing 2.1.2 and realized that the fix was in 2.1.1 but somehow it didn't catch the broken pipe error. My plan is to add retrying to the chunk uploader which would be a cleaner and more general fix. And it should be in 2.1.3.

So, given this, I don't think retry makes sense anymore. Cool.

I still agree that continueOnError would be useful for you, @Ossssip, and I hope to take a look at this soon. It's next on my list.

jeffaco commented 5 years ago

@Ossssip This has been idle for a while. I'd like to close out this issue, either by punting the feature or doing the work. Given that others haven't asked for it, I can only assume that it's mostly important to you.

So, some questions:

  1. Are you still looking for this feature? Or have you worked around it? If you have worked around it, I can just close this issue.
  2. If you are still looking for this feature: Would you like to do a PR for this? If your PR works and doesn't break things, I'd definitely take it! 😄
  3. If you are still looking for this feature, but can't or won't do a PR, then I can look at it. I have other things I'd rather be doing, but this has been outstanding long enough ...

So let me know your thoughts, thanks!

Ossssip commented 5 years ago

Dear @jeffaco, thanks for coming back to this issue. I think it can be closed, as I am fully satisfied with the existing version of your tool -- I have created separate configs for different storages, and it works well for me.