jborg / attic

Deduplicating backup program

Support for Amazon S3? #136

Open victorhooi opened 9 years ago

victorhooi commented 9 years ago

Is there any way at all to use Amazon S3 as a remote store? I gather that it's built on git, and hence currently it works as long as your remote host has SSH and the git command.

But I'm curious if a S3 backend would be feasible at all?

positron commented 9 years ago

I had this idea as well. Obviously using plain S3 wouldn't work since you need something running on the back-end, but using the newly announced AWS Lambda might allow you to use S3 as a file store and short-running lambda processes to do the backend processing.

Caveat: I know next to nothing about attic or Lambda.

siteroller commented 9 years ago

+1 - If there were some way to support Amazon it would be a huge win. Considering that even a little bit of bit rot can destroy a huge amount of backed-up data, it is critically important that the backups are stored in the most durable way possible. Amazon offers redundant storage for $0.03/GB/month or less, and I would trust Amazon much more than I would any other VPS, even if I could find one that offered 100GB of space for the same $3/mo. Even better would be a way to back up to Glacier. Meanwhile, this is a deal breaker.

Ernest0x commented 9 years ago

Perhaps it would help if those of you who are interested could try something like s3ql and report the results.

siteroller commented 9 years ago

This is the first I've heard of S3QL; it looks neat, but I haven't been able to get it to install on 12.04.
Will post once we have results.

Considering that S3QL does dedup and encryption, what does Attic add in this case?
Why S3QL instead of any of the many alternatives?

jdchristensen commented 9 years ago

victorhooi: attic is not based on git. It needs to have a copy of attic running on the remote host, so I don't think it could use Amazon S3 directly. But if you mount some Amazon S3 storage locally, e.g. using S3QL, it should work fine. (It would be interesting if someone tested this and commented on how fast it is.)

siteroller: I don't know much about S3QL, but it sounds to me like it does deduplication based on fixed block positions. Attic uses a rolling hash method to determine block boundaries, so it should be more space efficient.
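Something along these lines is what I have in mind (completely untested; the bucket name, mount point and paths are just placeholders):

    mkfs.s3ql s3://my-backup-bucket             # one-time: create the S3QL filesystem in the bucket
    mkdir -p /mnt/s3ql
    mount.s3ql s3://my-backup-bucket /mnt/s3ql  # mount it locally over FUSE
    attic init /mnt/s3ql/repo.attic             # the attic repository lives on the mounted filesystem
    attic create /mnt/s3ql/repo.attic::$(date +%Y-%m-%d) /home /etc
    umount.s3ql /mnt/s3ql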

Ernest0x commented 9 years ago

@siteroller: S3QL and Attic are two different things. On one hand, Attic is an archiving utility that needs a filesystem to create its repositories on. On the other hand, S3QL can provide such a filesystem on top of S3 (or other storage services). I have not tested this combination myself, and there may be some performance issues, but it seems like a feasible scenario for those who want to use S3. Of course, as you said, there are alternatives to S3QL. So far, I am not aware of any results with either S3QL or any other alternative.

The extra layers (deduplication, encryption, etc.) that S3QL provides may be useless when used in combination with Attic, since Attic does these things too at the repository level. So it makes sense to turn them off at the filesystem level, if possible, in order to reduce performance penalties. That said, if it were possible to turn off compression of Attic repositories (currently it is not), it would be an interesting experiment to see how much benefit deduplication at the filesystem level (done by S3QL) adds on top of deduplication at the repository level (done by Attic), since that would also deduplicate data across multiple Attic repositories.
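If someone does try it, I believe S3QL at least lets you skip its own encryption and compression (this is from memory, so check the mkfs.s3ql/mount.s3ql man pages):

    mkfs.s3ql --plain s3://my-backup-bucket                     # create the filesystem without S3QL's encryption
    mount.s3ql --compress none s3://my-backup-bucket /mnt/s3ql  # mount without S3QL's compression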

jscinoz commented 9 years ago

@positron Unfortunately, Amazon's Lambda functions only officially support nodejs (although python is available in the environment), and can only live for 60 seconds per request. It could be possible to hack something together that did a request per block or per X blocks, but I imagine that'd require some fairly substantial changes to attic.

It could be possible to instead use boto to automatically spin up a minimal Docker image on ECS to run attic serve and, once any necessary processing was complete, store the data in S3, but this seems a bit of a hack and is sure to be inefficient. You could explore more involved, but potentially more efficient, solutions, such as analysing the repository via an ECS-hosted instance of attic while requiring the client to insert files directly into S3, but this quickly grows rather complex and, I'd argue, is somewhat beyond the scope of a simple backup tool.
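Very roughly, the ECS variant would be something like the following (shown with the AWS CLI rather than boto; the cluster and task-definition names are invented, and this ignores networking, teardown and actually getting the repository data into S3):

    # start a one-off container that runs "attic serve" (the task definition would need to exist already)
    aws ecs run-task --cluster backup-cluster --task-definition attic-serve --count 1
    # ... run the backup against it ...
    # stop the task again once the backup has finished
    aws ecs stop-task --cluster backup-cluster --task <task-arn>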

It may well be best to back up to a locally mounted s3ql instance, although, as @Ernest0x has pointed out, this does result in some duplication of functionality (compression, encryption & de-duplication).

geckolinux commented 9 years ago

Hi everyone,

I'm interested in using Attic to back up my webserver to an Amazon S3 bucket. I've been using Duplicity, but I'm sick of the full/incremental model, as well as the difficulty of pruning backups. I love the ease of use and features that Attic provides, but I don't really understand the internals and I'm not sure whether it will work with Amazon S3 storage.

Specifically, I'm considering mounting my S3 bucket over FUSE, using one of the following three options:

Any comments on which, if any, would be more appropriate? And how tolerant would Attic be of S3's "eventual consistency" weirdness?

Additionally, I need to plan against the worst-case scenario of a hacker getting root access to my server and deleting the backups on S3 using the stored credentials on my server. To eliminate this possibility, I was thinking about enabling S3 versioning on the bucket so that files deleted with my server's S3 user account can still be recovered via my main Amazon user account. Then, I would have S3 lifecycle management configured to delete all versions of deleted files after X amount of time. In this case,

Again, my concerns are based on me not really understanding all the black magic that happens with all the chunks and indexes inside an Attic repository, and how much they change from one backup to the next.
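In case it helps make the scenario concrete, the versioning and lifecycle part would be something like this with the AWS CLI (the bucket name is made up and the 90-day window is just an example for "X amount of time"):

    aws s3api put-bucket-versioning --bucket my-backup-bucket \
        --versioning-configuration Status=Enabled
    aws s3api put-bucket-lifecycle-configuration --bucket my-backup-bucket \
        --lifecycle-configuration '{
          "Rules": [{
            "ID": "expire-old-versions",
            "Status": "Enabled",
            "Prefix": "",
            "NoncurrentVersionExpiration": { "NoncurrentDays": 90 }
          }]
        }'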

Thanks in advance for the help!

ammojamo commented 9 years ago

Another possible solution, what about using s3cmd sync?

This would involve first making a backup to a local repository directory, then running s3cmd sync --delete-removed ... to sync the local directory to S3.
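Roughly like this (the repository path and bucket name are invented):

    attic create /srv/backups/repo.attic::$(date +%Y-%m-%d) /home /etc
    s3cmd sync --delete-removed /srv/backups/repo.attic/ s3://my-backup-bucket/repo.attic/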

Caveats:

positron commented 9 years ago

@sb56637 Not related to this issue, but you should use IAM roles so that if a hacker gets root access to your server, the only permission they have is s3:PutObject to a single bucket.
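For example, a minimal write-only policy would look something like this (the bucket and user names are made up, and the exact actions needed depend on how attic actually talks to the bucket):

    cat > backup-put-only.json <<'EOF'
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": "s3:PutObject",
        "Resource": "arn:aws:s3:::my-backup-bucket/*"
      }]
    }
    EOF
    aws iam put-user-policy --user-name backup-writer \
        --policy-name backup-put-only --policy-document file://backup-put-only.json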

ovizii commented 9 years ago

I'm a bit late to the party but I only discovered borgbackup now :-)

I'm about to convert from duplicity, which I must say has been doing a brilliant job for me for years because it supports all sorts of dumb remotes.

So my question is:

Are there any other remotes on your roadmap for borgbackup, e.g. S3 or SFTP?

I'm about to start using borgbackup and can use NFS instead of (S)FTP, and I'm also about to give iSCSI a try.

ThomasWaldmann commented 9 years ago

@ovizii please note that this is the issue tracker for attic.

ovizii commented 9 years ago

LOL, really sorry about that, will locate the appropriate place and ask again.