tywkeene / autobd

autobd is an automated, networked and containerized backup solution
http://tywkeene.github.io/autobd/
MIT License

Versioned backups. #54

Open tywkeene opened 7 years ago

tywkeene commented 7 years ago

Implement a way to keep track of backups.

E.g: By date/timestamp or revision of file

nytopop commented 7 years ago

Hmm, I've been thinking about how versioning can be integrated for a bit, and I can't think of anything that doesn't change up the local directory tree on the nodes.

Right now, afaik, the remote and local trees are near copies, except for files that were deleted from the source and remain in the node's index. Adding versioning has to change something about that, although it is possible to keep one directory tree on the node with all revisions, and a separate one with only the latest versions for local access through the filesystem.

Ideas:

Track directory tree and file version information in a database on the nodes, and store the actual files / directories as normal, except with their names replaced by hashes.

This means no access through the local filesystem; you have to go through the application. In return you get really fast lookups thanks to the database's indexing, as well as the ability to store different versions of the same file in a deterministic way. This would be especially helpful, as node/node.go seems to do a quadratic recursive loop through linked hashmaps, which is very slow and memory intensive when you have massive directory trees to compare.
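To illustrate the lookup point, here is a minimal sketch of comparing two indexes keyed by path in a single pass. The `Entry` type and `Diff` function are hypothetical names for illustration and are not taken from autobd; a real database index would play the role of the in-memory maps.

```go
package main

import (
	"fmt"
	"sort"
)

// Entry is a hypothetical index record; the fields are illustrative,
// not autobd's actual types.
type Entry struct {
	Path   string
	SHA512 string
}

// Diff returns the paths in remote that are missing from local or whose
// hash differs. Both maps are keyed by path, so this is one pass over
// remote with a constant-time lookup per entry.
func Diff(remote, local map[string]Entry) []string {
	var changed []string
	for path, r := range remote {
		l, ok := local[path]
		if !ok || l.SHA512 != r.SHA512 {
			changed = append(changed, path)
		}
	}
	sort.Strings(changed) // map iteration order is random; sort for stable output
	return changed
}

func main() {
	remote := map[string]Entry{
		"root/a": {Path: "root/a", SHA512: "aaa"},
		"root/b": {Path: "root/b", SHA512: "bbb2"},
		"root/c": {Path: "root/c", SHA512: "ccc"},
	}
	local := map[string]Entry{
		"root/a": {Path: "root/a", SHA512: "aaa"},
		"root/b": {Path: "root/b", SHA512: "bbb1"},
	}
	fmt.Println(Diff(remote, local)) // prints [root/b root/c]
}
```

The short hash strings are placeholders; the point is only that a keyed lookup avoids walking one tree for every entry in the other.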

So, in this form you would have a SQL table on the nodes that looks like:

path | sha512 | size | version info (modified time?)

And a directory structure on the nodes that looks like:

ls -R
.:
really long sha512 of the root

./really long sha512 of the root:
more sha512 of the files

This would still preserve most file attributes, except the path. That data would be relegated to the database, so it would be a problem if the database was corrupted / deleted. As above ^, a second directory tree can be kept to hold a copy of the latest versions of files in human-readable form.

One issue I can think of with this storage mechanism is that you might run into file name length limits. Maybe part of the hash could be cut off? The full-length hash should be in the db anyway.
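A rough sketch of how a record for the table above and a shortened on-disk name could be derived, assuming Go's standard `crypto/sha512`. The `Record` type, `NewRecord`, `BlobName`, and the 32-character truncation length are all assumptions made for illustration.

```go
package main

import (
	"crypto/sha512"
	"encoding/hex"
	"fmt"
)

// Record mirrors the table columns sketched above; the names are hypothetical.
type Record struct {
	Path    string
	SHA512  string // full hex digest, 128 characters
	Size    int64
	ModTime int64 // Unix seconds, standing in for "version info"
}

// nameLen is an assumed truncation length for on-disk names.
const nameLen = 32

// NewRecord hashes data and builds the index record for it.
func NewRecord(path string, data []byte, modTime int64) Record {
	sum := sha512.Sum512(data)
	return Record{
		Path:    path,
		SHA512:  hex.EncodeToString(sum[:]),
		Size:    int64(len(data)),
		ModTime: modTime,
	}
}

// BlobName returns the shortened on-disk name; the full digest stays in the record.
func (r Record) BlobName() string {
	return r.SHA512[:nameLen]
}

func main() {
	r := NewRecord("root/b", []byte("extra data\n"), 1484631641)
	fmt.Println(len(r.SHA512), len(r.BlobName()), r.Size) // prints 128 32 11
}
```

Truncating makes name collisions possible in principle, so the full digest stored in the record would still be needed to tell two blobs apart.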

Store revisions in separate directories grouped by revision date. Index each revision directory.

This seems like it would be slow / cumbersome, compare operations especially, as you have to check from the newest index to the oldest to see whether the file was recently changed.

Starting on the server:

mkdir root; touch root/a; touch root/b
ls -R
.:
root

./root:
a  b

sync files with autobd
echo "extra data" > root/b
sync again

You'd end up with this directory structure on the node:

ls -R
.:
1484631632  1484631641

./1484631632:
root

./1484631632/root:
a  b

./1484631641:
root

./1484631641/root:
b
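A small sketch of the newest-to-oldest lookup this layout implies, using the two revision timestamps from the listing above. The `Revisions` type and `Latest` method are hypothetical, and the in-memory map stands in for the per-revision indexes.

```go
package main

import (
	"fmt"
	"sort"
)

// Revisions maps a revision timestamp (the directory name) to the set of
// paths stored in that revision. This is an illustrative stand-in for an
// on-disk index per revision directory.
type Revisions map[int64]map[string]bool

// Latest walks the revisions newest-first and returns the timestamp of the
// newest revision that contains path.
func (r Revisions) Latest(path string) (int64, bool) {
	stamps := make([]int64, 0, len(r))
	for ts := range r {
		stamps = append(stamps, ts)
	}
	sort.Slice(stamps, func(i, j int) bool { return stamps[i] > stamps[j] })
	for _, ts := range stamps {
		if r[ts][path] {
			return ts, true
		}
	}
	return 0, false
}

func main() {
	revs := Revisions{
		1484631632: {"root/a": true, "root/b": true},
		1484631641: {"root/b": true},
	}
	for _, p := range []string{"root/a", "root/b"} {
		ts, _ := revs.Latest(p)
		fmt.Printf("%s -> %d/%s\n", p, ts, p)
	}
	// prints:
	// root/a -> 1484631632/root/a
	// root/b -> 1484631641/root/b
}
```

An unchanged file such as root/a is only found after every newer revision has been checked, which is the cost described above.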

What do you think? Any other ideas? I have some time, I could probably help out with implementing if you're open to contribution.

tywkeene commented 7 years ago

I was thinking of something simpler, perhaps just keeping a tarball of the entire tree per day in a separate directory, and pruning ones older than a certain threshold.

Feel free to go ahead and give it a go. I understand best when I can see the code. :)

After looking around and going through a few issues that are still open, I feel the need to get my bearings back before I do anything more major. But this seems like it'd be a good next step. Let me know if you have any questions, and again, thanks for the contributions. :100:

tywkeene commented 7 years ago

I'm thinking something like #63 might work. Gonna give it a try once I get through testing everything else.