juliangruber / backer

wip distributed backup / file mirroring tool
MIT License

dropbox scale #8

Open · juliangruber opened this issue 11 years ago

juliangruber commented 11 years ago

I read somewhere on the internet that dropbox only works well up to 300,000 files.

To make sure we don't hit a similar limit, let's try to find out what could break at that number.

Merkle tree size

The formula for the size of a perfect binary tree is:

(<number of files> * <number of blocks> * 2 - 1) * <hash size>

If every file is represented by only 1 SHA1 hash (20 bytes), the tree will be 12 MB in size. And in case the overhead of keeping it as JavaScript objects is too big, we can just store it in a LevelDB (which might happen anyways).

So, no problem there. But we really want to store multiple hashes per file, so we can sync file chunks instead of just whole files.

Let's say we use 1024 byte blocks, as used in the Tree Hash Exchange Format, and assume that our average file size is 200kb -> 200 blocks per file.

The resulting tree would be 2.4 GB in size, for 57 GB of files, which is 4% overhead. That's a size that can easily be dealt with in a LevelDB.
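
Quick sanity check of those numbers (just back-of-the-envelope; hash size is a parameter, so other hash sizes are a constant swap):

// rough merkle tree size estimate, one leaf per 1024 byte block
var files = 300000;   // number of files
var blocks = 200;     // blocks per file (~200kb files, 1024 byte blocks)
var hashSize = 20;    // bytes per hash: 20 for SHA1, 32 for SHA256, 64 for SHA512

var leaves = files * blocks;
var nodes = leaves * 2 - 1;            // nodes in a perfect binary tree
var treeBytes = nodes * hashSize;      // ~2.4 GB with SHA1
var dataBytes = files * blocks * 1024; // ~61e9 bytes, i.e. the 57 GB above counted in GiB

console.log('tree: ' + (treeBytes / 1e9).toFixed(1) + ' GB, overhead: '
  + (treeBytes / dataBytes * 100).toFixed(1) + '%');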

Recalculating hashes

Tree size is just one factor though. Think of the cost of recalculating trees. Whenever a file is modified, as many hashes as the tree is deep have to be recalculated.

A tree for 300,000 files, 200 blocks each, is Math.floor(log_2(300000*200*2-1))=26 nodes deep. Calculating 26 random hashes takes <1ms on my machine, so that's no problem either.
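
For reference, a quick sketch of that check (timings will vary by machine, and this only covers the path from a single changed leaf up to the root):

var crypto = require('crypto');

var leaves = 300000 * 200;
var depth = Math.floor(Math.log(leaves * 2 - 1) / Math.LN2); // 26

// rehash one leaf-to-root path: one sha1 over two 20 byte child hashes per level
var node = crypto.randomBytes(20);
var sibling = crypto.randomBytes(20);
var start = Date.now();
for (var i = 0; i < depth; i++) {
  node = crypto.createHash('sha1').update(node).update(sibling).digest();
}
console.log('depth: %d, recalculated in %d ms', depth, Date.now() - start);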

What else could break?

juliangruber commented 11 years ago

Using SHA256: Tree size: 3.8GB, tree overhead: 7%. Using SHA512: Tree size: 7.4GB, tree overhead: 13%.

dominictarr commented 11 years ago

So, I don't think that large trees will be a problem if we use my merkle module. I have been working on getting it efficient, and currently it can generate a tree for 300k hashes in less than a second (it scales linearly up to about 1 million, and then starts to slow down - seems like GC stuff).

I've only put a weekend into this so far, and it's plenty good for my current goal (npm scale, 100k objects). It can be optimized further when that becomes a problem (I'm thinking: use structs inside buffers, to get out of the v8 heap).

see my scale test https://github.com/dominictarr/merkle/blob/master/test/scale.js

The other thing I'd be worried about is how you represent the files in the current version. If you have one object with pointers to everything, that file will get very large, but if you had a tree object that only pointed to its files, or to its subdirectories (the same way git does it), each object would stay small.

This would still get large if there were many files in a single directory, but since that would also make the fs slow, people avoid it anyway.
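
Just to make that concrete, a sketch of what a git-style tree object could look like (made-up shape, nothing in backer does this yet):

// hypothetical per-directory object: it only points at its direct children,
// so a change deep in the hierarchy only rewrites the objects on the path
// from that file up to the root, not one giant index
var tree = {
  type: 'tree',
  entries: [
    { name: 'photos',    type: 'tree', hash: '<hash of the photos tree object>' },
    { name: 'notes.txt', type: 'blob', hash: '<hash of the file contents>' }
  ]
};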

Dividing the files up into chunks would work really well for append-only formats, but wouldn't be much better for files that get edits inserted at random places (like text files).

I'd probably go for larger chunks than 1024 bytes, or have larger chunks for large files.
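
A rough sketch of scaling the chunk size with the file size (the constants here are arbitrary):

// grow the chunk size with the file, so small files stay fine grained
// and huge files don't blow up the tree
function chunkSize (fileSize) {
  var size = 1024;                    // 1kb floor, as in the Tree Hash Exchange Format
  var max = 1024 * 1024;              // 1mb cap
  while (size < max && fileSize / size > 1024) size *= 2;
  return size;
}

chunkSize(200 * 1024);                // -> 1024, ~200 chunks for a 200kb file
chunkSize(10 * 1024 * 1024 * 1024);   // -> 1mb chunks for a 10gb file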

juliangruber commented 11 years ago

Yeah, I get the feeling that block-sized chunks are only good for bittorrent-like use cases where the file contents never change.

But then, if we still want to avoid syncing whole files when only parts changed, the only way I see is to sync patch files. That requires that for every change both the old and the new version of a file are available, so they can be diffed. As soon as that's done, the old version can be deleted (or kept, if we want revisions).

Because we can only add "post-hooks" to the filesystem, we'd have to store copies of every file somewhere on the fs or in a leveldb.
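
A minimal sketch of such a post-hook, assuming the copies live in a shadow directory (the paths are placeholders and the actual diffing is left out):

var fs = require('fs');
var path = require('path');

var dir = '/some/sync/dir';              // watched directory (placeholder)
var shadow = '/home/me/.backer/shadow';  // previous versions live here (placeholder)

fs.watch(dir, function (event, name) {
  var current = path.join(dir, name);
  var previous = path.join(shadow, name);

  // diff previous against current here (e.g. with a binary diff),
  // then refresh the shadow copy so the next change can be diffed too
  fs.createReadStream(current).pipe(fs.createWriteStream(previous));
});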

juliangruber commented 11 years ago

Maybe there needs to be multiple modes, based on what data you want to save.

Files that almost never change and are mostly resting on disk to maybe be used later, or are read-only like photo albums, would use block replication.

Files that you need every day are far fewer in number, but change a lot, so for those we could keep copies and only send patches.

juliangruber commented 11 years ago

from the dropbox docs:

Before transferring a file, we compare the new file to the previous version and only send the piece of the file that changed. This is called a "binary diff" and works on any file type. Dropbox compresses files (without any loss of data or quality) before transferring them as well. This way, you also never have to worry about Dropbox re-uploading a file or wasting bandwidth.

juliangruber commented 11 years ago

Just learned about rdiff, which is part of librsync. Here's how it works:

Every peer keeps a local signature file of each file in the sync directory, generated via

$ rdiff signature <file> > ~/.backer/<file>.sig

For a 150kb JSON file the signature is 3kb in size. When syncing, the peer with the older version of the file in question sends its signature to its syncing peer:

fs.createReadStream('~/.backer/<file>.sig').pipe(remoteStream);

The receiver then creates a patch file based on its newer version of the file and the signature:

$ rdiff delta <file>.sig <file> > <file>.patch

which is then sent back to the first peer:

fs.createReadStream('~/.backer/<file>.patch').pipe(remoteStream);

who applies it to its copy of the file (rdiff writes the patched result to a new file, which then replaces the old one):

$ rdiff patch <file> <file>.patch > <file>.new
$ mv <file>.new <file>

And now both have the same version.
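
Glued together from node that could look roughly like this (a sketch that shells out to rdiff; transport and temp file handling are left out, the signature and patch are assumed to already be on disk):

var fs = require('fs');
var spawn = require('child_process').spawn;

// peer A (older file): stream the stored signature to peer B
function sendSignature (sigFile, remoteStream) {
  fs.createReadStream(sigFile).pipe(remoteStream);
}

// peer B (newer file): build a patch from A's signature and its own version
function sendPatch (file, sigFile, remoteStream) {
  spawn('rdiff', ['delta', sigFile, file]).stdout.pipe(remoteStream);
}

// peer A: apply the received patch, then swap the result into place
function applyPatch (file, patchFile, cb) {
  spawn('rdiff', ['patch', file, patchFile, file + '.new'])
    .on('exit', function () {
      fs.rename(file + '.new', file, cb);
    });
}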

juliangruber commented 11 years ago

I suggest using rdiff for exchanging deltas during initial replication, in combination with merkle trees to figure out which files are different.

After initial replication, rdiff is used to forward all the changes made.

juliangruber commented 11 years ago

here's an even better rdiff approach:

Every peer keeps a local signature file of each file in the sync directory, generated via

$ rdiff signature <file> > ~/.backer/<file>.sig

and makes sure it's always up to date.

Whenever the source file changes, a peer creates a patch:

$ rdiff delta <file>.sig <file> > <file>.patch

This patch then gets replicated to its peers, who then apply it to their local versions via:

$ rdiff patch <file> <file>.patch > <file>.new
$ mv <file>.new <file>
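
A sketch of that loop (again shelling out to rdiff; the watcher and the replication step are placeholders, and the patch has to be built against the old signature before the signature is refreshed):

var fs = require('fs');
var path = require('path');
var spawn = require('child_process').spawn;

var dir = '/some/sync/dir';       // placeholder
var backer = '/home/me/.backer';  // placeholder

fs.watch(dir, function (event, name) {
  var file = path.join(dir, name);
  var sig = path.join(backer, name + '.sig');
  var patch = path.join(backer, name + '.patch');

  // 1. build the patch against the old signature
  spawn('rdiff', ['delta', sig, file, patch]).on('exit', function () {
    // 2. replicate <file>.patch to the peers (transport not shown),
    //    they apply it with rdiff patch as above
    // 3. refresh the signature so the next change diffs against this version
    spawn('rdiff', ['signature', file, sig]);
  });
});
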
dominictarr commented 11 years ago

You could use the merkle tree to figure out which objects the other side has, and then use rdiff to build a compact update and transfer the files in a "pack". This is more or less how git does it.

But then, you are pretty much writing git. Should probably focus on the 80% use case at first.

czzarr commented 11 years ago

sorry I'm a real noob here as I'm kind of out of my depth and trying to learn by following you guys, but I recently read this and thought it might be relevant. https://www.dropbox.com/developers/blog/48/how-the-datastore-api-handles-conflicts-part-1-basics-of-offline-conflict-handling

"The Dropbox server stores the full list of changes for each datastore, and the state (a.k.a. snapshot) of a datastore can be obtained by executing the entire list of changes, in sequence, starting with an empty datastore. (The practical way to obtain a snapshot is different, and more efficient. :-)

The list of changes stored on the server is structured into deltas, where each delta is a list of changes that has been labeled with a revision number. Revision numbers can also be used to refer to specific datastore snapshots (though not every snapshot is assigned a revision). The initial state of a datastore, completely empty, always has revision 0, and each delta increments the datastore's revision by 1. A delta is labeled with the revision number of the state that precedes it, so the very first delta has revision 0, and after executing it the datastore has revision 1. For this reason we sometimes call the delta's label its base revision."
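
The replay idea from that quote, reduced to a toy example (nothing to do with backer's format, just to illustrate base revisions):

// a snapshot is every delta applied in order, starting from an empty datastore;
// each delta is labeled with the revision it applies on top of (its base revision)
var deltas = [
  { rev: 0, changes: [['put', 'a', 1]] },
  { rev: 1, changes: [['put', 'b', 2], ['del', 'a']] }
];

function snapshot (deltas) {
  var state = {}, rev = 0;
  deltas.forEach(function (delta) {
    if (delta.rev !== rev) throw new Error('delta out of order');
    delta.changes.forEach(function (c) {
      if (c[0] === 'put') state[c[1]] = c[2];
      if (c[0] === 'del') delete state[c[1]];
    });
    rev++;
  });
  return { rev: rev, state: state }; // { rev: 2, state: { b: 2 } }
}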

dominictarr commented 11 years ago

@czzarr that sounds about right. You may also be interested in looking at how git works, as there are some similarities. The Git Internals section of the git site is very good: http://git-scm.com/book/en/Git-Internals