jedbrown / git-fat

Simple way to handle fat files without committing them to git, supports synchronization using rsync
BSD 2-Clause "Simplified" License

use git for fat object store #1

Open jedbrown opened 11 years ago

jedbrown commented 11 years ago

Instead of writing our own objects, we can use git to store the objects independently, tagged by the SHA1 of the content (using a lightweight tag). This is an eventual scalability problem because git does not handle large numbers of refs well -- everything slows down as the ref count grows. Further discussion here:

http://thread.gmane.org/gmane.comp.version-control.git/182158
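For concreteness, a minimal sketch of that scheme, assuming a made-up fat/ tag namespace (not something git-fat actually defines):

```sh
# illustrative only: store a fat file as a loose blob, then pin it with a
# lightweight tag under an invented fat/ namespace so gc won't prune it
sha=$(git hash-object -w --stdin < big.bin)
git tag "fat/$sha" "$sha"
# one ref per object -- this is exactly what makes the advertisement grow
git push origin "refs/tags/fat/$sha"
```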

Alternatively, the objects could be packed up into commits, but because a client may want an arbitrary subset of the objects, talking to remotes would not be as simple as push/pull of some fixed refs; we would probably need to create a new commit packing up exactly what is requested.
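A rough sketch of what that could look like with stock plumbing, assuming two requested blob SHA1s in $sha1 and $sha2 and an invented refs/fat/request ref name:

```sh
# build a tree holding exactly the requested blobs, wrap it in a commit,
# and push that commit so the remote ends up with precisely this subset
tree=$(printf '100644 blob %s\t%s\n' "$sha1" obj1 "$sha2" obj2 | git mktree)
commit=$(git commit-tree -m "fat objects for one request" "$tree")
git push origin "$commit:refs/fat/request"
```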

jedbrown commented 11 years ago

Without upstream changes in git, this is likely to be a performance hit because making two passes over the data is unavoidable (git's object header encodes the size up front, so the full size must be known before hashing can begin). Two possible resolutions:

  1. Add an optional command-line argument to pass the object size to a clean filter, and teach git hash-object --stdin to recognize an additional argument containing the uncompressed size. This changes the SHA1 in the git-fat cleaned stub from a hash of the raw content to a hash of blob SIZE\0...data... (see the worked example after this list), thus requiring some migration for existing users. We could keep the SHA1 of the plain content at the expense of putting a SHA1 filter in front of git hash-object --stdin, but computing two SHA1s of the data seems excessive.
  2. Add a new raw object type so that a raw zlib-compressed file can be stored under its own SHA1. The size could be provided externally during decompression (because we have it). This approach seems unlikely to be accepted upstream absent some other use case relevant to git itself.
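For reference, the blob SIZE\0 format from option 1 can be reproduced outside git; this matches what git hash-object computes today:

```sh
size=$(wc -c < big.bin | tr -d ' ')              # uncompressed byte count
{ printf 'blob %s\0' "$size"; cat big.bin; } | sha1sum
# prints the same id as: git hash-object big.bin
```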
jedbrown commented 11 years ago

The jc/hidden-refs branch seems to have the support we'd like for suppressing advertisement of hidden refs. Setting a server-side configuration variable uploadpack.hiddenrefs=refs/fat will hide those refs from the server's advertisement (so that git ls-remote is always small).

http://thread.gmane.org/gmane.comp.version-control.git/215054/focus=215897
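In git as eventually released, the variable is spelled uploadpack.hideRefs (with transfer.hideRefs covering receive-pack as well), so the server-side setting would be something like:

```sh
# on the machine serving the repository:
git config uploadpack.hideRefs refs/fat
# clients no longer see refs/fat/* in the advertisement:
git ls-remote origin        # output stays small
```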

kynan commented 11 years ago

How about directly writing git packfiles instead of commits? This might avoid having to pack/repack into commits.

You might want to have a look at bup: it writes packfiles directly and also splits large files with rolling checksums. I know you want to avoid external dependencies, but maybe you could borrow some ideas.

Disclaimer: I don't know git-fat or bup in detail, so this might not make any sense.

jedbrown commented 11 years ago

Hmm, I don't see writing separate files as such a big deal because we need to repack to use delta compression anyway. Also, I think the usual case is to have larger files rather than a zillion tiny files (which is where packfiles shine).
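(The repack mentioned here is just git's stock one, e.g.:)

```sh
git repack -adf   # rewrite all objects into one pack, recomputing deltas
```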

Does bup support selective transfer? It might be interesting to support bup as an alternative backend. (Note that git-fat can support any number of backends without losing compatibility.) Their approach of splitting files into chunks that become separate blobs is necessary to operate on streams.

kynan commented 11 years ago

If I understand it correctly, bup doesn't use delta compression: it chunks files, computes checksums of those chunks, and then only writes a chunk to the new packfile if it's not already in the repository.
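A toy illustration of that write-if-absent idea, using fixed-size chunks for simplicity (real bup picks chunk boundaries with a rolling checksum, which is what lets it match blocks even after insertions; the store/ layout here is invented):

```sh
mkdir -p store
split -b 8192 big.bin chunk.                 # fixed-size chunking (toy only)
for c in chunk.*; do
  h=$(sha1sum "$c" | cut -d' ' -f1)          # content address of the chunk
  [ -e "store/$h" ] || cp "$c" "store/$h"    # write only if not already present
done
```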

Do you mean selective as in which files to transfer? I don't think that's supported. The unit is always an entire backup afaiu.

jedbrown commented 11 years ago

  1. Right, it sounds like they use a rolling checksum so that they can identify identical blocks even when the data shifts around.
  2. Yes, and I think transfer of selected files is essential. Removing fat data from Git's DAG is of limited utility if you can't access part of it without transferring all of it.