jedbrown opened 11 years ago
Without upstream changes in git, this is likely to be a performance hit because making two passes over the data is unavoidable. Two possible resolutions:
1. `clean` filter. Teach `git hash-object --stdin` to recognize an additional argument containing the uncompressed size. This will change the SHA1 in the git-fat cleaned stub from a hash of the object to a hash of `blob SIZE\0...data...`, thus requiring some migration for existing users. We could use the SHA1 of the plain object at the expense of putting a SHA1 filter before passing the data to `git hash-object --stdin`, but taking two SHA1s of the data seems excessive.
2. `raw` object type so that a raw zlib-compressed file can be stored under its SHA1. The size could be provided externally during decompression (because we have it). This approach seems unlikely in the absence of a different use case relevant to git upstream.

The `jc/hidden-refs` branch seems to have the support we'd like for suppressing advertisement of hidden refs. Setting a server-side configuration variable `uploadpack.hiddenrefs=refs/fat` will hide those refs from the server's advertisement (so that `git ls-remote` is always small).
http://thread.gmane.org/gmane.comp.version-control.git/215054/focus=215897
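For reference, the stub hash described in option 1 (a hash of `blob SIZE\0...data...`) is exactly how git already names blob objects. A minimal sketch reproducing it:

```python
import hashlib

def git_blob_sha1(data: bytes) -> str:
    """Compute the ID that `git hash-object --stdin` would print:
    SHA1 over the header `blob <size>\\0` followed by the raw content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

print(git_blob_sha1(b"hello\n"))  # matches `echo hello | git hash-object --stdin`
```

This is why option 1 forces a stub migration: today's stubs carry the SHA1 of the bare content, not of the header-prefixed object.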
How about directly writing git packfiles instead of commits? This might avoid having to pack/repack into commits.
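For a sense of what "directly writing packfiles" involves, here is a sketch of the version-2 pack format for undeltified blobs only (`write_pack` is a hypothetical helper, not git-fat or bup code): a 12-byte header, then each object as a type/size varint followed by zlib-compressed content, then a SHA1 trailer.

```python
import hashlib
import struct
import zlib

def write_pack(blobs):
    """Build a version-2 git packfile holding the given blobs (no deltas)."""
    out = b"PACK" + struct.pack(">II", 2, len(blobs))  # magic, version, count
    for data in blobs:
        size = len(data)
        byte = (3 << 4) | (size & 0x0F)  # type 3 = blob, low 4 bits of size
        size >>= 4
        header = bytearray()
        while size:                      # remaining size bits, base-128
            header.append(byte | 0x80)   # continuation bit
            byte = size & 0x7F
            size >>= 7
        header.append(byte)
        out += bytes(header) + zlib.compress(data)
    return out + hashlib.sha1(out).digest()  # trailing pack checksum
```

The catch is that a pack written this way still wants `git index-pack` and a repack before it deltifies, so it may not save the second pass the issue is worried about.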
You might want to have a look at bup: they directly write packfiles and also split large files using rolling checksums. I know you want to avoid external dependencies, but maybe you could borrow some ideas.
Disclaimer: I don't know git-fat or bup in detail, so this might not make any sense.
Hmm, I don't see writing separate files as such a big deal because we need to repack to use delta compression anyway. Also, I think the usual case is to have larger files rather than a zillion tiny files (which is where packfiles shine).
Does `bup` support selective transfer? It might be interesting to support `bup` as an alternative backend. (Note that git-fat can support any number of backends without losing compatibility.) Their approach of splitting files into chunks that become separate blobs is necessary to operate on streams.
If I understand it correctly, `bup` doesn't use delta compression: it chunks files, computes checksums of those chunks, and then only writes a chunk to the new packfile if it's not already in the repository.

Do you mean selective as in which files to transfer? I don't think that's supported. The unit is always an entire backup, afaiu.
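The chunk-and-dedup scheme described above can be sketched in a few lines. This is a toy additive rolling checksum, not bup's actual hash; `rolling_chunks` and `store` are hypothetical names for illustration:

```python
import hashlib

def rolling_chunks(data, mask=0x3FF, window=16, min_size=64):
    """Content-defined chunking: cut whenever the low bits of a rolling
    window sum hit zero, so boundaries depend on content, not offsets
    (the idea bup borrows from rsync)."""
    chunks, start, s = [], 0, 0
    for i, b in enumerate(data):
        s += b
        if i >= window:
            s -= data[i - window]  # slide the window
        if i - start + 1 >= min_size and (s & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def store(chunks, repo):
    """Write only chunks not already present, keyed by SHA1 (dedup)."""
    new = 0
    for c in chunks:
        key = hashlib.sha1(c).hexdigest()
        if key not in repo:
            repo[key] = c
            new += 1
    return new
```

Because boundaries are content-defined, inserting bytes near the front of a file only perturbs nearby chunks; the rest keep their hashes and are never rewritten.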
Instead of writing our own objects, we can use git to store the objects independently, tagged by SHA1 of the content (using a lightweight tag). This is an eventual scalability problem because git does not support large numbers of refs well -- it ends up slowing everything down. Further discussion here:
http://thread.gmane.org/gmane.comp.version-control.git/182158
Instead, the objects could be packed up into commits, but due to the arbitrary subset problem, talking to remotes would not be as simple as push/pull of some refs. Rather, we'll probably need to create a new commit to pack up exactly what is requested.
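The one-ref-per-object scheme mentioned above can be sketched with plain git plumbing (illustrative only; the `fat/` ref namespace here is a made-up convention, not git-fat's actual layout):

```shell
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q

echo "pretend this is huge" > bigfile

# Store the file as a loose blob; the SHA1 is the content address.
sha=$(git hash-object -w -- bigfile)

# A lightweight tag is just a ref pointing straight at the blob.
git tag "fat/$sha" "$sha"

# Fetching one object is then selective, e.g.
#   git fetch origin "refs/tags/fat/$sha"
# ...but thousands of such refs bloat every ref advertisement,
# which is the scalability problem discussed above.
```

This makes the trade-off concrete: per-object refs give trivially selective transfer, at the cost of a ref count that grows with the number of stored objects.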