Git Objects: commits within trees.

jbenet commented 10 years ago

(This is a realization I made while designing IPFS. I'm explaining it here because I'll need to refer to it. Excuse the meandering between too little and too much exposition; it's not clear to me what I should and shouldn't assume readers know.)

Traditionally, Git commits point to trees (and other commits). Trees don't point to commits. This is a useful design as it keeps the commit graph somewhat separate from the file objects. Walking/manipulating the commit DAG is simple.

This design choice becomes problematic when handling submodules, or the repositories themselves. Using submodules is notoriously annoying, because the model makes assumptions on the workflow (submodules are other things from some other space). Repos themselves aren't tracked within git. ("what!?" you say. "What does that even mean!?")

Repository objects

Imagine the Repository as a first-class git object, something like this:

head a906cb2a4a904a152e80877d4088654daad0c859 master
head 8f94139338f9404f26296befa88755fc2598c289 dev
tag 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 v1.2.0
tag 495d50ddb02dfb149c808cf5491f9ba696285a92 v1.1.0
tag 31ee3269618939b37049251d79e9edd52c67d051 v1.0.0

It maps ref names to commit hashes: a collection of entries, each with a format like:

<ref type> <commit hash> <name>

This is really just a more complicated tree object. Example tree:

100644 blob a906cb2a4a904a152e80877d4088654daad0c859      README
100644 blob 8f94139338f9404f26296befa88755fc2598c289      Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib

The tree is a collection of entries, with an entry format like:

<unix permissions> <obj type> <obj hash> <obj name>

(From now on, ignore the unix perms.) These are the same! So a repo object is really just a privileged tree object that gets to point to commits. That seems to me like a bad lack of generality. What if all trees could point to commits?
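To make the "these are the same" point concrete, here is a minimal sketch that reads both kinds of line into one entry shape. It parses the textual listings shown above, not git's on-disk tree format, and `Entry` and `parse_entry` are names made up for this illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    kind: str    # "blob", "tree", "head", "tag", ...
    target: str  # hex hash of the object pointed to
    name: str    # file name or ref name


def parse_entry(line: str) -> Entry:
    """Parse a tree line or a repo-object line into the same Entry shape."""
    line = line.strip()
    fields = line.split(None, 3)
    if len(fields) == 4 and fields[0].isdigit():
        # Tree line: drop the leading unix mode, as the text suggests.
        fields = fields[1:]
    else:
        # Repo-object line: <ref type> <commit hash> <name>
        fields = line.split(None, 2)
    kind, target, name = fields
    return Entry(kind, target, name)


tree_entry = parse_entry("040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0      lib")
repo_entry = parse_entry("tag 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0 v1.2.0")
```

Once the mode is dropped, the only difference between the two is which values `kind` may take.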

You'd unlock a host of new workflow patterns within the object commit graph, and make submodules first-class things:

- Submodules are just repos plus a pointer[1] to their origin, for updates.
- Each file can be versioned independently (not heresy; there are use cases for this).
- Repos themselves can be tracked. When you manipulate a repo (update the index, etc.), those changes can be versioned too (reflog).
- The reflog becomes just a meta commit graph.

It's ~~turtles~~ commits all the way down.

Takeaway: let trees point to commits.
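A toy sketch of what that could look like, assuming a single in-memory object store keyed by a SHA-1 over a JSON encoding. None of this is git's real object format; `store`, `put`, and `reachable` are hypothetical names for the illustration.

```python
import hashlib
import json

store = {}  # hash -> (type, payload); a toy stand-in for one shared object store


def put(obj_type, payload):
    """Store an object under the hash of its (toy) serialized form."""
    data = json.dumps([obj_type, payload], sort_keys=True).encode()
    key = hashlib.sha1(data).hexdigest()
    store[key] = (obj_type, payload)
    return key


def reachable(key, seen=None):
    """Collect every object reachable from key, following trees AND commits."""
    seen = set() if seen is None else seen
    if key in seen:
        return seen
    seen.add(key)
    obj_type, payload = store[key]
    if obj_type == "tree":
        for _kind, target, _name in payload:
            reachable(target, seen)  # an entry may be a blob, a tree, or a commit
    elif obj_type == "commit":
        reachable(payload["tree"], seen)
        for parent in payload["parents"]:
            reachable(parent, seen)
    return seen


readme = put("blob", "hello")
lib_tree = put("tree", [["blob", readme, "README"]])
lib_commit = put("commit", {"tree": lib_tree, "parents": [], "msg": "lib v1"})
# An ordinary tree whose entry points at a commit: the submodule case.
root_tree = put("tree", [["commit", lib_commit, "lib"]])
root_commit = put("commit", {"tree": root_tree, "parents": [], "msg": "init"})
# A "repository object" is then just another tree, with ref names as entry names.
repo = put("tree", [["commit", root_commit, "master"]])
```

Under these assumptions the repo, the submodule, and the files all live in one graph, and a single walk from `repo` reaches everything.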

[1] I say pointer here, and not URL, because this can be a hash. Why is it a URL today? Because git is built to operate within a single machine, with a blobstore completely separate from all other git blobstores. If you change that -- if you make one blobstore to store them all -- then the URL can be a *hash* pointing to another object in the blobstore. (Italics on hash because it's not a hash of the value of the object, i.e. not content-addressed. It has to be a symlink, a mutable object.)

grawity commented 9 years ago

Wait wait wait, aren't Git submodules exactly that – tree entries pointing to commits?

rain ~/src/gnome/shell master
$ git ls-tree @:src
...
100644 blob 03709d6051ea5affd0381ac86db40c28849ada2b    gtkmenutrackeritem.h
160000 commit e14dbe8aa6dfaeea4a9f3405cf2f3e238e88623b  gvc
040000 tree 6c316fb521870a39d902aceebf2c4c3e0982f77d    hotplug-sniffer
100644 blob 4070482e18c9a4760cf33230036e9c3268fc5498    main.c
...

jbenet commented 9 years ago

@grawity almost!! try:

git cat-file -p e14dbe8aa6dfaeea4a9f3405cf2f3e238e88623b

jbenet commented 9 years ago

So, what's going on? The commit's not actually an object in the repo's object graph. It's somewhere else.

The submodule is stored as a commit, yes, but it doesn't quite work, because a submodule -- as git stands today -- is more than a commit. It's a commit hash + another repository (meaning, a repository URL). This is why we need the extra .gitmodules file, etc.

The submodule repo information is not stored as part of the commit because storing addresses (the repo URL) in git objects doesn't make sense -- repos and locations are related but not the same thing (as it should be), because the repo changes location. Also note that the submodule's object repository lives within the .git of the submodule. It's a hack, because this is one of the things that SVN's model had sort of annoyingly right (right conceptually, but incredibly annoying because it littered .svn everywhere). The result is that submodules are this weird halfway thing.

(If you've seen IPFS, you'll see where I'm going with all this: namely, merging all repos into one.)

grawity commented 9 years ago

Yes, Git makes an exception for submodules in that they still have independent object stores (and everything else), so git fsck remains silent about a missing commit object.

(Also, recent SVN versions only have .svn in the root directory of a checkout.)
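For anyone following along, a small sketch that picks the submodule entries out of `git ls-tree` text like the listing earlier in the thread. It relies on git using mode 160000 for these commit entries; the helper name `gitlinks` is made up here, and the sample text is copied from that listing.

```python
GITLINK_MODE = "160000"  # mode git gives to submodule (commit) entries in a tree


def gitlinks(ls_tree_output):
    """Return (commit_hash, path) for each submodule entry in `git ls-tree` text."""
    found = []
    for line in ls_tree_output.splitlines():
        fields = line.split(None, 3)
        if len(fields) != 4:
            continue  # skip elided or blank lines such as "..."
        mode, obj_type, obj_hash, path = fields
        if mode == GITLINK_MODE and obj_type == "commit":
            found.append((obj_hash, path))
    return found


listing = """\
...
100644 blob 03709d6051ea5affd0381ac86db40c28849ada2b    gtkmenutrackeritem.h
160000 commit e14dbe8aa6dfaeea4a9f3405cf2f3e238e88623b  gvc
040000 tree 6c316fb521870a39d902aceebf2c4c3e0982f77d    hotplug-sniffer
100644 blob 4070482e18c9a4760cf33230036e9c3268fc5498    main.c
...
"""
submodules = gitlinks(listing)
```

The hashes this returns are the ones that may be absent from the parent repo's own object store, which is the exception being discussed.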
