Initial investigations - Githubissues

seeM commented 1 year ago

I'm starting this project by considering whether we should extract a git-to-sqlite command-line tool from Codal – a project I've been working on that lets you chat with any GitHub repository.

My initial feeling is "yes" (that's why I created the repo!) but I was inspired by Simon Willison and am going to try to document as much of my thinking here as possible.

Some initial questions:

What are our constraints?
- We may want to store files efficiently (e.g. via diffs) but this may also not be necessary. Perhaps we could do some early experiments to determine that.
Do we need this or can we use something that already exists?
- github-to-sqlite is similar, but IIUC doesn't extract the files and their contents.
What should the schema look like?

seeM commented 1 year ago

I'm finding Nick Farina's Git Is Simpler Than You Think useful to understand Git's own data model.

A git repository comprises the following objects: commits, trees, blobs, and tags.

Object database

Git stores data in a file-based object database at .git/objects/.

Each object is stored in a file named after its sha1 hash. The first two characters of the hash name a sub-directory, and the remaining characters name the file inside it. For example, if the hash were 12345, the file would be stored at .git/objects/12/345. This is to avoid storing too many files in the same directory (perhaps there are/were file-system limits on the maximum number of files in a directory?).

File contents are zlib compressed.
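
To make the layout concrete, here is a minimal Python sketch of locating and reading one object. The function names are made up for illustration. It assumes the object is stored "loose" (one file per object, not inside a packfile) and that the decompressed bytes begin with a `<type> <size>` header terminated by a NUL byte, which is my understanding of the on-disk format.

```python
import zlib
from pathlib import Path


def loose_object_path(git_dir: Path, sha: str) -> Path:
    """Return the path where a loose object with this hash would live."""
    # The first two hex characters name the directory, the rest name the file.
    return git_dir / "objects" / sha[:2] / sha[2:]


def read_loose_object(git_dir: Path, sha: str) -> tuple[str, bytes]:
    """Decompress a loose object and split its header from its body."""
    raw = zlib.decompress(loose_object_path(git_dir, sha).read_bytes())
    header, _, body = raw.partition(b"\x00")  # header looks like b"blob 12"
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError(f"object {sha}: header says {size!r} bytes, got {len(body)}")
    return obj_type.decode(), body
```

Calling `read_loose_object(Path(".git"), some_hash)` inside a repository should return a pair such as `("blob", b"...")`. Objects that Git has already packed will not be found this way.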

Commits

A commit has:

a reference (sha1 hash) to a tree
references (sha1 hashes) to its parent commits: none for the first commit in a repository, one for an ordinary commit, and two or more for a merge
the author's name, email address, and authored datetime
the committer's name, email address, and committed datetime
a commit message
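
As a rough illustration of how those fields sit inside a commit object, here is a small parser sketch. It assumes the object body has already been decompressed and had its header stripped, and that it consists of `key value` header lines, a blank line, then the message. The `Commit` class and both function names are hypothetical, and extra headers such as signatures are simply skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    tree: str = ""
    parents: list[str] = field(default_factory=list)
    author: str = ""
    committer: str = ""
    message: str = ""


def parse_commit(body: bytes) -> Commit:
    """Parse the body of a commit object (object header already removed)."""
    text = body.decode("utf-8", errors="replace")
    headers, _, message = text.partition("\n\n")
    commit = Commit(message=message)
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit.tree = value
        elif key == "parent":
            commit.parents.append(value)  # merges have several parent lines
        elif key == "author":
            commit.author = value
        elif key == "committer":
            commit.committer = value
        # Any other header line is ignored in this sketch.
    return commit


def split_identity(value: str) -> tuple[str, str, int, str]:
    """Split 'Name <email> 1700000000 +0100' into name, email, timestamp, tz."""
    name, _, rest = value.partition(" <")
    email, _, when = rest.partition("> ")
    timestamp, _, tz = when.partition(" ")
    return name, email, int(timestamp), tz
```

Keeping `parents` as a list, not a single value, is a deliberate choice here: it lets the same structure describe root commits and merges.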

Trees

A tree has a set of references to blobs (files) and other trees (folders). For each blob/tree it has:

its type (blob/tree)
its sha1 hash
its name

Blobs

A blob contains the contents of a file.

Branches

A branch is a file named after the branch (e.g. .git/refs/heads/master) whose contents are the hash of the commit at the tip of the branch.

The currently active branch/commit is stored in the file .git/HEAD.
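
A small sketch of following those two files to find the current commit. The function name is hypothetical. It assumes .git/HEAD either contains `ref: <path>` pointing at a branch file or, for a detached HEAD, a commit hash directly. It does not handle branches that Git has moved into .git/packed-refs.

```python
from pathlib import Path
from typing import Optional


def resolve_head(git_dir: Path) -> tuple[Optional[str], str]:
    """Return (branch name or None, commit hash) for the current HEAD."""
    head = (git_dir / "HEAD").read_text().strip()
    prefix = "ref: "
    if head.startswith(prefix):
        ref = head[len(prefix):]  # e.g. "refs/heads/master"
        commit = (git_dir / ref).read_text().strip()
        heads = "refs/heads/"
        branch = ref[len(heads):] if ref.startswith(heads) else ref
        return branch, commit
    # Detached HEAD: the file holds a commit hash directly.
    return None, head
```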

seeM commented 1 year ago

And here are some notes from Git Magic - Chapter 8. Secrets Revealed. I'll only include things that weren't explained by the previous article, or that I misunderstood.

Index

Git maintains an index that records, for every tracked file, its size, creation time, and last modification time. Git compares these against the file's current stats to determine whether the file has changed and thus whether it needs to be reread.

Object database

The hash is calculated using the object type, length in bytes, and its contents.

If you have two files with identical contents, git only stores one blob.
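
The hashing rule can be sketched in a few lines of Python. The function names are made up for illustration. The sketch assumes the hashed bytes are the object type, a space, the content length in decimal, a NUL byte, and then the content.

```python
import hashlib


def git_object_hash(obj_type: str, content: bytes) -> str:
    """Hash an object the way Git does: header plus content, through sha1."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def blob_hash(content: bytes) -> str:
    return git_object_hash("blob", content)


# Identical contents hash identically, whatever the file is called,
# which is why only one blob needs to be stored.
assert blob_hash(b"same text\n") == blob_hash(b"same text\n")

# The empty blob has a well-known hash.
assert blob_hash(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Because the file name is not part of the hashed bytes, two paths with the same contents point at the same blob.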

Trees

Each tuple contains the file type (normal files, executables, symlinks, directories), filename, and hash.
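
Here is a hedged sketch of reading those tuples out of a tree object. It assumes the decompressed tree body is a sequence of entries, each laid out as an octal mode in ASCII, a space, the name, a NUL byte, and then the 20 raw bytes of the sha1 hash. The function names and the labels returned by `entry_kind` are my own.

```python
def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Parse a tree body into (mode, name, hex hash) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode("utf-8", errors="replace")
        sha = body[nul + 1:nul + 21].hex()  # 20 raw bytes, not hex text
        entries.append((mode, name, sha))
        pos = nul + 21
    return entries


def entry_kind(mode: str) -> str:
    """Map a tree entry's mode to a coarse file type."""
    return {
        "40000": "directory",
        "100644": "file",
        "100755": "executable",
        "120000": "symlink",
        "160000": "submodule",
    }.get(mode, "unknown")
```

Note that the entry stores a mode, not an explicit blob/tree type. Whether an entry points at a blob or another tree is implied by the mode.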

seeM commented 1 year ago

Does every commit create a new tree object in git?

Not necessarily. Just as a blob may be reused (if we create two files with identical contents), a tree may also be reused if we manage to work our way back to the exact same file/folder structure and contents – since the hash of a tree is based on its own contents.

I just realized a huge downside to the schema I used in Codal. Here's a refresher of the relevant parts of that:

A document represents a path
A document version represents the contents of a document at a given commit
A commit is unaware of either of the above

The downside is that we unnecessarily store data even when a file's contents don't change. We should probably add a blob object, and make document versions reference those:

A blob represents file contents (independent of the commit/version)
A document version tells us the contents of a file (its blob) at a given path (its document) and at a given commit
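
A quick SQLite sketch of that revised shape, to check that it deduplicates contents. The table and column names are guesses at what this could look like, not Codal's actual schema, and the content hash here is a plain sha1 of the bytes, not Git's header-prefixed hash.

```python
import hashlib
import sqlite3

SCHEMA = """
CREATE TABLE commits (hash TEXT PRIMARY KEY);
CREATE TABLE documents (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL);
CREATE TABLE blobs (hash TEXT PRIMARY KEY, content BLOB NOT NULL);
CREATE TABLE document_versions (
    document_id INTEGER NOT NULL REFERENCES documents(id),
    commit_hash TEXT NOT NULL REFERENCES commits(hash),
    blob_hash   TEXT NOT NULL REFERENCES blobs(hash),
    PRIMARY KEY (document_id, commit_hash)
);
"""


def record_version(db: sqlite3.Connection, commit_hash: str, path: str, content: bytes) -> None:
    """Store one file at one commit, reusing the blob row if it already exists."""
    blob_hash = hashlib.sha1(content).hexdigest()
    db.execute("INSERT OR IGNORE INTO commits (hash) VALUES (?)", (commit_hash,))
    db.execute("INSERT OR IGNORE INTO documents (path) VALUES (?)", (path,))
    db.execute("INSERT OR IGNORE INTO blobs (hash, content) VALUES (?, ?)", (blob_hash, content))
    (document_id,) = db.execute("SELECT id FROM documents WHERE path = ?", (path,)).fetchone()
    db.execute(
        "INSERT INTO document_versions (document_id, commit_hash, blob_hash) VALUES (?, ?, ?)",
        (document_id, commit_hash, blob_hash),
    )


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
record_version(db, "c1", "README.md", b"unchanged")
record_version(db, "c2", "README.md", b"unchanged")
```

After those two calls there are two rows in document_versions but only one row in blobs, so the unchanged contents are stored once.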

The tree idea takes this a bit further. In Codal, we create a new document version even if the contents don't change. In Git, we don't create a new tree for a folder whose contents don't change.

I wonder how much storage trees save in practice?

seeM commented 1 year ago

If we store our data in SQLite following git's existing data model, is it possible to expose a more user-friendly interface via a SQL view?
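
One way this might work, sketched with Python's sqlite3 module: keep tables shaped like git's objects, and let a view with a recursive query walk each commit's tree down to its files. The tree_entries table, the files view, and all column names are assumptions made for this example, not a settled schema.

```python
import sqlite3

SCHEMA = """
CREATE TABLE commits (hash TEXT PRIMARY KEY, tree TEXT NOT NULL, message TEXT);
CREATE TABLE blobs (hash TEXT PRIMARY KEY, content BLOB);
-- Hypothetical table: one row per entry inside a tree.
CREATE TABLE tree_entries (
    tree TEXT NOT NULL,  -- hash of the containing tree
    name TEXT NOT NULL,
    type TEXT NOT NULL,  -- 'blob' or 'tree'
    hash TEXT NOT NULL   -- hash of the referenced blob or tree
);

-- User-facing view: every file path and its contents at every commit.
CREATE VIEW files AS
WITH RECURSIVE walk(commit_hash, type, hash, path) AS (
    SELECT c.hash, e.type, e.hash, e.name
    FROM commits c JOIN tree_entries e ON e.tree = c.tree
    UNION ALL
    SELECT w.commit_hash, e.type, e.hash, w.path || '/' || e.name
    FROM walk w JOIN tree_entries e ON e.tree = w.hash
    WHERE w.type = 'tree'
)
SELECT w.commit_hash, w.path, b.content
FROM walk w JOIN blobs b ON b.hash = w.hash
WHERE w.type = 'blob';
"""


def load_example(db: sqlite3.Connection) -> None:
    """Insert one commit whose tree holds README.md and src/main.py."""
    db.executescript(SCHEMA)
    db.execute("INSERT INTO commits VALUES ('c1', 't1', 'first')")
    db.executemany("INSERT INTO blobs VALUES (?, ?)", [("b1", b"readme"), ("b2", b"print")])
    db.executemany(
        "INSERT INTO tree_entries VALUES (?, ?, ?, ?)",
        [
            ("t1", "README.md", "blob", "b1"),
            ("t1", "src", "tree", "t2"),
            ("t2", "main.py", "blob", "b2"),
        ],
    )
```

With the example loaded, `SELECT path, content FROM files WHERE commit_hash = 'c1'` lists README.md and src/main.py with their contents, without the caller needing to know about trees at all. How well this performs on a large repository is an open question.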

seeM commented 1 year ago

classDiagram
    class commits {
      +hash
      +tree
      +parent
      +author_name
      +author_email
      +authored_datetime
      +committer_name
      +committer_email
      +committed_datetime
      +message
    }
    class trees {
      +hash
    }
    class blobs {
      +hash
    }

seeM / git-to-sqlite

Initial investigations #1
