seeM / git-to-sqlite

Save data from Git repos to a SQLite database
Apache License 2.0

Initial investigations #1

Open seeM opened 1 year ago

seeM commented 1 year ago

I'm starting this project by considering whether we should extract a git-to-sqlite command-line tool from Codal, a project I've been working on that lets you chat with any GitHub repo.

My initial feeling is "yes" (that's why I created the repo!), but inspired by Simon Willison, I'm going to try to document as much of my thinking here as possible.

Some initial questions:

seeM commented 1 year ago

I'm finding Nick Farina's Git Is Simpler Than You Think useful to understand Git's own data model.

A git repository comprises the following objects: commits, trees, blobs, and tags.

Object database

Git stores data in a file-based object database at .git/objects/.

Each object is stored in a file named after its SHA-1 hash. Files are nested under a sub-directory named after the first two characters of the hash. For example, if the hash were 12345, the file would be stored at .git/objects/12/345. This avoids storing too many files in the same directory (perhaps there are/were file-system limits on the maximum number of files in a directory?).

File contents are zlib compressed.
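
A minimal Python sketch of the layout described above, assuming a loose (unpacked) object; objects that Git has moved into pack files under .git/objects/pack are not handled. The helper names `object_path` and `read_object` are made up for illustration, and the example builds its own throwaway object store rather than touching a real repo. Once decompressed, an object starts with a small header (`<type> <size>` followed by a NUL byte) before the actual contents.

```python
import tempfile
import zlib
from pathlib import Path


def object_path(git_dir: Path, sha: str) -> Path:
    """Loose objects live at objects/<first 2 hex chars>/<remaining chars>."""
    return git_dir / "objects" / sha[:2] / sha[2:]


def read_object(git_dir: Path, sha: str) -> tuple[str, bytes]:
    """Decompress a loose object and split its header from its body."""
    raw = zlib.decompress(object_path(git_dir, sha).read_bytes())
    header, _, body = raw.partition(b"\x00")
    kind, size = header.decode().split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body


# Build a throwaway object store holding one blob so the example is self-contained.
SHA = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"  # blob hash of b"hello world\n"
git_dir = Path(tempfile.mkdtemp()) / ".git"
path = object_path(git_dir, SHA)
path.parent.mkdir(parents=True)
path.write_bytes(zlib.compress(b"blob 12\x00hello world\n"))

print(path.relative_to(git_dir).as_posix())  # objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad
print(read_object(git_dir, SHA))             # ('blob', b'hello world\n')
```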

Commits

A commit has:

Trees

A tree has a set of references to blobs (files) and other trees (folders). For each blob/tree it has:

Blobs

A blob contains the contents of a file.

Branches

A branch is a file named after the branch (e.g. .git/refs/heads/master) whose contents are the hash of the commit at the tip of the branch.

The currently active branch/commit is stored in the file .git/HEAD.
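
As a rough sketch of how those two files fit together: .git/HEAD normally contains a line like `ref: refs/heads/master`, and following that path leads to the branch file holding the commit hash. When HEAD is detached, the file holds a commit hash directly. The function name `resolve_head` and the placeholder hash are assumptions for illustration, and the sketch ignores refs that Git has packed into .git/packed-refs.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Return the commit hash that HEAD currently points at."""
    head = (git_dir / "HEAD").read_text().strip()
    if head.startswith("ref: "):
        # Symbolic ref, e.g. "ref: refs/heads/master": follow it to the branch file.
        return (git_dir / head[len("ref: "):]).read_text().strip()
    # Detached HEAD: the file holds a commit hash directly.
    return head


# Fake repository with a single branch; the hash is a placeholder, not a real commit.
FAKE_COMMIT = "a" * 40
git_dir = Path(tempfile.mkdtemp()) / ".git"
(git_dir / "refs" / "heads").mkdir(parents=True)
(git_dir / "refs" / "heads" / "master").write_text(FAKE_COMMIT + "\n")
(git_dir / "HEAD").write_text("ref: refs/heads/master\n")

print(resolve_head(git_dir) == FAKE_COMMIT)  # True
```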

seeM commented 1 year ago

And here are some notes from Git Magic - Chapter 8. Secrets Revealed. I'll only include things that weren't explained by the previous article, or that I misunderstood.

Index

Git maintains an index of all files' size, creation time, and last modification time. Git uses this to determine whether a file has changed and thus whether it needs to be reread.
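
The idea can be illustrated with a small sketch: cache a cheap stat "signature" per file and only reread the file when the signature changes. This is not Git's actual index format (which is binary and records more fields); `stat_signature` and `needs_reread` are hypothetical names, and only size and modification time are compared here.

```python
import os
import tempfile
from pathlib import Path


def stat_signature(path: Path) -> tuple[int, int]:
    """Cheap fingerprint of a file: size in bytes and last-modified time (ns)."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)


def needs_reread(path: Path, cached: tuple[int, int]) -> bool:
    """True if the file's stat data no longer matches what was cached."""
    return stat_signature(path) != cached


workdir = Path(tempfile.mkdtemp())
readme = workdir / "README.md"
readme.write_text("hello\n")
cached = stat_signature(readme)

unchanged = needs_reread(readme, cached)  # stat data matches: no need to reread
readme.write_text("hello, world\n")       # size changes, so the signature differs
changed = needs_reread(readme, cached)
print(unchanged, changed)
```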

Object database

The hash is calculated using the object type, length in bytes, and its contents.

If you have two files with identical contents, git only stores one blob.
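
Both points can be checked with a few lines of Python. The hash is the SHA-1 of a header (`<type> <length>` plus a NUL byte) followed by the contents, so identical contents always map to the same key. The function name `git_object_hash` and the dict-based store are illustrative assumptions, not part of any tool.

```python
import hashlib


def git_object_hash(kind: str, content: bytes) -> str:
    """SHA-1 over '<type> <length>\\0' followed by the raw contents."""
    header = f"{kind} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


# Two files with identical contents collapse into a single stored blob.
files = {
    "a.txt": b"hello world\n",
    "copy-of-a.txt": b"hello world\n",
    "empty.txt": b"",
}
store = {}
for name, content in files.items():
    store[git_object_hash("blob", content)] = content

print(git_object_hash("blob", b"hello world\n"))  # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(len(files), "files ->", len(store), "blobs")  # 3 files -> 2 blobs
```

The first printed hash should match what `git hash-object` reports for a file with the same contents.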

Trees

Each entry in a tree is a tuple containing the file type (normal file, executable, symlink, or directory), filename, and hash.
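
For reference, here is a sketch of parsing those tuples out of a decompressed tree object's body (the part after the `tree <size>` header). Each entry is stored as `<mode> <name>`, a NUL byte, then the 20 raw bytes of the hash. The function name `parse_tree`, the `KINDS` mapping, and the second hash in the sample data are assumptions made up for the example; filenames are assumed to be UTF-8.

```python
# Octal mode strings as they appear in tree objects.
KINDS = {
    "100644": "file",
    "100755": "executable",
    "120000": "symlink",
    "40000": "directory",
}


def parse_tree(body: bytes) -> list[tuple[str, str, str]]:
    """Split a tree body into (kind, name, hex hash) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        sha = body[nul + 1:nul + 21].hex()  # hash is 20 raw bytes, not hex text
        entries.append((KINDS.get(mode, mode), name, sha))
        pos = nul + 21
    return entries


# Hand-built tree body with one file and one sub-directory (second hash is a placeholder).
blob_sha = bytes.fromhex("3b18e512dba79e4c8300dd08aeb37f8e728b8dad")
tree_sha = b"\x11" * 20
body = b"100644 hello.txt\x00" + blob_sha + b"40000 src\x00" + tree_sha

for entry in parse_tree(body):
    print(entry)
```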

seeM commented 1 year ago

Does every commit create a new tree object in git?

Not necessarily. Just as a blob may be reused (if we create two files with identical contents), a tree may also be reused if we manage to work our way back to the exact same file/folder structure and contents – since the hash of a tree is based on its own contents.


I just realized a huge downside to the schema I used in Codal. Here's a refresher of the relevant parts of that:

The downside is that we unnecessarily store data even when a file's contents don't change. We should probably add a blob object, and make document versions reference those:

The tree idea takes this a bit further. In Codal, we create a new document version even if the contents don't change. In Git, we don't create a new tree for a folder whose contents don't change.

I wonder how much storage trees save in practice?
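
To make the blob idea concrete, here is a rough sketch using Python's sqlite3 module. The table and column names (`blobs`, `document_versions`, `commit_hash`, `blob_hash`) are hypothetical and not taken from Codal's actual schema. Contents are keyed by their Git-style blob hash, so an unchanged file adds a new version row but no new content.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE blobs (
    hash    TEXT PRIMARY KEY,
    content BLOB NOT NULL
);
CREATE TABLE document_versions (
    commit_hash TEXT NOT NULL,
    path        TEXT NOT NULL,
    blob_hash   TEXT NOT NULL REFERENCES blobs(hash),
    PRIMARY KEY (commit_hash, path)
);
""")


def add_version(commit_hash: str, path: str, content: bytes) -> None:
    """Record a file at a commit, storing its contents only if they are new."""
    blob_hash = hashlib.sha1(b"blob %d\x00" % len(content) + content).hexdigest()
    conn.execute("INSERT OR IGNORE INTO blobs VALUES (?, ?)", (blob_hash, content))
    conn.execute(
        "INSERT INTO document_versions VALUES (?, ?, ?)",
        (commit_hash, path, blob_hash),
    )


# "c1" and "c2" are placeholder commit ids; README.md is unchanged between them.
add_version("c1", "README.md", b"# Title\n")
add_version("c2", "README.md", b"# Title\n")
add_version("c2", "main.py", b"print('hi')\n")

n_versions = conn.execute("SELECT COUNT(*) FROM document_versions").fetchone()[0]
n_blobs = conn.execute("SELECT COUNT(*) FROM blobs").fetchone()[0]
print(n_versions, "versions,", n_blobs, "blobs")  # 3 versions, 2 blobs
```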

seeM commented 1 year ago

If we store our data in SQLite following git's existing data model, is it possible to expose a more user-friendly interface via a SQL view?
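
One possible shape for this, as a sketch only: keep tables that mirror Git's objects and put a view on top that flattens each commit's tree into full file paths using a recursive CTE. All names here (`tree_entries`, `commit_files`, the short placeholder hashes) are assumptions for illustration, and a real schema would carry more columns than shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE commits (hash TEXT PRIMARY KEY, tree TEXT NOT NULL, message TEXT);
CREATE TABLE tree_entries (tree TEXT NOT NULL, mode TEXT NOT NULL,
                           name TEXT NOT NULL, hash TEXT NOT NULL);
CREATE TABLE blobs (hash TEXT PRIMARY KEY, content BLOB);

-- Flatten each commit's tree into one row per file, with its full path.
CREATE VIEW commit_files AS
WITH RECURSIVE walk(commit_hash, path, mode, hash) AS (
    SELECT c.hash, e.name, e.mode, e.hash
    FROM commits c JOIN tree_entries e ON e.tree = c.tree
    UNION ALL
    SELECT w.commit_hash, w.path || '/' || e.name, e.mode, e.hash
    FROM walk w JOIN tree_entries e ON e.tree = w.hash
    WHERE w.mode = '40000'          -- only descend into directories
)
SELECT commit_hash, path, hash AS blob_hash
FROM walk
WHERE mode <> '40000';
""")

# Placeholder data: commit c1 -> tree t1 {README.md, src/ -> tree t2 {main.py}}.
conn.execute("INSERT INTO commits VALUES ('c1', 't1', 'initial commit')")
conn.executemany("INSERT INTO tree_entries VALUES (?, ?, ?, ?)", [
    ("t1", "100644", "README.md", "b1"),
    ("t1", "40000", "src", "t2"),
    ("t2", "100644", "main.py", "b2"),
])
conn.executemany("INSERT INTO blobs VALUES (?, ?)",
                 [("b1", b"# Title\n"), ("b2", b"print('hi')\n")])

rows = conn.execute(
    "SELECT path, blob_hash FROM commit_files WHERE commit_hash = 'c1' ORDER BY path"
).fetchall()
print(rows)  # [('README.md', 'b1'), ('src/main.py', 'b2')]
```

A view like this keeps the stored data deduplicated while letting queries treat each commit as a simple list of paths.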

seeM commented 1 year ago

classDiagram
    class commits {
      +hash
      +tree
      +parent
      +author_name
      +author_email
      +authored_datetime
      +committer_name
      +committer_email
      +committed_datetime
      +message
    }
    class trees {
      +hash

    }
    class blobs {
      +hash
    }