simonw / git-history

Tools for analyzing Git history using SQLite
Apache License 2.0
191 stars 18 forks source link

Initial tool design #1

Closed simonw closed 3 years ago

simonw commented 3 years ago

Idea: a CLI tool that can iterate through every version of a file captured using Git scraping and convert that into a SQLite database with tables that track data captured over time.

Inspiration includes https://github.com/simonw/pge-outages and https://github.com/adolph/getIncidentsGit/blob/main/incidents.json

It iterates through every version of a file ever captured

It runs a Python expression supplied on the command-line (like sqlite-utils convert) to convert each copy of the file into a JSON array of flat items - with a usable primary key

It records each row in a SQLite table.

It can either say "most recent record wins" and .insert(..., replace=True) the record, or it can do something smarter to track versions, maybe filing a new copy with an extra column with a version number in it and a column for the datetime of the commit that saw that version? Maybe even a most_recent column guaranteed to be 1 for only one of the records, so you can query by that? Or maintain one table that's just the most recent and a full history table with all of the versions - that way viewing the "most recent" table can include a foreign key relationship to each item which lets you see the full history.

It tracks the commit that it got to, so when you run it again it can skip everything up to that point.

simonw commented 3 years ago

Design:

git-convert database.db filename.json

This runs through the history of that file (assuming it's a JSON flat list already) and populates a single table:

item

Unless you use --name incident or similar to change that name.

The table has a single row for every unique item spotted in every version of the file, with additional git_commit_at and git_hash columns for the timestamps and hashes from the git commit.

But.. if you add --id name_of_column it behaves differently: it creates both item and item_history where item contains the most recent version of each one (de-duped by primary key) and item_history has additional history, with a numeric version column.

TODO: decide how to handle conflicts between columns in the JSON and the columns that the tool adds.

simonw commented 3 years ago

First version just does the git-convert database.db filename.json thing - next I'll implement the --id option.