Closed simonw closed 3 years ago
Design:
git-convert database.db filename.json
This runs through the history of that file (assuming it's a JSON flat list already) and populates a single table:
item
Unless you use --name incident
or similar to change that name.
The table has a single row for every unique item spotted in every version of the file, with additional git_commit_at
and git_hash
columns for the timestamps and hashes from the git commit.
But.. if you add --id name_of_column
it behaves differently: it creates both item
and item_history
where item
contains the most recent version of each one (de-duped by primary key) and item_history
has additional history, with a numeric version
column.
TODO: decide how to handle conflicts between columns in the JSON and the columns that the tool adds.
First version just does the git-convert database.db filename.json
thing - next I'll implement the --id
option.
Idea: a CLI tool that can iterate through every version of a file captured using Git scraping and convert that into a SQLite database with tables that track data captured over time.
Inspiration includes https://github.com/simonw/pge-outages and https://github.com/adolph/getIncidentsGit/blob/main/incidents.json
It iterates through every version of a file ever captured
It runs a Python expression supplied on the command-line (like sqlite-utils convert) to convert each copy of the file into a JSON array of flat items - with a usable primary key
It records each row in a SQLite table.
It can either say "most recent record wins" and
.insert(..., replace=True)
the record, or it can do something smarter to track versions, maybe filing a new copy with an extra column with a version number in it and a column for the datetime of the commit that saw that version? Maybe even a most_recent column guaranteed to be 1 for only one of the records, so you can query by that? Or maintain one table that's just the most recent and a full history table with all of the versions - that way viewing the "most recent" table can include a foreign key relationship to each item which lets you see the full history.It tracks the commit that it got to, so when you run it again it can skip everything up to that point.