This could also help shrink the size of the overall database (see also #12).
One catch: does null mean "this value did not change" or does it mean "this value was set to null"? If a version row shows null for ConditionStatement, for example, there's no way to tell whether the statement was cleared or simply left unchanged.
I'm going to see if I can address that using a bitmap - an additional column, probably called _changed (though I'm worried that might be confused for a timestamp), which stores an integer representing which columns changed since the previous item.
A custom Datasette plugin could read that bitmap and use it to show extra information when the table is displayed.
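As a rough sketch of that bitmap idea (the column list and helper names here are invented for illustration, not anything git-history ships): each column gets a fixed bit position, and _changed stores the OR of the bits for whichever columns changed.

# Hypothetical helpers illustrating the bitmap encoding - the column order
# would need to be stored somewhere stable for decoding to work.
COLUMNS = ["AcresBurned", "PercentContained", "ConditionStatement", "Updated"]

def encode_changed(changed_columns):
    # {"AcresBurned", "Updated"} -> 0b1001 == 9
    return sum(1 << COLUMNS.index(name) for name in changed_columns)

def decode_changed(bitmap):
    # 9 -> ["AcresBurned", "Updated"]
    return [name for i, name in enumerate(COLUMNS) if bitmap & (1 << i)]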
I started my prototype by writing a _changed column that contained a JSON list of columns, rather than a bitmap: ["AcresBurned", "PercentContained", "ConditionStatement", "Updated", "PersonnelInvolved", "CrewsInvolved", "Helicopters", "Engines", "Dozers"]
I really like this, but it's very verbose and consumes a lot of table space.
So... how about turning that into a lookup table? So the JSON value becomes a key in another table?
This would save on space, and display reasonably well. Running a query for "all versions that had an updated Dozers column" would be possible but would involve a join against that table plus some JSON shenanigans, which wouldn't be particularly simple for users.
Probably simpler than dealing with a bitmap column though!
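To make the "JSON shenanigans" concrete, the Dozers query might look something like this - a sketch only, assuming a changed_lists lookup table with a columns JSON text column and an item_versions._changed column pointing at it (none of these names are final, and the database filename is illustrative):

import sqlite_utils

db = sqlite_utils.Database("ca-fires.db")
rows = db.execute(
    """
    select item_versions.*
    from item_versions
    join changed_lists on item_versions._changed = changed_lists.id
    join json_each(changed_lists.columns) as changed_column
    where changed_column.value = :name
    """,
    {"name": "Dozers"},
).fetchall()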
Alternative formats for the _changed column:

- table.lookup() in sqlite-utils (sketched below)
- a ,3,5,7, style string, which allows for where _changes like '%,3,%' queries - could even add a trigram index?
- a versions_column_changed table

Maybe the many-to-many relationship would be the best way to handle this? I'm slightly put off because Datasette doesn't have facet by many-to-many yet - I mostly implemented that a few years ago but never finished it, see https://github.com/simonw/datasette/issues/551
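A rough sketch of that table.lookup() option - the changed_lists table name and the column list here are illustrative:

import json
import sqlite_utils

db = sqlite_utils.Database("ca-fires.db")

# lookup() creates the row (with a unique index on the lookup column) if it
# doesn't exist yet, and returns its integer primary key either way - so the
# verbose JSON list is stored once and each version just stores an integer.
changed_id = db["changed_lists"].lookup(
    {"columns": json.dumps(["AcresBurned", "PercentContained", "Updated"])}
)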
After sweating over this decision for a full week, I'm going to try the m2m table and see how well it works.
I may try turning it back into a JSON list for one of the views.
That commit is my initial prototype of a _changed column which stores a full JSON list of column names.
I'm going to try using m2m instead next.
Since item_versions currently uses a compound primary key I'm ending up with an m2m table that relates against a compound foreign key, which doesn't work great with Datasette yet. Code looks something like this:
db["item_versions"].insert(
item_version,
pk=("_item", "_version"),
alter=True,
replace=True,
column_order=("_item", "_version", "_commit"),
foreign_keys=(
("_item", "items", "_id"),
("_commit", "commits", "id"),
),
)
if changed_columns:
# Record which columns changed in the changed m2m table
for column in changed_columns:
db["changed"].insert({
"_item": item_id,
"_version": version,
"column": db["columns"].lookup({"name": column})
}, pk=("_item", "_version", "column"), foreign_keys=(
("column", "columns", "id"),
("_item_id", "items", "_id"),
))
I think I'll switch item_versions to having a regular integer primary key instead to make the m2m table more predictable.
The changed table looks like this:
CREATE TABLE [changed] (
[item_version] INTEGER REFERENCES [item_versions]([id]),
[column] INTEGER REFERENCES [columns]([id]),
PRIMARY KEY ([item_version], [column])
);
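For comparison with the JSON-list approach, the "which versions changed Dozers" query against this layout is just a pair of joins - a sketch using the table names in the schema above:

import sqlite_utils

db = sqlite_utils.Database("ca-fires-m2m-2.db")
rows = db.execute(
    """
    select item_versions.*
    from changed
    join columns on changed.column = columns.id
    join item_versions on changed.item_version = item_versions.id
    where columns.name = :name
    """,
    {"name": "Dozers"},
).fetchall()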
This makes it difficult to facet by item in the Datasette interface because you have to first go through item_version - I can fix that by adding a changed_details view or similar that makes that join for you.
% sqlite-utils create-view ca-fires-m2m-2.db changed_details '
select
item_versions.id as _item_version_pk,
items._id as _item_pk,
item_versions._version,
columns.name
from
changed
join item_versions on changed.item_version = item_versions.id
join columns on changed.column = columns.id
join items on item_versions._item = items._id'
That prototype worked, but highlighted that without foreign key links on views this interface isn't as useful as I'd like it to be:
I'm tempted to address that problem using a Datasette render_cell() plugin, which looks out for a view with that name and those column names and turns the values into hyperlinks to the right place.
Maybe a datasette-git-history-viewer plugin? Or... datasette-git-history, which depends on git-history and also adds a datasette git-history CLI command using the new register_commands hook.
Building a Datasette plugin that imports an existing Click CLI tool and re-registers it is proving hard - Click doesn't really want you to do that. I tried this:
from datasette import hookimpl
from git_history.cli import file as git_history_file


@hookimpl
def register_commands(cli):
    cli.command(name="git-history")(git_history_file.callback)
But when I run this:
% datasette git-history --help
Usage: datasette git-history [OPTIONS]

  Analyze the history of a specific file and write it to SQLite

Options:
  --help  Show this message and exit.
The options are all missing - which means that the command doesn't actually work. Will need to research this pattern separately.
Figured out a pattern that works: https://til.simonwillison.net/datasette/reuse-click-for-register-commands
Here's the full plugin, adding hyperlinks to that view:
from datasette import hookimpl
import markupsafe

from git_history.cli import cli as git_history_cli


@hookimpl
def register_commands(cli):
    cli.add_command(git_history_cli, name="git-history")


foreign_keys = {
    # table, column => other-table
    ("changed_details", "_item_version_pk"): "item_versions",
    ("changed_details", "_item_pk"): "items",
}


@hookimpl
def render_cell(value, column, table, database, datasette):
    if (table, column) in foreign_keys:
        other_table = foreign_keys[(table, column)]
        return markupsafe.Markup(
            '<a href="{href}">{label}</a>'.format(
                href=markupsafe.escape(
                    "{}/{}".format(
                        datasette.urls.table(database, other_table), value
                    )
                ),
                label=markupsafe.escape(value),
            )
        )
Created another view that joins the _commit_at date onto item_versions:
% sqlite-utils create-view ca-fires-m2m-2.db item_versions_details '
select commits.commit_at as _commit_at, commits.hash as _commit_hash, item_versions.*
from item_versions join commits on item_versions._commit = commits.id'
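A sketch of the kind of query this view enables - the full history for one item, newest first (the item id here is arbitrary):

import sqlite_utils

db = sqlite_utils.Database("ca-fires-m2m-2.db")
history = db.execute(
    """
    select _commit_at, _commit_hash, _version
    from item_versions_details
    where _item = :item
    order by _commit_at desc
    """,
    {"item": 1},
).fetchall()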
I shipped Datasette 0.59.4 with fixes for a couple more bugs that came up with the _item_id underscore-prefixed column names: https://docs.datasette.io/en/stable/changelog.html#v0-59-4
OK, I like the m2m version of this. I'm going to add documentation and tests and ship it.
I'm tempted to make this the default, and have an option for recording full duplicate records in the item_versions table.
If this was the default behaviour, what would the option to store full records instead of only the changes be called? I think --full-versions.
This is mostly done - I still need to add documentation and tests, and maybe switch the default behaviour.
I'm testing this against sf-tree-history now.
(git-history) git-history % time git-history file sf-trees-changed.db ../sf-tree-history/Street_Tree_List.csv --repo ../sf-tree-history --changed --id TreeID --csv
[------------------------------------] 1/247 0%
I turned on WAL mode first so that I could run Datasette against it while it processes:
sqlite3 sf-trees-changed.db vacuum
sqlite-utils enable-wal sf-trees-changed.db
It's still on the first commit and Datasette already shows this:
That's because this CSV file has ~190,000 rows in it, so just the first commit will create 190,000 item versions and 190,000 * 18 = 3,420,000 rows in the item_changed table.
Be interesting to see what the remaining 248 commits do to the database size.
I restarted it to try out an optimization, but then realized there's a nasty bug: the _version resets to 1 if you restart it even if there are already commits in the database.
This optimization - which reduces the number of times we call db["columns"].lookup() by a huge amount - seems to have made a solid difference to performance.
diff --git a/git_history/cli.py b/git_history/cli.py
index 9be75c3..b294623 100644
--- a/git_history/cli.py
+++ b/git_history/cli.py
@@ -170,6 +170,17 @@ def file(
     item_id_to_last_full_hash = {}
     item_id_to_previous_version = {}

+    column_name_to_id = {}
+
+    def column_id(column):
+        if column not in column_name_to_id:
+            id = db["columns"].lookup(
+                {"namespace": namespace_id, "name": column},
+                foreign_keys=(("namespace", "namespaces", "id"),),
+            )
+            column_name_to_id[column] = id
+        return column_name_to_id[column]
+
     for git_commit_at, git_hash, content in iterate_file_versions(
         resolved_repo,
         resolved_filepath,
@@ -325,24 +336,21 @@ def file(
             if changed_columns:
                 # Record which columns changed in the changed m2m table
-                for column in changed_columns:
-                    db[changed_table].insert(
+                db[changed_table].insert_all(
+                    (
                         {
                             "item_version": item_version_id,
-                            "column": db["columns"].lookup(
-                                {"namespace": namespace_id, "name": column},
-                                foreign_keys=(
-                                    ("namespace", "namespaces", "id"),
-                                ),
-                            ),
-                        },
-                        pk=("item_version", "column"),
-                        foreign_keys=(
-                            ("item_version", version_table, "_id"),
-                            ("column", "columns", "id"),
-                            ("namespace", "namespaces", "id"),
-                        ),
-                    )
+                            "column": column_id(column),
+                        }
+                        for column in changed_columns
+                    ),
+                    pk=("item_version", "column"),
+                    foreign_keys=(
+                        ("item_version", version_table, "_id"),
+                        ("column", "columns", "id"),
+                        ("namespace", "namespaces", "id"),
+                    ),
+                )
             else:
                 # no --id - so just correct for reserved columns and add item["_commit"]
Now at 52 minutes estimated remaining:
(git-history) git-history % time git-history file sf-trees-changed.db ../sf-tree-history/Street_Tree_List.csv --repo ../sf-tree-history --changed --id TreeID --csv
[####--------------------------------] 34/247 13% 00:52:39
It broke!
[#####-------------------------------] 40/247 16% 00:57:55
Error: Commit: 3fb63a99dfab8a75c83d341c67afc9abf484e0c4 - every item must have the --id keys. These items did not:
[
{
"1": "9",
"DPW Maintained": "DPW Maintained",
"Pinus radiata :: Monterey Pine": "Palm (unknown Genus) :: Palm Spp",
"10 10th Ave": "97 12th St",
"Median : Cutout": "Median : Cutout",
"Tree": "Tree",
"DPW": "DPW",
"": "",
"3": "20",
"16": "3X3",
"5992719.40718": "6007166.5822125",
"2114869.71245": "2109668.7679164",
"37.7866097734767": "37.7731533608049",
"-122.468982710496": "-122.418629924651",
"(37.7866097734767, -122.468982710496)": "(37.7731533608049, -122.418629924651)",
"11": "19",
"6": "2",
"54": "28853"
},
...
https://github.com/simonw/sf-tree-history/commit/3fb63a99dfab8a75c83d341c67afc9abf484e0c4 is the commit in question - it says "sort CSV before storing" - so maybe I need to only process commits AFTER that point?
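One quick way to see how many commits come after that one, sketched with GitPython (the repo path, file path and commit hash are taken from the runs above; this isn't a git-history feature):

import git

repo = git.Repo("../sf-tree-history")
# Commits made after the "sort CSV before storing" commit, oldest first,
# limited to the tree list CSV.
later = list(
    repo.iter_commits(
        "3fb63a99dfab8a75c83d341c67afc9abf484e0c4..HEAD",
        paths="Street_Tree_List.csv",
        reverse=True,
    )
)
print(len(later))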
(Original title: Option to store only columns that have changed in item_versions)
When browsing a list of item versions like this one, it's difficult to tell at a glance which rows have changed since the previous version:
It would be neat if there was a mode that could only store values in the versions table for columns that have changed since the last version.
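To illustrate the idea with made-up numbers: a full version row repeats every column, while a changed-columns-only row would leave the untouched columns null (which is where the null-ambiguity question earlier in the thread comes from):

# Hypothetical second version of a fire record - values invented.
full_version = {
    "AcresBurned": 912,
    "PercentContained": 45,
    "ConditionStatement": "Crews continue to hold containment lines",
    "Helicopters": 2,
}

# The same version stored in "changed columns only" mode: unchanged
# columns stay null, so far fewer values are stored per version.
sparse_version = {
    "AcresBurned": 912,
    "PercentContained": 45,
    "ConditionStatement": None,
    "Helicopters": None,
}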