simonw / git-history

Tools for analyzing Git history using SQLite
Apache License 2.0
191 stars 18 forks source link

Default to only storing columns that have changed in item_version #21

Closed simonw closed 2 years ago

simonw commented 3 years ago

(Original title: Option to store only columns that have changed in item_versions)

When browsing a list of item versions like this one it's difficult to tell at a glance which rows have changed since the previous version:

image

It would be neat if there was a mode that could only store values in the versions table for columns that have changed since the last version.

simonw commented 3 years ago

This could also help shrink the size of the overall database (see also #12).

One catch: does null mean "this value did not change" or does it mean "this value was set to null"?

I'm going to see if I can address that using a bitmap - an additional column, probably called _changed (though I'm worried that might be confused for a timestamp) which stores an integer which represents which columns changed since the previous item.

A custom Datasette plugin could read that bitmap and use it to show extra information when the table is displayed.

simonw commented 3 years ago

I started my prototype by writing a _changed column that contained a JSON list of columns, rather than a bitmap: ["AcresBurned", "PercentContained", "ConditionStatement", "Updated", "PersonnelInvolved", "CrewsInvolved", "Helicopters", "Engines", "Dozers"]

I really like this, but it's very verbose and consumes a lot of table space.

So... how about turning that into a lookup table? So the JSON value becomes a key in another table?

This would save on space, and display reasonably well. Running a query for "all versions that had an updated Dozers column" would be possible but would involve a join against that table plus some JSON shenanigans, which wouldn't be particularly simple for users.

Probably simpler than dealing with a bitmap column though!

simonw commented 3 years ago

Alternative formats for the _changed column:

Maybe the many-to-many relationship would be the best way to handle this? I'm slightly put off because Datasette doesn't have facet by many-to-many yet - I mostly implemented that a few years ago but never finished it, see https://github.com/simonw/datasette/issues/551

simonw commented 2 years ago

After sweating over this decision for a full week, I'm going to try the m2m table and see how well it works.

I may try turning it back into a JSON list for one of the views.

simonw commented 2 years ago

That commit is my initial prototype of a _changed column which stores a full JSON list of column names.

I'm going to try using m2m instead next.

simonw commented 2 years ago

Since item_versions currently uses a compound primary key I'm ending up with a m2m table that relates against a compound foreign key, which doesn't work great with Datasette yet. Code looks something like this:

                    db["item_versions"].insert(
                        item_version,
                        pk=("_item", "_version"),
                        alter=True,
                        replace=True,
                        column_order=("_item", "_version", "_commit"),
                        foreign_keys=(
                            ("_item", "items", "_id"),
                            ("_commit", "commits", "id"),
                        ),
                    )

                    if changed_columns:
                        # Record which columns changed in the changed m2m table
                        for column in changed_columns:
                            db["changed"].insert({
                                "_item": item_id,
                                "_version": version,
                                "column": db["columns"].lookup({"name": column})
                            }, pk=("_item", "_version", "column"), foreign_keys=(
                                ("column", "columns", "id"),
                                ("_item_id", "items", "_id"),
                            ))

I think I'll switch item_versions to having a regular integer primary key instead to make the m2m table more predictable.

simonw commented 2 years ago

The changed table looks like this:

CREATE TABLE [changed] (
   [item_version] INTEGER REFERENCES [item_versions]([id]),
   [column] INTEGER REFERENCES [columns]([id]),
   PRIMARY KEY ([item_version], [column])
);

This makes it difficult to facet by item in the Datasette interface because you have to first go through item_version - I can fix that by adding a changed_details view or similar that makes that join for you.

simonw commented 2 years ago
% sqlite-utils create-view ca-fires-m2m-2.db changed_details '
select
  item_versions.id as _item_version_pk,
  items._id as _item_pk,
  item_versions._version,
  columns.name
from
  changed
  join item_versions on changed.item_version = item_versions.id
  join columns on changed.column = columns.id
  join items on item_versions._item = items._id'
simonw commented 2 years ago

That prototype worked, but highlighted that without foreign key links on views this interface isn't as useful as I'd like it to be:

image
simonw commented 2 years ago

I'm tempted to address that problem using a Datasette render_cell() plugin, which looks out for a table (actually a view) with that name and columns of that name and turns them into hyperlinks to the right place.

Maybe a datasette-git-history-viewer plugin?

Or... datasette-git-history, which depends on git-history and also adds a datasette git-history CLI command using the new register_commands hook.

simonw commented 2 years ago

Building a Datasette plugin that imports an existing Click CLI tool and re-registers it is proving hard - Click doesn't really want you to do that. I tried this:

from datasette import hookimpl
from git_history.cli import file as git_history_file

@hookimpl
def register_commands(cli):
    cli.command(name="git-history")(git_history_file.callback)

But when I run this:

 % datasette git-history --help     
Usage: datasette git-history [OPTIONS]

  Analyze the history of a specific file and write it to SQLite

Options:
  --help  Show this message and exit.

The options are all missing - which means that the command doesn't actually work. Will need to research this pattern separately.

simonw commented 2 years ago

Figured out a pattern that works: https://til.simonwillison.net/datasette/reuse-click-for-register-commands

simonw commented 2 years ago

Here's the full plugin, adding hyperlinks to that view:

from datasette import hookimpl
import markupsafe
from git_history.cli import cli as git_history_cli

@hookimpl
def register_commands(cli):
    cli.add_command(git_history_cli, name="git-history")

foreign_keys = {
    # table, column => other-table
    ("changed_details", "_item_version_pk"): "item_versions",
    ("changed_details", "_item_pk"): "items",
}

@hookimpl
def render_cell(value, column, table, database, datasette):
    if (table, column) in foreign_keys:
        other_table = foreign_keys[(table, column)]
        return markupsafe.Markup(
            '<a href="{href}">{label}</a>'.format(
                href=markupsafe.escape(
                    "{}/{}".format(
                        datasette.urls.table(database, other_table), value
                    )
                ),
                label=value,
            )
        )
simonw commented 2 years ago

Created another view to join _commit_at date to item_versions:

% sqlite-utils create-view ca-fires-m2m-2.db item_versions_details '
select commits.commit_at as _commit_at, commits.hash as _commit_hash, item_versions.*
from item_versions join commits on item_versions._commit = commits.id'
simonw commented 2 years ago

I shipped Datasette 0.59.4 with fixes for a couple more bugs that came up with the _item_id underscore prefixed column names: https://docs.datasette.io/en/stable/changelog.html#v0-59-4

simonw commented 2 years ago

OK, I like the m2m version of this. I'm going to add documentation and tests and ship it.

simonw commented 2 years ago

I'm tempted to make this the default, and have an option for recording full duplicate records in the item_versions table.

simonw commented 2 years ago

If this was the default behaviour, what would the option to store full records instead of only the changes be called?

I think --full-versions

simonw commented 2 years ago

This is mostly done, I still need to add documentation and tests - and maybe switch the default behaviour.

simonw commented 2 years ago

I'm testing this against sf-tree-history now.

(git-history) git-history % time git-history file sf-trees-changed.db ../sf-tree-history/Street_Tree_List.csv --repo ../sf-tree-history --changed --id TreeID --csv
  [------------------------------------]  1/247    0%

I turned on WAL mode first so that I could run Datasette against it while it processes:

sqlite3 sf-trees-changed.db vacuum
sqlite-utils enable-wal sf-trees-changed.db 

It's still on the first commit and Datasette already shows this:

Banners_and_Alerts_and_sf-trees-changed

That's because this CSV file has ~190,000 rows in it, so just the first commit will create 190,000 item versions and 190,000 * 18 = 3,420,000 rows in the item_changed table.

Be interesting to see what the remaining 248 commits do to the database size.

simonw commented 2 years ago

I restarted it to try out an optimization, but then realized there's a nasty bug: the _version resets to 1 if you restart it even if there are already commits in the database.

simonw commented 2 years ago

This option - which reduces the number of times we call db["column"].lookup() by a huge amount - seems to have made a solid difference to performance.

diff --git a/git_history/cli.py b/git_history/cli.py
index 9be75c3..b294623 100644
--- a/git_history/cli.py
+++ b/git_history/cli.py
@@ -170,6 +170,17 @@ def file(
     item_id_to_last_full_hash = {}
     item_id_to_previous_version = {}

+    column_name_to_id = {}
+
+    def column_id(column):
+        if column not in column_name_to_id:
+            id = db["columns"].lookup(
+                {"namespace": namespace_id, "name": column},
+                foreign_keys=(("namespace", "namespaces", "id"),),
+            )
+            column_name_to_id[column] = id
+        return column_name_to_id[column]
+
     for git_commit_at, git_hash, content in iterate_file_versions(
         resolved_repo,
         resolved_filepath,
@@ -325,24 +336,21 @@ def file(

                     if changed_columns:
                         # Record which columns changed in the changed m2m table
-                        for column in changed_columns:
-                            db[changed_table].insert(
+                        db[changed_table].insert_all(
+                            (
                                 {
                                     "item_version": item_version_id,
-                                    "column": db["columns"].lookup(
-                                        {"namespace": namespace_id, "name": column},
-                                        foreign_keys=(
-                                            ("namespace", "namespaces", "id"),
-                                        ),
-                                    ),
-                                },
-                                pk=("item_version", "column"),
-                                foreign_keys=(
-                                    ("item_version", version_table, "_id"),
-                                    ("column", "columns", "id"),
-                                    ("namespace", "namespaces", "id"),
-                                ),
-                            )
+                                    "column": column_id(column),
+                                }
+                                for column in changed_columns
+                            ),
+                            pk=("item_version", "column"),
+                            foreign_keys=(
+                                ("item_version", version_table, "_id"),
+                                ("column", "columns", "id"),
+                                ("namespace", "namespaces", "id"),
+                            ),
+                        )

         else:
             # no --id - so just correct for reserved columns and add item["_commit"]

Now at 52 minutes estimated remaining:

(git-history) git-history % time git-history file sf-trees-changed.db ../sf-tree-history/Street_Tree_List.csv --repo ../sf-tree-history --changed --id TreeID --csv
  [####--------------------------------]  34/247   13%  00:52:39
simonw commented 2 years ago

It broke!

  [#####-------------------------------]  40/247   16%  00:57:55Error: Commit: 3fb63a99dfab8a75c83d341c67afc9abf484e0c4 - every item must have the --id keys. These items did not:
[
    {
        "1": "9",
        "DPW Maintained": "DPW Maintained",
        "Pinus radiata :: Monterey Pine": "Palm (unknown Genus) :: Palm Spp",
        "10 10th Ave": "97 12th St",
        "Median : Cutout": "Median : Cutout",
        "Tree": "Tree",
        "DPW": "DPW",
        "": "",
        "3": "20",
        "16": "3X3",
        "5992719.40718": "6007166.5822125",
        "2114869.71245": "2109668.7679164",
        "37.7866097734767": "37.7731533608049",
        "-122.468982710496": "-122.418629924651",
        "(37.7866097734767, -122.468982710496)": "(37.7731533608049, -122.418629924651)",
        "11": "19",
        "6": "2",
        "54": "28853"
    },
...

https://github.com/simonw/sf-tree-history/commit/3fb63a99dfab8a75c83d341c67afc9abf484e0c4 is the commit in question - it says "sort CSV before storing" - so maybe I need to only process commits AFTER that point?

simonw commented 2 years ago

Documentation: https://github.com/simonw/git-history/blob/416566bf868aba2033db19fe09631d8162d4bd9b/README.md#track-the-history-of-individual-items-using-ids