simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Add ?_extra= mechanism for requesting extra properties in JSON #262

Open simonw opened 6 years ago

simonw commented 6 years ago

Datasette views currently work by creating a set of data that should be returned as JSON, then defining an additional, optional template_data() function which is called if the view is being rendered as HTML.

This template_data() function calculates extra template context variables which are necessary for the HTML view but should not be included in the JSON.

Example of how that is used today: https://github.com/simonw/datasette/blob/2b79f2bdeb1efa86e0756e741292d625f91cb93d/datasette/views/table.py#L672-L704

With features like Facets in #255 I'm beginning to want to move more items into template_data() - in the case of facets it's the suggested_facets array. This saves those suggested facets from being calculated (which involves several SQL queries) for the JSON case, where they are unlikely to be used.

But... as an API user, I want to still optionally be able to access that information.

Solution: Add a ?_extra=suggested_facets&_extra=table_metadata argument which can be used to optionally request additional blocks to be added to the JSON API.

Then redefine as many of the current template_data() features as extra arguments instead, and teach Datasette to return certain extras by default when rendering templates.

This could allow the JSON representation to be slimmed down further (removing e.g. the table_definition and view_definition keys) while still making that information available to API users who need it.
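
A hypothetical sketch of how that could work (the names and the default set here are illustrative assumptions, not a final design): the JSON view would compute only the extras that were explicitly requested, while the HTML view would also ask for a default set so templates keep working unchanged.

# Hypothetical sketch - DEFAULT_HTML_EXTRAS and extras_to_compute are
# illustrative names, not part of Datasette.
DEFAULT_HTML_EXTRAS = {"suggested_facets", "table_definition", "view_definition"}


def extras_to_compute(requested, rendering_html):
    extras = set(requested)
    if rendering_html:
        # HTML templates still get everything they need by default
        extras |= DEFAULT_HTML_EXTRAS
    return extras


print(extras_to_compute({"table_metadata"}, rendering_html=False))
# {'table_metadata'}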

simonw commented 6 years ago

Idea: ?_extra=sqllog could output a log of every individual SQL statement that was executed in order to generate the page - useful for seeing how foreign key expansion and faceting actually work.

simonw commented 4 years ago

Idea: ?_extra=sqllog could output a log of every individual SQL statement that was executed in order to generate the page - useful for seeing how foreign key expansion and faceting actually work.

I built a version of that a while ago as the ?_trace=1 argument.

simonw commented 4 years ago

Are there any interesting use-cases for a plugin hook that allows plugins to define their own ?_extra= blocks?

simonw commented 3 years ago

Just realized I added an undocumented ?_extras= option to the row view years ago and forgot about it. Added in a30c5b220c15360d575e94b0e67f3255e120b916 - https://latest.datasette.io/fixtures/attraction_characteristic/2.json?_extras=foreign_key_tables

That will need to be made consistent with the new mechanism. I think ?_extra=a&_extra=b is more consistent with other Datasette features (like ?_facet=col1&_facet=col2) but potentially quite verbose.

So I could support ?_extra=a,b,c as an alternative allowed syntax, or I could allow ?_extra=single and ?_extras=comma,separated.

I think I prefer allowing commas in ?_extra=.

simonw commented 3 years ago

In the documentation for ?_extra= I think I'll emphasize the comma-separated version of it. Also: there will be ?_extra= values which act as aliases for combinations of other extras - e.g. ?_extra=full will toggle everything.

simonw commented 3 years ago

I think it's worth having a plugin hook for this - it can be same hook that is used internally. Maybe register_extra - it lets you return one or more extra implementations, each with a name and an async function that gets called.

Things like suggested facets will become register_extra hooks. Maybe actual facets too?
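
A speculative sketch of what a plugin using such a register_extra hook might look like - the hook name, its signature and the (name, async function) return shape are all assumptions taken from this comment, not a shipped Datasette API:

from datasette import hookimpl


@hookimpl
def register_extra(datasette):
    async def table_size(database, table):
        # Count the rows in the table being rendered
        db = datasette.get_database(database)
        result = await db.execute("select count(*) from [{}]".format(table))
        return result.first()[0]

    # One or more extras, each a name plus an async function
    return [("table_size", table_size)]

Each named function would then be requestable as ?_extra=table_size alongside the built-in extras.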

simonw commented 3 years ago

This is now blocking https://github.com/simonw/datasette-graphql/issues/61 because that issue needs a way to turn off suggested facets when retrieving the results of a table query.

simonw commented 3 years ago

I think I should prioritize the facets component of this, since that could have significant performance wins while also supporting datasette-graphql.

simonw commented 2 years ago

This is relevant to the big refactor in:

simonw commented 2 years ago

I spotted in https://github.com/simonw/datasette/issues/1719#issuecomment-1108888494 that there's actually already an undocumented implementation of ?_extras=foreign_key_tables - https://latest.datasette.io/fixtures/simple_primary_key/1.json?_extras=foreign_key_tables

I added that feature all the way back in November 2017! https://github.com/simonw/datasette/commit/a30c5b220c15360d575e94b0e67f3255e120b916

simonw commented 1 year ago

As suggested in this issue:

There are three parts of the Datasette API that need to support extras:

There are two other pages I should consider though:

simonw commented 1 year ago

I'm not actually too happy about how /fixtures.json currently entirely changes shape based on whether or not you pass a ?sql= argument to it.

Maybe I can fix that disparity with extras too?

The list of tables you see on /fixtures.json without the ?sql= could become another extra. The HTML version of that page could know to request that extra by default.

This would also support running a SQL query while also returning a list of tables - which can be useful for building a SQL editor interface that hints at the tables available to the user - or even for generating the configuration needed by the CodeMirror editor's SQL completion, added in:

simonw commented 1 year ago

I'm tempted NOT to document the JSON for the /.json page, simply because I'm not at all convinced that the current homepage design is the best possible use of that space - and I'd like to reserve the opportunity to redesign that in e.g. Datasette 1.1 without it being a breaking change to the documented JSON API.

simonw commented 1 year ago

Thinking about ?_extra= values just for the table JSON. The default shape will look like this:

{
  "ok": true,
  "rows": [{"id": 1, "name": "Name"}],
  "next": null
}

The table extras could be:

Stuff currently in https://latest.datasette.io/fixtures/facetable.json that is not yet covered by the above:

  "database": "fixtures",
  "table": "facetable",
  "is_view": false,
  "human_description_en": "where id = 1",
  "private": false,
  "allow_execute_sql": true,
  "query_ms": 16.749476999393664,

I'm tempted to bundle database, table, is_view and human_description_en into one (not sure what to call it though, perhaps display_details?) - and then drop allow_execute_sql entirely and have private and query_ms as their own named extras.

simonw commented 1 year ago

Or maybe have a permissions extra which includes allow_execute_sql and private? Could anything else go in there?

simonw commented 1 year ago

In most cases, the ?_extra=xxx name exactly corresponds to the additional key that is added to the JSON.

?_facet=... is one example of a query string argument that causes an extra key - "facet_results" - to be added to the JSON even though it wasn't requested by name in a ?_extra=.

Am I OK with that? I think so.

Related issue:

Actually there's an edge-case here that's worth considering: it's possible to use metadata to set default facets for a table. If you do this for a table, then .json for that table will always calculate and return those facets - which may be an expensive and unnecessary operation.

So maybe we don't include facet_results in the JSON unless explicitly asked for in that case, but have a rule that ?_facet implies ?_extra=facet_results.
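
A minimal sketch of that rule (illustrative names only, not the real view code): explicitly requested ?_facet= values imply the facet_results extra, while default facets configured in metadata do not.

def resolve_extras(requested_extras, facet_args):
    # facet_args holds the explicit ?_facet= values from the query string
    extras = set(requested_extras)
    if facet_args:
        extras.add("facet_results")
    return extras


print(resolve_extras({"count"}, ["author"]))  # {'count', 'facet_results'}
print(resolve_extras({"count"}, []))          # {'count'}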

simonw commented 1 year ago

I'm going to write code which parses ?_extra= in either the comma-separated or the multiple-parameter format and then looks up functions in a dictionary.

It will return an error if you ask for an extra that does not exist.
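
A hypothetical sketch of that parsing-plus-lookup approach - the function and key names are illustrative, not the final implementation:

async def extra_count():
    return 0  # stand-in for the real count query


async def extra_suggested_facets():
    return []  # stand-in for the real suggestion logic


EXTRAS = {
    "count": extra_count,
    "suggested_facets": extra_suggested_facets,
}


def parse_extras(raw_values):
    # raw_values is request.args.getlist("_extra"), e.g.
    # ["count,suggested_facets"] or ["count", "suggested_facets"]
    extras = set()
    for value in raw_values:
        extras.update(value.split(","))
    unknown = extras - EXTRAS.keys()
    if unknown:
        raise ValueError("Unknown _extra: {}".format(", ".join(sorted(unknown))))
    return extras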

simonw commented 1 year ago

Got first prototype working using asyncinject and it's pretty nice:

diff --git a/datasette/views/table.py b/datasette/views/table.py
index ad45ecd3..c8690b22 100644
--- a/datasette/views/table.py
+++ b/datasette/views/table.py
@@ -2,6 +2,7 @@ import asyncio
 import itertools
 import json

+from asyncinject import Registry
 import markupsafe

 from datasette.plugins import pm
@@ -538,57 +539,60 @@ class TableView(DataView):
         # Execute the main query!
         results = await db.execute(sql, params, truncate=True, **extra_args)

-        # Calculate the total count for this query
-        count = None
-        if (
-            not db.is_mutable
-            and self.ds.inspect_data
-            and count_sql == f"select count(*) from {table_name} "
-        ):
-            # We can use a previously cached table row count
-            try:
-                count = self.ds.inspect_data[database_name]["tables"][table_name][
-                    "count"
-                ]
-            except KeyError:
-                pass
-
-        # Otherwise run a select count(*) ...
-        if count_sql and count is None and not nocount:
-            try:
-                count_rows = list(await db.execute(count_sql, from_sql_params))
-                count = count_rows[0][0]
-            except QueryInterrupted:
-                pass
-
-        # Faceting
-        if not self.ds.setting("allow_facet") and any(
-            arg.startswith("_facet") for arg in request.args
-        ):
-            raise BadRequest("_facet= is not allowed")
+        # Resolve extras
+        extras = _get_extras(request)
+        if request.args.getlist("_facet"):
+            extras.add("facet_results")

-        # pylint: disable=no-member
-        facet_classes = list(
-            itertools.chain.from_iterable(pm.hook.register_facet_classes())
-        )
-        facet_results = {}
-        facets_timed_out = []
-        facet_instances = []
-        for klass in facet_classes:
-            facet_instances.append(
-                klass(
-                    self.ds,
-                    request,
-                    database_name,
-                    sql=sql_no_order_no_limit,
-                    params=params,
-                    table=table_name,
-                    metadata=table_metadata,
-                    row_count=count,
-                )
+        async def extra_count():
+            # Calculate the total count for this query
+            count = None
+            if (
+                not db.is_mutable
+                and self.ds.inspect_data
+                and count_sql == f"select count(*) from {table_name} "
+            ):
+                # We can use a previously cached table row count
+                try:
+                    count = self.ds.inspect_data[database_name]["tables"][table_name][
+                        "count"
+                    ]
+                except KeyError:
+                    pass
+
+            # Otherwise run a select count(*) ...
+            if count_sql and count is None and not nocount:
+                try:
+                    count_rows = list(await db.execute(count_sql, from_sql_params))
+                    count = count_rows[0][0]
+                except QueryInterrupted:
+                    pass
+            return count
+
+        async def facet_instances(extra_count):
+            facet_instances = []
+            facet_classes = list(
+                itertools.chain.from_iterable(pm.hook.register_facet_classes())
             )
+            for facet_class in facet_classes:
+                facet_instances.append(
+                    facet_class(
+                        self.ds,
+                        request,
+                        database_name,
+                        sql=sql_no_order_no_limit,
+                        params=params,
+                        table=table_name,
+                        metadata=table_metadata,
+                        row_count=extra_count,
+                    )
+                )
+            return facet_instances
+
+        async def extra_facet_results(facet_instances):
+            facet_results = {}
+            facets_timed_out = []

-        async def execute_facets():
             if not nofacet:
                 # Run them in parallel
                 facet_awaitables = [facet.facet_results() for facet in facet_instances]
@@ -607,9 +611,13 @@ class TableView(DataView):
                         facet_results[key] = facet_info
                     facets_timed_out.extend(instance_facets_timed_out)

-        suggested_facets = []
+            return {
+                "results": facet_results,
+                "timed_out": facets_timed_out,
+            }

-        async def execute_suggested_facets():
+        async def extra_suggested_facets(facet_instances):
+            suggested_facets = []
             # Calculate suggested facets
             if (
                 self.ds.setting("suggest_facets")
@@ -624,8 +632,15 @@ class TableView(DataView):
                 ]
                 for suggest_result in await gather(*facet_suggest_awaitables):
                     suggested_facets.extend(suggest_result)
+            return suggested_facets
+
+        # Faceting
+        if not self.ds.setting("allow_facet") and any(
+            arg.startswith("_facet") for arg in request.args
+        ):
+            raise BadRequest("_facet= is not allowed")

-        await gather(execute_facets(), execute_suggested_facets())
+        # pylint: disable=no-member

         # Figure out columns and rows for the query
         columns = [r[0] for r in results.description]
@@ -732,17 +747,56 @@ class TableView(DataView):
             rows = rows[:page_size]

         # human_description_en combines filters AND search, if provided
-        human_description_en = filters.human_description_en(
-            extra=extra_human_descriptions
-        )
+        async def extra_human_description_en():
+            human_description_en = filters.human_description_en(
+                extra=extra_human_descriptions
+            )
+            if sort or sort_desc:
+                human_description_en = " ".join(
+                    [b for b in [human_description_en, sorted_by] if b]
+                )
+            return human_description_en

         if sort or sort_desc:
             sorted_by = "sorted by {}{}".format(
                 (sort or sort_desc), " descending" if sort_desc else ""
             )
-            human_description_en = " ".join(
-                [b for b in [human_description_en, sorted_by] if b]
-            )
+
+        async def extra_next_url():
+            return next_url
+
+        async def extra_columns():
+            return columns
+
+        async def extra_primary_keys():
+            return pks
+
+        registry = Registry(
+            extra_count,
+            extra_facet_results,
+            extra_suggested_facets,
+            facet_instances,
+            extra_human_description_en,
+            extra_next_url,
+            extra_columns,
+            extra_primary_keys,
+        )
+
+        results = await registry.resolve_multi(
+            ["extra_{}".format(extra) for extra in extras]
+        )
+        data = {
+            "ok": True,
+            "rows": rows[:page_size],
+            "next": next_value and str(next_value) or None,
+        }
+        data.update({
+            key.replace("extra_", ""): value
+            for key, value in results.items()
+            if key.startswith("extra_")
+            and key.replace("extra_", "") in extras
+        })
+        return Response.json(data, default=repr)

         async def extra_template():
             nonlocal sort
@@ -1334,3 +1388,11 @@ class TableDropView(BaseView):

         await db.execute_write_fn(drop_table)
         return Response.json({"ok": True}, status=200)
+
+
+def _get_extras(request):
+    extra_bits = request.args.getlist("_extra")
+    extras = set()
+    for bit in extra_bits:
+        extras.update(bit.split(","))
+    return extras

With that in place, http://127.0.0.1:8001/content/releases?author=25778&_size=1&_extra=count,primary_keys,columns&_facet=author returns:

{
  "ok": true,
  "rows": [
    {
      "html_url": "https://github.com/eyeseast/geocode-sqlite/releases/tag/0.1.2",
      "id": 30926270,
      "author": {
        "value": 25778,
        "label": "eyeseast"
      },
      "node_id": "MDc6UmVsZWFzZTMwOTI2Mjcw",
      "tag_name": "0.1.2",
      "target_commitish": "master",
      "name": "v0.1.2",
      "draft": 0,
      "prerelease": 1,
      "created_at": "2020-09-08T17:48:24Z",
      "published_at": "2020-09-08T17:50:15Z",
      "body": "Basic API is in place, with CLI support for Google, Bing, MapQuest and Nominatum (OSM) geocoders.",
      "repo": {
        "value": 293361514,
        "label": "geocode-sqlite"
      },
      "reactions": null,
      "mentions_count": null
    }
  ],
  "next": "30926270",
  "primary_keys": [
    "id"
  ],
  "columns": [
    "html_url",
    "id",
    "author",
    "node_id",
    "tag_name",
    "target_commitish",
    "name",
    "draft",
    "prerelease",
    "created_at",
    "published_at",
    "body",
    "repo",
    "reactions",
    "mentions_count"
  ],
  "count": 25,
  "facet_results": {
    "results": {
      "author": {
        "name": "author",
        "type": "column",
        "hideable": true,
        "toggle_url": "/content/releases?author=25778&_size=1&_extra=count%2Cprimary_keys%2Ccolumns",
        "results": [
          {
            "value": 25778,
            "label": "eyeseast",
            "count": 25,
            "toggle_url": "http://127.0.0.1:8001/content/releases?_size=1&_extra=count%2Cprimary_keys%2Ccolumns&_facet=author",
            "selected": true
          }
        ],
        "truncated": false
      }
    },
    "timed_out": []
  }
}
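
For reference, here is a minimal standalone sketch (not taken from the PR) of the asyncinject pattern the prototype leans on: each extra_* function declares its dependencies as parameter names, and the Registry resolves them, awaiting independent functions concurrently.

import asyncio
from asyncinject import Registry


async def extra_count():
    await asyncio.sleep(0.1)  # stand-in for an expensive count(*) query
    return 25


async def extra_facet_results(extra_count):
    # Receives the resolved value of extra_count via its parameter name
    return {"author": {"count": extra_count}}


async def main():
    registry = Registry(extra_count, extra_facet_results)
    # resolve_multi() returns a dict keyed by function name, as in the diff above
    results = await registry.resolve_multi(["extra_count", "extra_facet_results"])
    print(results)


asyncio.run(main())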

simonw commented 1 year ago

Implementing this to work with the .json extension is going to be a lot harder.

The challenge here is that we're working with the whole BaseView() vs. TableView() abstraction, which I've been wanting to get rid of for a long time.

BaseView() calls .data() and expects to get back a (data, extra_template_data, templates) tuple - then if a format is in play (.json or .geojson or similar from a plugin) it hands data off to that renderer. If .csv is involved it does something special in order to support streaming responses. And if it's regular HTML it calls await extra_template_data() and combines that with data before passing it to the template.

I want this to work completely differently: I want the formats (including HTML) to have the option of adding some extra ?_extra= extras, then I want HTML to be able to render the page entirely from the JSON if necessary.

simonw commented 1 year ago

I pushed my prototype so far, going to start a draft PR for it.

simonw commented 1 year ago

It's annoying that the https://docs.datasette.io/en/0.64.1/plugin_hooks.html#register-output-renderer-datasette plugin hook passes rows as "list of sqlite3.Row objects" - I'd prefer it if that plugin hook worked with JSON data, not sqlite3.Row.

https://docs.datasette.io/en/0.64.1/plugin_hooks.html#render-cell-row-value-column-table-database-datasette is documented as accepting Row but actually gets CustomRow, see:

simonw commented 1 year ago

Maybe "rows" should be a default ?_extra=... but it should be possible to request "arrays" instead which would be a list of arrays, more suitable perhaps for custom renderers such as the CSV one.

This could be quite neat, in that EVERY key in the JSON representation would be defined as an extra - just some would be on by default. There could even be a mechanism for turning them back off again, maybe using ?_extra=-rows.

In which case maybe ?_extra= isn't actually the right name for this feature. It could be ?_key= perhaps, or ?_field=.

Being able to pass ?_field=count,-rows to get back just the count (and skip executing the count entirely) would be pretty neat.

Although ?_only=count would be tidier. So maybe the pair of ?_only= and ?_extra= would make sense.

Would ?_only=rows still return the "ok" field so you can always look at that to confirm an error didn't occur?

simonw commented 1 year ago

This issue here would benefit from some kind of mechanism for returning just the HTML of the table itself, without any of the surrounding material. I'm not sure if that would make sense as an extra or not:

simonw commented 1 year ago

I think that does make sense: ?_extra=table perhaps, which would add {"table": "..."}.

simonw commented 1 year ago

Just realized that it's useful to be able to tell what parameters were used to generate a page... but reflecting things like _next back in the JSON is confusing in the presence of next.

So I'm going to add an extra for that information too.

Not sure what to call it though:

I'm going to experiment with a request extra that returns some bits of information about the request.

simonw commented 1 year ago

I just landed this PR so this feature is now in main:

Still needs documentation and maybe some extra tests too.

simonw commented 1 year ago

I need to get the arbitrary query page to return the same format. It likely won't have nearly as many extras.