simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

await datasette.client.get(path) mechanism for executing internal requests #943

Closed simonw closed 4 years ago

simonw commented 4 years ago

datasette-graphql works by making internal requests to the TableView class (in order to take advantage of existing pagination logic, plus options like ?_search= and ?_where=) - see #915

I want to support a mod_rewrite-style mechanism for putting nicer URLs on top of Datasette pages - I cobbled that together for a project here using an internal ASGI proxying trick: https://github.com/natbat/tidepools_near_me/commit/ec102c6da5a5d86f17628740d90b6365b671b5e1

If the datasette object provided a documented method for executing internal requests (in a way that makes sense with logging etc - i.e. doesn't get logged as a separate request) both of these use-cases would be much neater.

simonw commented 4 years ago

Could be as simple as response = await datasette.get("/path/blah") - which could also be re-used by the implementation of the datasette --get / CLI option introduced in #927.

Bit weird calling it .get() since that clashes with Python's dictionary .get() method.

simonw commented 4 years ago

Should it default to treating things as if they had the .json extension? There are use-cases for the non-JSON method, such as https://github.com/natbat/tidepools_near_me/commit/ec102c6da5a5d86f17628740d90b6365b671b5e1

I think I'm OK with people having to add .json to their internal calls. Maybe they could use format="json" as an optional parameter which would automatically handle the very weird edge-cases where you need to use ?_format=json instead of .json (due to table names existing with a .json suffix).

simonw commented 4 years ago

Alternative name possibilities:

simonw commented 4 years ago

Actually no - requests.get() and httpx.get() prove that having a .get() method for an HTTP-related API isn't confusing to people at all.

datasette.get() it is.

(I'll probably add datasette.post() in the future too).

simonw commented 4 years ago

Should internal requests executed in this way be handled by plugins that used the asgi_wrapper() hook?

Hard to be sure one way or the other. I'm worried about logging middleware triggering twice - but actually anyone doing serious logging of their Datasette instance is probably doing it in a different layer (uvicorn logs or nginx proxy or whatever) so they wouldn't be affected. There aren't any ASGI logging middlewares out there that I've seen.

Also: if you run into a situation where your stuff is breaking because datasette.get() is calling ASGI middleware twice you can fix it by running your ASGI middleware outside of the asgi_wrapper plugin hook mechanism.

So I think it DOES execute asgi_wrapper() middleware.

simonw commented 4 years ago

What about authentication checks etc? Won't they run twice?

I think that's OK too, in fact it's desirable: think of the case of datasette-graphql where a bunch of different TableView calls are being made as part of the same GraphQL queries. Having those calls take advantage of finely grained per-table authentication and permission checks seems like a good feature.

simonw commented 4 years ago

Right now calling datasette.app() instantiates an ASGI application - complete with a bunch of routes and wrappers - and returns that application object. Calling it twice instantiates another ASGI application.

I think a single Datasette instance should only ever create a single ASGI app - so the .app() method should cache the ASGI app that it returns the first time and return the same application again on future calls.
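
The caching described here could be sketched as follows. This is a minimal illustration, not Datasette's actual code: `CachedAppExample`, `_build_app` and `_asgi_app` are hypothetical names, and the ASGI app is a do-nothing stand-in.

```python
class CachedAppExample:
    """Hypothetical stand-in for Datasette, showing only the caching idea."""

    def __init__(self):
        self._asgi_app = None  # built lazily on the first .app() call

    def _build_app(self):
        # Stand-in for the real work of assembling routes and wrappers.
        async def asgi_app(scope, receive, send):
            pass

        return asgi_app

    def app(self):
        # Build once, then hand back the same object on every later call.
        if self._asgi_app is None:
            self._asgi_app = self._build_app()
        return self._asgi_app


ds = CachedAppExample()
first = ds.app()
second = ds.app()
```

With this shape, `first is second` holds for one instance, while a separate instance still gets its own app.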

simonw commented 4 years ago

One thing to consider here: Datasette's table and database name escaping rules can be a little bit convoluted.

If a plugin wants to get back the first five rows of a table, it will need to construct a URL /dbname/tablename?_size=5 - but it will need to know how to turn the database and table names into the correctly escaped dbname and tablename values.

Here's how the row.html template handles that right now: https://github.com/simonw/datasette/blob/b21ed237ab940768574c834aa5a7130724bd3a2d/datasette/templates/row.html#L19-L23

It would be an improvement to have this logic abstracted out somewhere and documented so plugins can use it.
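
As a rough sketch of what such an abstracted helper might look like: `table_path` is a hypothetical name, and plain percent-encoding via `urllib.parse.quote` is an assumption made for illustration. Datasette's real escaping rules may differ from this.

```python
from urllib.parse import quote, urlencode


def table_path(database, table, fmt=None, **params):
    """Hypothetical helper: build an escaped /database/table path."""
    # safe="" makes quote() escape "/" as well, so a slash inside a
    # name cannot be mistaken for a path separator.
    path = "/{}/{}".format(quote(database, safe=""), quote(table, safe=""))
    if fmt:
        path += "." + fmt
    if params:
        path += "?" + urlencode(params)
    return path


first_five = table_path("fixtures", "facetable", _size=5)
awkward = table_path("my db", "a/b", fmt="json")
```

A plugin would then pass the result straight to the internal request call instead of escaping names by hand.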

simonw commented 4 years ago

Maybe allow this:

response = await datasette.get("/{database}/{table}.json", database=database, table=table)

This could cause problems if users ever need to pass literal { in their paths. Maybe allow this too:

response = await datasette.get("/{database}/{table}.json", interpolate=False)

Not convinced this is useful - it's a bit unintuitive.
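
To make the trade-off concrete, here is a minimal sketch of the interpolation idea. `build_path` is a hypothetical function, and escaping values with `urllib.parse.quote` is an assumption rather than Datasette's actual escaping logic.

```python
from urllib.parse import quote


def build_path(template, interpolate=True, **values):
    """Hypothetical helper: fill {placeholders} with escaped values."""
    if not interpolate:
        # Leave the path untouched so literal "{" and "}" survive.
        return template
    escaped = {key: quote(str(value), safe="") for key, value in values.items()}
    return template.format(**escaped)


filled = build_path("/{database}/{table}.json", database="fixtures", table="my table")
literal = build_path("/{database}/{table}.json", interpolate=False)
```

The second call shows why the opt-out flag would be needed: without it, a path containing literal braces would raise a `KeyError` from `str.format`.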

simonw commented 4 years ago

I just realised that this mechanism is kind of like being able to use microservices - make API calls within your application - except that everything runs in the same process against SQLite databases so calls will be lightning fast.

It also means that a plugin can add a new internal API to Datasette that's accessible to other plugins by registering a new route with register_routes!

simonw commented 4 years ago

Also fun: the inevitable plugin that exposes this to the template language - so Datasette templates can stitch together data from multiple other internal API calls. Fun way to take advantage of async support in Jinja.

simonw commented 4 years ago

Need to decide what to do about JSON responses.

When called from a template it's likely the intent will be to further loop through the JSON data returned. It would be annoying to have to run json.loads here.

Maybe a .get_json() method then? Or even return a response that has .json() and .text similar to httpx - or just return an httpx response.
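
For illustration, a response object with httpx-style `.text` and `.json()` could be as small as the sketch below. `SimpleResponse` is a hypothetical class, not something that exists in Datasette or httpx.

```python
import json


class SimpleResponse:
    """Hypothetical response wrapper with httpx-style .text and .json()."""

    def __init__(self, status_code, body, headers=None):
        self.status_code = status_code
        self.body = body  # raw bytes, as received from the application
        self.headers = headers or {}

    @property
    def text(self):
        return self.body.decode("utf-8")

    def json(self):
        # Saves the caller from running json.loads() themselves.
        return json.loads(self.text)


response = SimpleResponse(200, b'[{"pk": 1, "state": "CA"}]')
rows = response.json()
```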

simonw commented 4 years ago

I'm leaning towards defaulting to JSON as the requested format - you can pass format="html" if you want HTML.

But weird that it's different from the web UI.

simonw commented 4 years ago

Maybe .get vs .get_html?

simonw commented 4 years ago

I'm not going to mess around with formats - you'll get back the exact response that a web client would receive.

Question: what should the response object look like? e.g. if you do:

response = await datasette.get("/db/table.json")

What should response be?

I could reuse the Datasette Response class from datasette.utils.asgi. This would work well for regular responses which just have a status code, some headers and a response body. It wouldn't be great for streaming responses though, such as those you get back from ?_stream=1 CSV exports.

simonw commented 4 years ago

So what should I do about streaming responses?

I could deliberately ignore them - raise an exception if you attempt to run await datasette.get(...) against a streaming URL.

I could load the entire response into memory and return it as a wrapped object.

I could support some kind of asynchronous iterator mechanism. This would be pretty elegant if I could decide the right syntax for it - it would allow plugins to take advantage of other internal URLs that return streaming content without needing to load that content entirely into memory in order to process it.
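
The difference between the second and third options can be sketched with a plain async generator. Everything here is a stand-in: `stream_chunks` fakes a streaming response body, and `count_bytes` and `load_all` are hypothetical consumers.

```python
import asyncio


async def stream_chunks(rows, chunk_size=2):
    """Hypothetical streaming body: yields CSV rows a few at a time."""
    for start in range(0, len(rows), chunk_size):
        chunk = "\n".join(rows[start:start + chunk_size]) + "\n"
        yield chunk.encode("utf-8")
        await asyncio.sleep(0)  # give other tasks a turn between chunks


async def count_bytes(chunks):
    # Option 3: process each chunk as it arrives, holding only one in memory.
    total = 0
    async for chunk in chunks:
        total += len(chunk)
    return total


async def load_all(chunks):
    # Option 2: collect the entire body in memory before returning it.
    return b"".join([chunk async for chunk in chunks])


rows = ["pk,name", "1,a", "2,b", "3,c", "4,d"]
total = asyncio.run(count_bytes(stream_chunks(rows)))
body = asyncio.run(load_all(stream_chunks(rows)))
```

Both consumers see the same bytes; only the peak memory use differs.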

simonw commented 4 years ago

Maybe these methods become the way most Datasette tests are written, replacing the existing TestClient mechanism?

simonw commented 4 years ago

I'm tempted to create an await datasette.request() method which can take any HTTP verb - then have datasette.get() and datasette.post() as thin wrappers around it.
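
A minimal sketch of that layering, under stated assumptions: `InternalClient` is a hypothetical class, and `fake_send` stands in for whatever would actually dispatch the request to the ASGI application.

```python
import asyncio


class InternalClient:
    """Hypothetical sketch: one generic request() plus thin per-verb wrappers."""

    def __init__(self, send):
        self._send = send  # async callable that performs the request

    async def request(self, method, path, **kwargs):
        return await self._send(method.upper(), path, **kwargs)

    async def get(self, path, **kwargs):
        return await self.request("GET", path, **kwargs)

    async def post(self, path, **kwargs):
        return await self.request("POST", path, **kwargs)


async def fake_send(method, path, **kwargs):
    # Stand-in transport: echoes back what it was asked to do.
    return (method, path, kwargs)


client = InternalClient(fake_send)
result_get = asyncio.run(client.get("/db/table.json"))
result_post = asyncio.run(client.post("/db", data={"a": 1}))
```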

simonw commented 4 years ago

What if datasette.get() was an alias for httpx.get(), pre-configured to route to the correct application? And with some sugar that added http://localhost/ to the beginning of the path if it was missing?

This would make httpx a dependency of core Datasette, which I think is OK.

It would also solve the return type problem: I would return whatever httpx returns.
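
The "sugar" for adding the host could be as simple as the sketch below. `with_default_host` is a hypothetical name, and it assumes that any path not starting with `/` is already a full URL.

```python
def with_default_host(path, host="http://localhost"):
    """Hypothetical helper: prefix bare paths with a placeholder host."""
    if path.startswith("/"):
        return host + path
    # Anything else is assumed to be a complete URL already.
    return path


internal_url = with_default_host("/-/config.json")
untouched = with_default_host("http://example.com/data.json")
```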

simonw commented 4 years ago

I could solve streaming using something like this:

async with datasette.stream("GET", "/fixtures/compound_three_primary_keys.csv?_stream=on&_size=max") as response:
    async for chunk in response.aiter_bytes():
        print(chunk)

Which would be a wrapper around AsyncClient.stream(method, url, ...) from https://www.python-httpx.org/async/#streaming-responses

simonw commented 4 years ago

I think I can use async with httpx.AsyncClient(base_url="http://localhost/") as client: to ensure I don't need to use http://localhost/ on every call.

simonw commented 4 years ago

Maybe instead of implementing datasette.get() and datasette.post() and datasette.request() and datasette.stream() I could instead have a nested object called datasette.client which is a preconfigured AsyncClient instance.

response = await datasette.client.get("/")

Or perhaps this should be a method in case I ever need to be able to await it:

response = await (await datasette.client()).get("/")

This is a bit cosmetically ugly though, I'd rather avoid that if possible.

Maybe I could get this working by returning an object from .client() which provides an await obj.get() method:

response = await datasette.client().get("/")

I don't think there's any benefit to that over await datasette.client.get() though.

simonw commented 4 years ago

Should I instantiate a single Client and reuse it for all internal requests, or can I instantiate a new Client for each request?

https://www.python-httpx.org/advanced/#why-use-a-client says that the main benefit of a Client instance is HTTP connection pooling - which isn't an issue for these internal requests since they won't be using the HTTP protocol at all, they'll be calling the ASGI application directly.

So I'm leaning towards instantiating a fresh client for every internal request. I'll run a microbenchmark to check that this doesn't have any unpleasant performance implications.

simonw commented 4 years ago

dogsheep-beta could do with this too. It currently makes a call to TableView in a similar way to datasette-graphql in order to calculate facets.

dogsheep-beta would benefit from a mechanism for changing the facet timeout setting during that call (as would datasette-graphql, see the DatasetteSpecialConfig mechanism it uses).

simonw commented 4 years ago

I put together a minimal prototype of this and it feels pretty good:

diff --git a/datasette/app.py b/datasette/app.py
index 20aae7d..fb3bdad 100644
--- a/datasette/app.py
+++ b/datasette/app.py
@@ -4,6 +4,7 @@ import collections
 import datetime
 import glob
 import hashlib
+import httpx
 import inspect
 import itertools
 from itsdangerous import BadSignature
@@ -312,6 +313,7 @@ class Datasette:
         self._register_renderers()
         self._permission_checks = collections.deque(maxlen=200)
         self._root_token = secrets.token_hex(32)
+        self.client = DatasetteClient(self)

     async def invoke_startup(self):
         for hook in pm.hook.startup(datasette=self):
@@ -1209,3 +1211,25 @@ def route_pattern_from_filepath(filepath):

 class NotFoundExplicit(NotFound):
     pass
+
+
+class DatasetteClient:
+    def __init__(self, ds):
+        self.app = ds.app()
+
+    def _fix(self, path):
+        if path.startswith("/"):
+            path = "http://localhost{}".format(path)
+        return path
+
+    async def get(self, path, **kwargs):
+        async with httpx.AsyncClient(app=self.app) as client:
+            return await client.get(self._fix(path), **kwargs)
+
+    async def post(self, path, **kwargs):
+        async with httpx.AsyncClient(app=self.app) as client:
+            return await client.post(self._fix(path), **kwargs)
+
+    async def options(self, path, **kwargs):
+        async with httpx.AsyncClient(app=self.app) as client:
+            return await client.options(self._fix(path), **kwargs)

Used like this in ipython:

In [1]: from datasette.app import Datasette

In [2]: ds = Datasette(["fixtures.db"])

In [3]: (await ds.client.get("/-/config.json")).json()
Out[3]: 
{'default_page_size': 100,
 'max_returned_rows': 1000,
 'num_sql_threads': 3,
 'sql_time_limit_ms': 1000,
 'default_facet_size': 30,
 'facet_time_limit_ms': 200,
 'facet_suggest_time_limit_ms': 50,
 'hash_urls': False,
 'allow_facet': True,
 'allow_download': True,
 'suggest_facets': True,
 'default_cache_ttl': 5,
 'default_cache_ttl_hashed': 31536000,
 'cache_size_kb': 0,
 'allow_csv_stream': True,
 'max_csv_mb': 100,
 'truncate_cells_html': 2048,
 'force_https_urls': False,
 'template_debug': False,
 'base_url': '/'}

In [4]: (await ds.client.get("/fixtures/facetable.json?_shape=array")).json()
Out[4]: 
[{'pk': 1,
  'created': '2019-01-14 08:00:00',
  'planet_int': 1,
  'on_earth': 1,
  'state': 'CA',
  'city_id': 1,
  'neighborhood': 'Mission',
  'tags': '["tag1", "tag2"]',
  'complex_array': '[{"foo": "bar"}]',
  'distinct_some_null': 'one'},
 {'pk': 2,
  'created': '2019-01-14 08:00:00',
  'planet_int': 1,
  'on_earth': 1,
  'state': 'CA',
  'city_id': 1,
  'neighborhood': 'Dogpatch',
  'tags': '["tag1", "tag3"]',
  'complex_array': '[]',
  'distinct_some_null': 'two'},

simonw commented 4 years ago

This adds httpx as a dependency - I think I'm OK with that. I use it for testing in all of my plugins anyway.

simonw commented 4 years ago

How important is it to use httpx.AsyncClient with a context manager?

https://www.python-httpx.org/async/#opening-and-closing-clients says:

Alternatively, use await client.aclose() if you want to close a client explicitly:

client = httpx.AsyncClient()
...
await client.aclose()

The .aclose() method has a comment saying "Close transport and proxies" - I'm not using proxies, so the relevant implementation seems to be a call to await self._transport.aclose() in https://github.com/encode/httpx/blob/f932af9172d15a803ad40061a4c2c0cd891645cf/httpx/_client.py#L1741-L1751

The transport I am using is a class called ASGITransport in https://github.com/encode/httpx/blob/master/httpx/_transports/asgi.py

The aclose() method on that class does nothing. So it looks like I can instantiate a client without bothering with the async with httpx.AsyncClient bit.

simonw commented 4 years ago

Even smaller DatasetteClient implementation:

class DatasetteClient:
    def __init__(self, ds):
        self._client = httpx.AsyncClient(app=ds.app())

    def _fix(self, path):
        if path.startswith("/"):
            path = "http://localhost{}".format(path)
        return path

    async def get(self, path, **kwargs):
        return await self._client.get(self._fix(path), **kwargs)

    async def post(self, path, **kwargs):
        return await self._client.post(self._fix(path), **kwargs)

    async def options(self, path, **kwargs):
        return await self._client.options(self._fix(path), **kwargs)

simonw commented 4 years ago

I may as well implement all of the HTTP methods supported by the httpx client:

simonw commented 4 years ago

class DatasetteClient:
    def __init__(self, ds):
        self._client = httpx.AsyncClient(app=ds.app())

    def _fix(self, path):
        if path.startswith("/"):
            path = "http://localhost{}".format(path)
        return path

    async def get(self, path, **kwargs):
        return await self._client.get(self._fix(path), **kwargs)

    async def options(self, path, **kwargs):
        return await self._client.options(self._fix(path), **kwargs)

    async def head(self, path, **kwargs):
        return await self._client.head(self._fix(path), **kwargs)

    async def post(self, path, **kwargs):
        return await self._client.post(self._fix(path), **kwargs)

    async def put(self, path, **kwargs):
        return await self._client.put(self._fix(path), **kwargs)

    async def patch(self, path, **kwargs):
        return await self._client.patch(self._fix(path), **kwargs)

    async def delete(self, path, **kwargs):
        return await self._client.delete(self._fix(path), **kwargs)

simonw commented 4 years ago

Am I going to rewrite ALL of my tests to use this instead? It would clean up a lot of test code, at the cost of quite a bit of work.

It would make for much neater plugin tests too, and neater testing documentation: https://docs.datasette.io/en/stable/testing_plugins.html

simonw commented 4 years ago

I want this in Datasette 0.50, so I can use it in datasette-graphql and suchlike.

simonw commented 4 years ago

Documentation (from #1006): https://docs.datasette.io/en/latest/internals.html#client