simonw / datasette

An open source multi-tool for exploring and publishing data
https://datasette.io
Apache License 2.0

Datasette shouldn't crash running against a database with missing extensions #2388

Open simonw opened 1 month ago

simonw commented 1 month ago

For example: after https://til.simonwillison.net/sqlite/sqlite-vec my https://til.simonwillison.net/tils.db database needs the vec0 extension if you want to browse the vec_tils table - but the rest of the UI should still work, skipping that table when scanning tables.

asg017 commented 1 month ago

This happens because of check_connection(), which we call on every database connection and which runs PRAGMA table_info() on every table. That means it fails on virtual tables whose parent extension isn't loaded.

We could ignore this class of errors with:

elif e.args[0].startswith("no such module: "):
    pass

But other parts of the codebase still call PRAGMA table_info() at other points after connection time, like table_column_details().

We could audit all cases of this and just return None/[] when appropriate.
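
A hypothetical wrapper along those lines - the name safe_table_column_details is invented here for illustration, not an existing Datasette helper:

import sqlite3

from datasette.utils import table_column_details


def safe_table_column_details(conn, table):
    # Return column details, or an empty list if this is a virtual table
    # whose backing extension (e.g. vec0) isn't loaded.
    try:
        return table_column_details(conn, table)
    except sqlite3.OperationalError as e:
        if str(e).startswith("no such module: "):
            return []
        raise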

simonw commented 1 month ago

I'm going to create a minimal sqlite-vec database to commit to our repo for the tests.

simonw commented 1 month ago

Creating it with sqlite-utils:

sqlite-utils install sqlite-utils-sqlite-vec
echo '{"point": "[1,2,3]"}' | sqlite-utils insert vec.db vectors -
sqlite-utils vec.db 'create virtual table vec_vectors using vec0(point float[3]);'
sqlite-utils vec.db 'insert into vec_vectors(rowid, point) select rowid, point from vectors'

That's a 48KB file:

ls -lah vec.db 
-rw-r--r--  1 simon  wheel    48K Aug 15 09:26 vec.db

Datasette cannot currently load it (without its own plugin or the --load-extension option):

datasette vec.db
Usage: datasette serve [OPTIONS] [FILES]...
Try 'datasette serve --help' for help.

Error: Connection to vec.db failed check: no such module: vec0

This works:

datasette vec.db --load-extension "$(python -c 'import sqlite_vec; print(sqlite_vec.loadable_path())')"
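
The same check also works from Python once the extension is loaded into the connection - a quick sanity check, assuming the sqlite_vec package is installed:

import sqlite3

import sqlite_vec

conn = sqlite3.connect("vec.db")
conn.enable_load_extension(True)
sqlite_vec.load(conn)  # registers the vec0 module on this connection
conn.enable_load_extension(False)

# This is the query that fails inside check_connection() when vec0 is missing
print(conn.execute("PRAGMA table_info(vec_vectors)").fetchall())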

simonw commented 1 month ago

This conflicts slightly with Datasette's existing design with respect to SpatiaLite: currently Datasette shows you a useful error if you try to load a SpatiaLite database without activating the extension:

datasette tests/spatialite.db
Usage: datasette serve [OPTIONS] [FILES]...
Try 'datasette serve --help' for help.

Error: It looks like you're trying to load a SpatiaLite database without first loading the SpatiaLite module.

Try adding the --load-extension=spatialite option.

Read more: https://docs.datasette.io/en/stable/spatialite.html

simonw commented 1 month ago

One solution: turn this into a warning that gets dumped to the console but Datasette still starts up as usual.

We could even expand that a bit, so it knows about a few other extensions (like sqlite-vec) and can output a version of that helpful message for each of them as well.
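
A rough sketch of that idea - the mapping from module name to hint is invented here for illustration:

KNOWN_MODULES = {
    "VirtualSpatialIndex": (
        "It looks like you're trying to load a SpatiaLite database.\n"
        "Try adding the --load-extension=spatialite option."
    ),
    "vec0": (
        "It looks like this database uses sqlite-vec.\n"
        "Try adding --load-extension with the path to the vec0 extension."
    ),
}


def warning_for_missing_module(error_message):
    # error_message is e.args[0] from the sqlite3.OperationalError
    if not error_message.startswith("no such module: "):
        return None
    module = error_message[len("no such module: "):]
    hint = KNOWN_MODULES.get(module, "")
    return ("Warning: " + error_message + "\n" + hint).rstrip()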

simonw commented 1 month ago

Relevant code: https://github.com/simonw/datasette/blob/05dfd34fd0dff34b64fb47e0dd1716c8bdbddfac/datasette/cli.py#L791-L825

Which runs check_connection():

https://github.com/simonw/datasette/blob/05dfd34fd0dff34b64fb47e0dd1716c8bdbddfac/datasette/utils/__init__.py#L965-L981

simonw commented 1 month ago

Design options:

  1. Any tables that cannot be read by PRAGMA table_info() are excluded by Datasette entirely - it's as if they do not exist
  2. Those tables are listed as "unreadable" but are still shown somewhere in the Datasette interface or the Datasette API
  3. Tables are invisible within Datasette but are listed in messages dumped to stderr

The second option is nicer, but requires more design decisions: where should that message go? How should we explain it to users?

So I think option 3 is the way to go for the moment.
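
A rough sketch of option 3, assuming the unreadable tables are collected once when the database is scanned (the function name is illustrative, and table names aren't escaped here):

import sys
import sqlite3


def find_unreadable_tables(conn):
    # Probe every table with PRAGMA table_info() and collect the ones that
    # fail because their virtual table module isn't loaded.
    unreadable = []
    tables = [
        r[0]
        for r in conn.execute(
            "select name from sqlite_master where type = 'table'"
        ).fetchall()
    ]
    for table in tables:
        try:
            conn.execute("PRAGMA table_info([{}])".format(table))
        except sqlite3.OperationalError as e:
            if str(e).startswith("no such module: "):
                print("Skipping {}: {}".format(table, e), file=sys.stderr)
                unreadable.append(table)
            else:
                raise
    return unreadable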

simonw commented 1 month ago

OK, this is actually a lot of work to implement. I tried this so far:

diff --git a/datasette/utils/__init__.py b/datasette/utils/__init__.py
index 073d6e86..5469252b 100644
--- a/datasette/utils/__init__.py
+++ b/datasette/utils/__init__.py
@@ -19,6 +19,7 @@ import time
 import types
 import secrets
 import shutil
+import sys
 from typing import Iterable, List, Tuple
 import urllib
 import yaml
@@ -978,7 +979,8 @@ def check_connection(conn):
             if e.args[0] == "no such module: VirtualSpatialIndex":
                 raise SpatialiteConnectionProblem(e)
             else:
-                raise ConnectionProblem(e)
+                print(str(e), file=sys.stderr)
+                # raise ConnectionProblem(e)

 class BadMetadataError(Exception):
diff --git a/datasette/utils/internal_db.py b/datasette/utils/internal_db.py
index 626dd137..f5ae68b2 100644
--- a/datasette/utils/internal_db.py
+++ b/datasette/utils/internal_db.py
@@ -1,5 +1,6 @@
+import sys
 import textwrap
-from datasette.utils import table_column_details
+from datasette.utils import table_column_details, sqlite3

 async def init_internal_db(db):
@@ -137,7 +138,11 @@ async def populate_schema_tables(internal_db, db):
             tables_to_insert.append(
                 (database_name, table_name, table["rootpage"], table["sql"])
             )
-            columns = table_column_details(conn, table_name)
+            try:
+                columns = table_column_details(conn, table_name)
+            except sqlite3.OperationalError as ex:
+                print(f"Error accessing table {table_name}: {str(ex)}", file=sys.stderr)
+                continue
             columns_to_insert.extend(
                 {
                     **{"database_name": database_name, "table_name": table_name},

And it's revealing that there are MANY places in Datasette that might attempt to read the columns from one of these tables and run into an error.

For example, running the above patch with datasette tests/vec.db --pdb I got this traceback when trying to load the homepage:

INFO:     Uvicorn running on http://127.0.0.1:8002 (Press CTRL+C to quit)
Error accessing table vec_vectors: no such module: vec0
ERROR: conn=<sqlite3.Connection object at 0x110467140>, sql = 'select count(*) from [vec_vectors]', params = None: no such module: vec0
> /Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py(640)table_column_details()
-> for r in conn.execute(
(Pdb) c
Traceback (most recent call last):
  File "/Users/simon/Dropbox/Development/datasette/datasette/app.py", line 1745, in route_path
    response = await view(request, send)
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/base.py", line 184, in view
    return await self.dispatch_request(request)
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/base.py", line 138, in dispatch_request
    response = await handler(request)
  File "/Users/simon/Dropbox/Development/datasette/datasette/views/index.py", line 68, in get
    table_columns = await db.table_columns(table)
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 424, in table_columns
    return await self.execute_fn(lambda conn: table_columns(conn, table))
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 283, in execute_fn
    return await asyncio.get_event_loop().run_in_executor(
  File "/Users/simon/.pyenv/versions/3.10.4/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 281, in in_thread
    return fn(conn)
  File "/Users/simon/Dropbox/Development/datasette/datasette/database.py", line 424, in <lambda>
    return await self.execute_fn(lambda conn: table_columns(conn, table))
  File "/Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py", line 632, in table_columns
    return [column.name for column in table_column_details(conn, table)]
  File "/Users/simon/Dropbox/Development/datasette/datasette/utils/__init__.py", line 640, in table_column_details
    for r in conn.execute(
sqlite3.OperationalError: no such module: vec0
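
For instance, the db.table_columns() call at the bottom of that traceback would need its own guard. A hypothetical helper for that one call site, just to show the shape of the fix:

import sqlite3


async def table_columns_or_none(db, table):
    # Like db.table_columns(), but returns None when the table's virtual
    # table module (e.g. vec0) isn't loaded, so callers can skip the table.
    try:
        return await db.table_columns(table)
    except sqlite3.OperationalError as e:
        if str(e).startswith("no such module: "):
            return None
        raise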

I don't think I can fix this for the 1.0a15 alpha.

simonw commented 1 month ago

This would be a bit easier if I moved all of the code in Datasette that loops through a list of tables over to consulting the internal database catalog tables instead. Then I could include these unreadable tables there, but mark them with an "unreadable = 1" column which could be used to filter them out.
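
A hypothetical query against that catalog, assuming a new unreadable column were added to the internal catalog_tables table (neither the column nor this helper exists yet):

async def visible_tables(internal_db, database_name):
    # Sketch: list only the tables that could actually be introspected,
    # relying on a hypothetical "unreadable" column in catalog_tables.
    result = await internal_db.execute(
        """
        select table_name
        from catalog_tables
        where database_name = :database_name
          and unreadable = 0
        """,
        {"database_name": database_name},
    )
    return [row["table_name"] for row in result.rows]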