simonw / sqlite-utils

Python CLI utility and library for manipulating SQLite databases
https://sqlite-utils.datasette.io
Apache License 2.0
1.67k stars 111 forks source link

Async support #242

Open eyeseast opened 3 years ago

eyeseast commented 3 years ago

Following our conversation last week, want to note this here before I forget.

I've had a couple situations where I'd like to do a bunch of updates in an async event loop, but I run into SQLite's issues with concurrent writes. This feels like something sqlite-utils could help with.

PeeWee ORM has a SQLite write queue that might be a good model. It's using threads or gevent, but I think that approach would translate well enough to asyncio.

Happy to help with this, too.

simonw commented 3 years ago

Datasette has its own implementation of a write queue for exactly this purpose - and there's no reason at all that should stay in Datasette rather than being extracted out and moved over here to sqlite-utils.

One small concern I have is around the API design. I'd want to keep supporting the existing synchronous API while also providing a similar API with await-based methods.

What are some good examples of libraries that do this? I like how https://www.python-httpx.org/ handles it, maybe that's a good example to imitate?

simonw commented 3 years ago

Another option here would be to add https://github.com/omnilib/aiosqlite/blob/main/aiosqlite/core.py as a dependency - it's four years old now and actively marinated, and the code is pretty small so it looks like a solid, stable, reliable dependency.

eyeseast commented 3 years ago

I think HTTPX gets it exactly right, with a clear separation between sync and async clients, each with a basically identical API. (I'm about to switch feed-to-sqlite over to it, from Requests, to eventually make way for async support.)

simonw commented 3 years ago

I have a hunch this is actually going to be quite difficult, due to the internal complexity of some of the sqlite-utils API methods.

Consider db[table].extract(...) for example. It does a whole bunch of extra queries inside the method - each of those would need to be turned into an await call for the async version. Here's the method body today:

https://github.com/simonw/sqlite-utils/blob/09c3386f55f766b135b6a1c00295646c4ae29bec/sqlite_utils/db.py#L1060-L1152

Writing this method twice - looking similar but with await ... tucked in before every internal method it calls that needs to execute SQL - is going to be pretty messy.

One thing that would help a LOT is figuring out how to share the majority of the test code. If the exact same tests could run against both the sync and async versions with a bit of test trickery, maintaining parallel implementations would at least be a bit more feasible.

simonw commented 3 years ago

Here's a really wild idea: I wonder if it would be possible to run a source transformation against either the sync or the async versions of the code to produce the equivalent for the other paradigm?

Could that even be as simple as a set of regular expressions against the await ... version that strips out or replaces the await and async def and async for statements?

If so... I could maintain just the async version, generate the sync version with a script and rely on robust unit testing to guarantee that this actually works.

polyrand commented 3 years ago

I had this resource by Seth Michael Larson saved https://github.com/sethmlarson/pycon-async-sync-poster I haven't had a look at it, but it may contain useful info.

On twitter, I mentioned passing an aiosqlite connection during the Database creation. I'm not 100% familiar with the sqlite-utils codebase, so I may be wrong here, but maybe decorating internal functions could be an option? Then they are awaited or not inside the decorator depending on how they are called.

simonw commented 3 years ago

"how to use some new tools to more easily maintain a codebase that supports both async and synchronous I/O and multiple async libraries" - yeah that's exactly what I need, thank you!

simonw commented 3 years ago

unasync is an implementation of the exact pattern I was talking about above - it uses the tokenize module from the Python standard library to apply some clever rules to transform an async codebase into a sync one. https://unasync.readthedocs.io/en/latest/ - implementation here: https://github.com/python-trio/unasync/blob/v0.5.0/src/unasync/__init__.py

simonw commented 3 years ago

From the poster here: https://github.com/sethmlarson/pycon-async-sync-poster/blob/master/poster.pdf

pycon-async-sync-poster_poster_pdf_at_master_·_sethmlarson_pycon-async-sync-poster
simonw commented 3 years ago

Some other interesting background reading: https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.html - in particular see how SQLALchemy has a await conn.run_sync(meta.drop_all) mechanism for running methods that haven't themselves been provided in an async version

simonw commented 3 years ago

Hah or use this trick, which genuinely rewrites the code at runtime using a class decorator! https://github.com/python-happybase/aiohappybase/blob/0990ef45cfdb720dc987afdb4957a0fac591cb99/aiohappybase/sync/_util.py#L19-L32

simonw commented 3 years ago

Microsoft's playwright Python library solves this problem by code generating both their sync AND their async libraries https://github.com/microsoft/playwright-python/tree/master/scripts

eyeseast commented 3 years ago

I've been thinking about this a bit lately, doing a project that involves moving a lot of data in and out of SQLite files, datasette and GeoJSON. This has me leaning toward the idea that something like datasette query would be a better place to do async queries.

I know there's a lot of overlap in sqlite-utils and datasette, and maybe keeping sqlite-utils synchronous would let datasette be entirely async and give a cleaner separation of implementations.