pudo / dataset

Easy-to-use data handling for SQL data stores with support for implicit table creation, bulk loading, and transactions.
https://dataset.readthedocs.org/
MIT License
4.76k stars 297 forks source link

Upgrade SQLAlchemy to version 2 #415

Open M4rque2 opened 1 year ago

M4rque2 commented 1 year ago

All unit tests passed

miraculixx commented 1 year ago

This is great! Solves #411 imho

@pudo Do you think this could be merged?

qbit commented 1 year ago

It will also need a bump in setup.py, no?

brrttwrks commented 1 year ago

Howdy. We are using a library that depends on dataset and would like to see what specifically is preventing this pull request and how we can help git it where it needs to be. Is this pull request adequate or what is holding it back?

pudo commented 1 year ago

Ok, so I've played with this a little bit on #420, and the problem is that this will only work on SQLite, not any DB with proper transactions management. SQLAlchemy has fundamentally gotten rid of autocommit mode, which was underpinning dataset. While there's probably a way of faking autocommit mode on the library level, I wonder if the cleaner approach wouldn't be to actually make transaction management in dataset more explicit. It gets rid of some of the magicness, but would probably lead people towards using tx'en correctly and therefore creating much faster scripts.

pudo commented 1 year ago

The more I dig into this the less certain I am that we're not due a massive re-work of the library in order to harmonize transaction handling and how it intersects with DDL commands. At the moment, the tests are producing a bunch of deadlocks when running against PostgreSQL.

Does anyone here really understand transactions, and how they relate to schema modification?

miraculixx commented 1 year ago

@pudo

I wonder if the cleaner approach wouldn't be to actually make transaction management in dataset more explicit. It gets rid of some of the magicness (...)

I think the default should be autocommit, as that's what (IMHO) one would generally expect from a library like dataset. dataset's advantage is to avoid sqlalchemy's complexity, and I feel making transactions explicit as a default would take away from this. For more transaction control, dataset already provides a respective API, right? https://dataset.readthedocs.io/en/latest/quickstart.html#using-transactions

For the case where full transaction control and speed (and possibly explicit control over DDL) is a driving critiera, I suggest sqlalchemy native is a better choice.

miraculixx commented 1 year ago

@pudo

The more I dig into this the less certain I am that we're not due a massive re-work of the library in order to harmonize transaction handling and how it intersects with DDL commands.

I added my thoughts , hope it helps (I don't know the library well so I may be wrong). Afaik DDL commands cannot be executed inside another transaction, at least that's what the alembic docs on autocommit_block() seem to imply:

This special directive is intended to support the occasional database DDL or system operation that specifically has to be run outside of any kind of transaction block. The PostgreSQL database platform is the most common target for this style of operation, as many of its DDL operations must be run outside of transaction blocks, even though the database overall supports transactional DDL.