sqlalchemy / sqlalchemy

The Database Toolkit for Python
https://www.sqlalchemy.org
MIT License
9.44k stars 1.41k forks source link

Behavior of SQL compiler's "visit_update" on SQLAlchemy 1.4 #5779

Closed amotl closed 3 years ago

amotl commented 3 years ago

Dear Mike,

first of all, I want to salute you for all your efforts on the upcoming SQLAlchemy 1.4 version and all the exciting features it will bring to the table, specifically the transparent SQL compilation caching as outlined at [0,1] as well as its support for Asynchronous I/O [2].

At https://github.com/crate/crate-python/pull/391, I am currently trying to adjust the SQLAlchemy Dialect for CrateDB [3] to the new architecture of SQLAlchemy 1.4. While navigating the code base of tree/rel_1_4_0b1, I have come pretty far right now.

However, I am struggling at one specific place, so I am humbly asking if you could spend a minute to help me back on track.

We need to rewrite parameters to SQL UPDATE expressions to enable partial updates like outlined at [4] and we are doing that by a) hooking into the before_execute event and by b) amending the SQL compiler's visit_update() method at the place where the SET clause element is constructed [5]. At a), the parameters are rewritten and at b), the rewritten parameters are applied by modifying the SET clause.

Now, after I pretty much followed through different changes including the new mangling of multiparams vs. params [6,7], I am now hitting a road block when trying to access update_stmt.parameters within visit_update(). Before, the whole dictionary of - already rewritten - parameters has been available there like

update_stmt.parameters: {"data['nested']": {'x': 3, 'y': {'z': 2}}, 'characters_name': 'Trillian'}

so we have been able to distill that into

["data['nested'] = ?"]

to be used as the rewritten parameter name on the left hand side and its rewritten parameter value on the right hand side in order to modify the SET clause like [5]:

for k, v in update_stmt.parameters:
    if isinstance(k, str) and '[' in k:
        bindparam = sa.sql.bindparam(k, v)
        clause = k + ' = ' + self.process(bindparam)
        set_clauses.append(clause)

Now, SQLAlchemy 1.4 will croak with

AttributeError: 'Update' object has no attribute 'parameters'

I already tried to find the parameters dictionary through compile_state._multi_parameters, compile_state._dict_parameters, update_stmt._multi_values or other attributes within both compile_state as well as update_stmt, but all of them are just empty. The closest thing I have been able to find is self.column_keys: ['characters_name', "data['nested']"]], but these are only the parameter names and I believe we need the full parameter dictionary there.

Apparently, I am too blind here to get hold of where I might be able to find the parameter dictionary back, even after getting an idea from both dir(compile_state), dir(update_stmt) and familiarizing myself with SQLAlchemy's updated code base.

Maybe you already have quick idea what we are doing wrong or a suggestion in mind which path we should follow.

Thank you very much in advance, keep up the spirit and with kind regards, Andreas.

[0] https://docs.sqlalchemy.org/en/14/changelog/migration_14.html#transparent-sql-compilation-caching-added-to-all-dql-dml-statements-in-core-orm [1] https://docs.sqlalchemy.org/en/14/core/connections.html#sql-caching [2] https://docs.sqlalchemy.org/en/14/orm/extensions/asyncio.html [3] https://github.com/crate/crate-python/tree/0.26.0/src/crate/client/sqlalchemy [4] https://github.com/crate/crate-python/blob/0.26.0/src/crate/client/sqlalchemy/compiler.py#L32-L46 [5] https://github.com/crate/crate-python/blob/0.26.0/src/crate/client/sqlalchemy/compiler.py#L324-L333 [6] https://github.com/sqlalchemy/sqlalchemy/blob/rel_1_4_0b1/lib/sqlalchemy/engine/base.py#L1236-L1239 [7] https://github.com/sqlalchemy/sqlalchemy/blob/rel_1_4_0b1/lib/sqlalchemy/engine/base.py#L1257-L1260

zzzeek commented 3 years ago

hey there Andreas -

you are on the right track looking in compile_state, do you have your newer code that uses this? I dont see compile_state referenced in https://github.com/crate/crate-python/blob/0.26.0/src/crate/client/sqlalchemy/compiler.py#L324-L333 .

the beginning of a complete visit_update that isn't using the "super" version would have to start like this:

def visit_update(self, update_stmt, **kw):
    compile_state = update_stmt._compile_state_factory(
        update_stmt, self, **kw
    )
    update_stmt = compile_state.statement

I would recommend taking the UPDATE statement that's not working, try it out on a SQLite database, and put a pdb.set_trace() in SQLalchemy compiler.py -> visit_update(), then poke around that compile_state to see all the state that is present. that should all be the same if the dialect is then a crate dialect.

Also, for the before_execute() thing, if you want to rewrite paramters, a much better place for a dialect to do that is in dialect.do_execute() and dialect.do_executemany() (will work in 1.3 also). See for example psycopg2 do_executemany() that does a bunch of rewriting. https://github.com/sqlalchemy/sqlalchemy/blob/master/lib/sqlalchemy/dialects/postgresql/psycopg2.py#L905

amotl commented 3 years ago

Hi Mike,

thanks for your quick answer and for sharing valuable insights.

Do you have your newer code that uses this?

Yes, I just now updated https://github.com/crate/crate-python/pull/391, which was too dirty before when I was writing my initial post.

One step forward

Within the CrateDB-specific amendment to visit_update_14(), I am now using compile_state._dict_parameters instead of update_stmt.parameters which already makes some of the more basic tests like test_update_with_dict_column pass. I don't know why it didn't work before, but by poking again using pdb.set_trace(), I have been able to reassure myself that compile_state._dict_parameters was indeed the right thing to use here. Maybe some other updated bits of the code were misplaced. Thanks for this suggestion - that helped!

My conclusion on that is that the SQL building within CrateCompiler(compiler.SQLCompiler) works reasonably now.

But still

However, more involved tests like test_nested_object_change_tracking are still failing, mostly because compiled statement caching will be involved here. Currently, that croaks around

  File "/path/to/sqlalchemy/tests/dict_test.py", line 389, in test_nested_object_change_tracking
    session.commit()

[...]

  File "/path/to/sqlalchemy/engine/base.py", line 1304, in _execute_clauseelement
    compiled_sql, extracted_params, cache_hit = elem._compile_w_cache(
  File "/path/to/sqlalchemy/sql/elements.py", line 519, in _compile_w_cache
    compiled_sql = compiled_cache.get(key)
  File "/path/to/sqlalchemy/util/_collections.py", line 876, in get
    item = dict.get(self, key, default)

TypeError: unhashable type: 'dict'

This is most probably happening because we have a dictionary on the rvalue side, which appears within the cache key as ({"data['nested']": {'x': 3, 'y': {'z': 2}}, 'characters_name': 'Trillian'},), as seen through pdb.set_trace() at sqlalchemy.util._collections.LRUCache.get:

(Pdb) display key
display key: (<crate.client.sqlalchemy.dialect.CrateDialect object at 0x1041abe50>, ('0', <class 'sqlalchemy.sql.dml.Update'>, 'table', (Table('characters', MetaData(bind=Engine(crate:///?cratedb-version=4.3.2)), Column('name', String(), table=<characters>, primary_key=True, nullable=False), Column('age', Integer(), table=<characters>), Column('data', _Craty(), table=<characters>), Column('data_list', _ObjectArray(), table=<characters>), schema=None),), '_where_criteria', (('1', <class 'sqlalchemy.sql.elements.BooleanClauseList'>, 'clauses', (('2', <class 'sqlalchemy.sql.elements.BinaryExpression'>, 'left', ('3', <class 'sqlalchemy.sql.schema.Column'>, 'name', 'name', 'type', (<class 'sqlalchemy.sql.sqltypes.String'>, ('length', None), ('collation', None)), 'table', (Table('characters', MetaData(bind=Engine(crate:///?cratedb-version=4.3.2)), Column('name', String(), table=<characters>, primary_key=True, nullable=False), Column('age', Integer(), table=<characters>), Column('data', _Craty(), table=<characters>), Column('data_list', _ObjectArray(), table=<characters>), schema=None),)), 'right', ('4', <class 'sqlalchemy.sql.elements.BindParameter'>, (<class 'sqlalchemy.sql.sqltypes.String'>, ('length', None), ('collation', None)), 'characters_name'), 'operator', <built-in function eq>, 'negate', <built-in function ne>, 'type', (<class 'sqlalchemy.sql.sqltypes.Boolean'>, ('name', None), ('create_constraint', False))),), 'operator', <built-in function and_>),)), ({"data['nested']": {'x': 3, 'y': {'z': 2}}, 'characters_name': 'Trillian'},), False, False)

After I found that sqlalchemy.sql.element.ClauseElement._compile_w_cache would also gracefully work when compiled_cache was None, I already tried to opt-out of caching by forcing compiled_cache = None within sqlalchemy.engine.base.Connection._execute_clauseelement(), but did not have prompt success with that:

sqlalchemy.exc.ArgumentError: SET/VALUES column expression or string key expected, got {"data['nested']": {'x': 3, 'y': {'z': 2}}, 'characters_name': 'Trillian'}

My conclusion on that is that SQLAlchemy's updated parameter juggling for our non-scalar rvalues is not yet happy a) with the compiled SQL caching and b) maybe also other things beyond that. If you have any idea how to get away with this, I am all ears.

Favor DialectEvents over ConnectionEvents

Also, for the before_execute() thing, if you want to rewrite parameters, a much better place for a dialect to do that is in dialect.do_execute() and dialect.do_executemany() (will work in 1.3 also).

Thank you for this suggestion. I will try to improve the situation by moving the parameter rewriting to dialect.do_executeX() instead of before_execute(). My colleagues probably followed the documentation of before_execute, which states:

This event is good for debugging SQL compilation issues as well as early manipulation of the parameters being sent to the database, as the parameter lists will be in a consistent format here.

in favor of DialectEvents.do_execute, because DialectEvents claimed that

DialectEvents hooks should be considered semi-public and experimental. These hooks are not for general use and are only for those situations where [...]

So, I am very happy to learn otherwise that (nomen est omen), using DialectEvents here would be better suited for our use case in general.

Thoughts

Maybe I should explore parameter rewriting on the DialectEvents road and all the hassle I've outlined above will just go away. Do you actually see a chance for that happening?

I am currently not sure because I am inclined to believe that we are already successful a) past the rewriting phase and b) past the SQL statement building phase and now hit the wall at how parameters are traveling through the framework, where our non-scalar rvalues eventually conflict with the newly introduced SQL compiled statement caching.

Thank you in advance and with kind regards, Andreas.

zzzeek commented 3 years ago

But still

However, more involved tests like test_nested_object_change_tracking are still failing, mostly because compiled statement caching will be involved here. Currently, that croaks around

  File "/path/to/sqlalchemy/tests/dict_test.py", line 389, in test_nested_object_change_tracking
    session.commit()

[...]

  File "/path/to/sqlalchemy/engine/base.py", line 1304, in _execute_clauseelement
    compiled_sql, extracted_params, cache_hit = elem._compile_w_cache(
  File "/path/to/sqlalchemy/sql/elements.py", line 519, in _compile_w_cache
    compiled_sql = compiled_cache.get(key)
  File "/path/to/sqlalchemy/util/_collections.py", line 876, in get
    item = dict.get(self, key, default)

TypeError: unhashable type: 'dict'

This is most probably happening because we have a dictionary on the rvalue side, which appears within the cache key as ({"data['nested']": {'x': 3, 'y': {'z': 2}}, 'characters_name': 'Trillian'},), as seen through pdb.set_trace() at sqlalchemy.util._collections.LRUCache.get:

I don't see why the values would have any impact on a cache key. Can you please provide a simple three line example of the following form; values that are of a dicitonary / nested format are not used in the cache key since these are values not part of the SQL construct:

from sqlalchemy import table, column, String

t1 = table("some_table", column("q", String))
stmt = t1.update().values(q={"foo": {"nested": "bar"}})
print(stmt._generate_cache_key())

if you can give me a similar self-contained example as above with your failure I can understand it better. the cache key generation does not interact with a dialect at all.

Favor DialectEvents over ConnectionEvents

Also, for the before_execute() thing, if you want to rewrite parameters, a much better place for a dialect to do that is in dialect.do_execute() and dialect.do_executemany() (will work in 1.3 also).

Thank you for this suggestion. I will try to improve the situation by moving the parameter rewriting to dialect.do_executeX() instead of before_execute(). My colleagues probably followed the documentation of before_execute, which states:

This event is good for debugging SQL compilation issues as well as early manipulation of the parameters being sent to the database, as the parameter lists will be in a consistent format here.

in favor of DialectEvents.do_execute, because DialectEvents claimed that

DialectEvents hooks should be considered semi-public and experimental. These hooks are not for general use and are only for those situations where [...]

OK slight misunderstanding, you should not have to use any events, that is, no DialectEvents or ConnectionEvents at all; the code you are writing is a dialect. You would implement simple methods do_execute() and do_executemany() as methods of your class CrateDialect right here: https://github.com/crate/crate-python/blob/master/src/crate/client/sqlalchemy/dialect.py#L160 ; these override the default methods that are on DefaultDialect. dialects shouldnt use events for these kinds of things, the dialect implements the actual way that these things happen without any events riding on them.

Maybe I should explore parameter rewriting on the DialectEvents road and all the hassle I've outlined above will just go away. Do you actually see a chance for that happening?

nope, no DialectEvents, no ConnectionEvents, no events. you're the dialect!

I am currently not sure because I am inclined to believe that we are already successful a) past the rewriting phase and b) past the SQL statement building phase and now hit the wall at how parameters are traveling through the framework, where our non-scalar rvalues eventually conflict with the newly introduced SQL compiled statement caching.

if you can show me what you're doing with update(), it sounds like somehting here might be off as well, if you are somehow changing the structure of the object or something like that, not sure.

zzzeek commented 3 years ago

let me clarify that cache key above has no dictionaries in it:

from sqlalchemy import table, column, String

t1 = table("some_table", column("q", String))

stmt = t1.update().values(q={"foo": {"nested": "bar"}})

ck = stmt._generate_cache_key()

print(ck)

d = {ck[0]: True}

assert d[ck[0]] is True
amotl commented 3 years ago

Dear Mike,

thank you very much again for clarifying on my misunderstanding and for providing more insights. They are really important to me in order to properly take care about the evolution of CrateDB's SQLAlchemy dialect.

You should not have to use any events, that is, no DialectEvents or ConnectionEvents at all; the code you are writing is a dialect. Dialects shouldn't use events for [parameter manipulation]. You're the dialect!

All right, got you! So, I should probably first add a patch to rewrite the dialect to take that into account before even thinking about adding the patch to support SQLAlchemy 1.4. It might save us some hassle.

If you can show me what you're doing with update(), it sounds like something here might be off as well, if you are somehow changing the structure of the object or something like that, not sure.

As far as I have analyzed the CrateCompiler, I believe [1] is the essence of what crate-python amends to the vanilla SQLAlchemy's SQLCompiler within visit_update. Essentially, the dialect is adjusting the SET clause by putting those parameters into set_clauses which have been amended by hooking into the before_execute event beforehand.

We can now forget about the amendments to visit_insert, they have only been needed for early versions of CrateDB < 1.0.0. Please keep in mind that the dialect is around for a long time already and apparently still supports all of SQLAlchemy 1.1, 1.2 and 1.3. Saying that, we might take the chance to consider dropping support for the older versions with the current overhaul.

So, again, after learning from your suggestions that the dialect shouldn't use the event subsystem at all, it might well be something off here. On the other hand, I am not fully into the details yet and my former colleagues might have needed to do it this way in order to adjust the parameter nesting to support partial updates as outlined within [2]. Maybe this is not possible to achieve by only implementing the do_execute() and do_executemany() methods directly on the dialect?

I don't see why the values would have any impact on a cache key. The cache key generation does not interact with a dialect at all.

It is probably just me trying to do my best to a) understand the dialect implementation and b) implement support for SQLAlchemy 1.4 while still learning about the details of the machinery. Most probably there's something been put in the wrong slot coming from the current way of messing with the internals. I just wanted to share my observations so thanks again for your patience and your suggestions what the dialect might have been doing wrong/not correctly already and for outlining better ways to do things.

Can you please provide a simple self-contained three line example?

Sure, I will try. Please give me a little time to come back to this.

With kind regards, Andreas.

[1] https://github.com/crate/crate-python/blob/0.26.0/src/crate/client/sqlalchemy/compiler.py#L318-L335 [2] https://github.com/crate/crate-python/blob/0.26.0/src/crate/client/sqlalchemy/compiler.py#L32-L70

zzzeek commented 3 years ago

the dialect-level do_execute() and do_executemany() methods are the last stop in the dialect before the string statement is sent to the DBAPI. so while there are things that you wouldn't want to do there for efficiency purposes, like creating the string form of an Update() object from scratch, you could theoretically do everything there.

however, if you want to just change the string form of how Update() or Insert() look, usually you do that with the compiler visit_update / visit_insert methods as you are doing.

then, within the dialect, the usual places that dialects hook into are:

ExecutionContext.create_cursor() - if there are special things happening with cursors ExecutionContext.pre_exec() - setting up special cursor state before the statement is executed ExecutionContext.post_exec() -cursor-level asks after a statement is executed. in 1.4 only, this can also be used to set up custom resultset implementations (in 1.3 it's ExecutionContext.get_result_proxy()) Dialect.do_execute() - control how cursor.execute() is called Dialect.do_executemany() - control how cursor.executemany() is called

I would advise taking a look at the many dialects we have to see different ways these methods are used: https://github.com/sqlalchemy/sqlalchemy/tree/master/lib/sqlalchemy/dialects/