sqlalchemy / sqlalchemy

The Database Toolkit for Python
https://www.sqlalchemy.org
MIT License

`op_in` post-coercion for `bindparam`s breaks its usage in pre-compiled queries #8774

Closed AndySemenov closed 1 year ago

AndySemenov commented 1 year ago

Describe the bug

The BindParameter post-coercion here: https://github.com/sqlalchemy/sqlalchemy/blob/9056dc02178905fb901da5c182618fb593331592/lib/sqlalchemy/sql/coercions.py#L885 always converts such parameters into expanding ones, even when that's not what the user expects.

For example, we'd like to pre-compile such a query, and we'd like it to be compiled with a single parameter:

our_q = select(our_table.columns.col_a).where(our_table.columns.col_b.in_(bindparam("input_value")))

We'd like such a query to be pre-compiled into the form:

SELECT col_a FROM our_table WHERE col_b IN :input_value

Such a form is intuitive and its execution is very simple:

engine.execute(text(our_q_compiled), dict(input_value=("a", "b", "c")))

But an expanding BindParameter is always compiled into the form (__[POSTCOMPILE_input_value]), which breaks subsequent execution of such a query directly with engine.execute().

A commonly proposed workaround is to use col_b == func.any(bindparam("input_value")). I haven't assessed the query-planner differences in PostgreSQL for that, but in any case the usage of such queries is counter-intuitive:

engine.execute(text("SELECT col_a FROM our_table WHERE col_b = ANY(:input_value)"), dict(input_value=("a", "b", "c")))

renders:

DataError: (psycopg2.errors.InvalidTextRepresentation) malformed array literal: ...

I understand that a cast expression could wrap the tuple provided as input_value, but, again, it's counter-intuitive (while the rest of the query-builder mechanics is as intuitive as it gets, IMO).
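For illustration, a minimal sketch of that cast-based form, assuming SQLAlchemy 1.4+; the table and parameter names mirror the example in this issue:

```python
# Sketch of the cast workaround mentioned above: wrap the bound parameter
# in an explicit ARRAY cast so PostgreSQL can type the value for ANY().
from sqlalchemy import Unicode, any_, bindparam, cast, column, select, table
from sqlalchemy.dialects import postgresql
from sqlalchemy.dialects.postgresql import ARRAY

tbl = table("our_table", column("col_a", Unicode()), column("col_b", Unicode()))

query = select(tbl.c.col_a).where(
    tbl.c.col_b == any_(cast(bindparam("input_value"), ARRAY(Unicode())))
)

query_text = query.compile(dialect=postgresql.dialect()).string
print(query_text)
```

This renders a single non-expanding placeholder wrapped in a cast, at the cost of the extra ARRAY cast in the SQL text.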

To Reproduce

from sqlalchemy import table, column, Unicode, select, bindparam
from sqlalchemy.dialects import postgresql

tbl = table(
    "our_table",
    column(
        "col_a",
        Unicode(),
    ),
    column(
        "col_b",
        Unicode()
    )
)

query = select(tbl.columns.col_a)\
    .where(tbl.columns.col_b.in_(bindparam("input_values")))

query_text = query.compile(dialect=postgresql.dialect()).string

The expected value of query_text is:

SELECT our_table.col_a 
FROM our_table 
WHERE our_table.col_b IN :input_values

The actual value is:

SELECT our_table.col_a 
FROM our_table 
WHERE our_table.col_b IN (__[POSTCOMPILE_input_values])

Error

This does not raise an error by itself.

Versions

Additional context

What I would propose: add an additional boolean parameter to BindParameter.__init__(), e.g. keep_unexpanded, and skip converting the parameter into an expanding one in the op_in() post-coercion when it is True.

CaselIT commented 1 year ago

Hi,

I think this is out of scope. Compiling a query to get a string and then using it in text() isn't something that's supported or even tested. Also, in most cases the output of the compilation will be dialect-specific, and many dialects do not use :name as the bind-parameter format.

Another thing to consider for this particular case is that most DBAPIs don't support the form col IN :param; they need the individual parameters rendered, as in col IN (:param1, :param2, :paramn). The outlier here is psycopg2.

@zzzeek feel free to reopen if you don't agree

AndySemenov commented 1 year ago

it's not something that's supported or even tested

Works for me. Works like a charm.

Also, in most cases the output of the compilation will be dialect-specific, and many dialects do not use :name as the bind-parameter format.

I've used SQLAlchemy's machinery to generate all the query texts in my project:

PG_DIALECT = postgresql.dialect(paramstyle="named")

Also another thing to consider for this particular case, is that most dbapi don't support the form col in :param, but need the single parameters there to be rendered, as in col in (:param1, :param2, :paramn). the outlier here is psycopg2

Meanwhile psycopg2 follows PostgreSQL's way of handling named parameters. For example, in DataGrip (which has nothing to do with Python), a query

SELECT * FROM my_table WHERE id IN :ids_param

executes with no issues, provided I set ids_param to a value like ('a', 'b').

zzzeek commented 1 year ago

As far as API support, if anything, it would be that you pass expanding=False and that would be an indicator for us not to touch it; I would allow that.

However, you can get close right now. You want the parameter hardcoded to a single position, OK, so:

bindparam("input_values", [None])

then expand that:

query_text = query.compile(
    dialect=postgresql.dialect(), compile_kwargs={"render_postcompile": True}
).string

and you get

SELECT our_table.col_a 
FROM our_table 
WHERE our_table.col_b IN (%(input_values_1)s)

The "_1" is inconvenient, so one way is to search-and-replace that _1 out of the string.
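That search-and-replace step might look like this; a sketch assuming SQLAlchemy 1.4+, reusing the table and parameter names from the reproduction above:

```python
# Sketch: render the single-position expanding parameter with
# render_postcompile, then strip the generated "_1" suffix by hand.
from sqlalchemy import Unicode, bindparam, column, select, table
from sqlalchemy.dialects import postgresql

tbl = table("our_table", column("col_a", Unicode()), column("col_b", Unicode()))

# bindparam carries a one-element placeholder value so exactly one
# position is rendered at compile time
query = select(tbl.c.col_a).where(
    tbl.c.col_b.in_(bindparam("input_values", [None]))
)

query_text = query.compile(
    dialect=postgresql.dialect(), compile_kwargs={"render_postcompile": True}
).string

# %(input_values_1)s -> %(input_values)s
query_text = query_text.replace("input_values_1", "input_values")
print(query_text)
```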

The other way is to flip off the expanding switches after the fact:

from sqlalchemy.sql.visitors import traverse

def reset_expanding(stmt):
    def unset_expanding(element, **kw):
        element.expanding = False

    traverse(stmt, {}, {"bindparam": unset_expanding})

    return stmt

then you get

SELECT our_table.col_a 
FROM our_table 
WHERE our_table.col_b IN %(input_values)s

As @CaselIT said, this syntax does not work on most DBAPIs. It might work on psycopg2, but I don't know if it even works on other PostgreSQL drivers. I had looked into it as a solution we could use for this early on, and I noticed most drivers seemed not to support it.

from sqlalchemy import bindparam
from sqlalchemy import column
from sqlalchemy import select
from sqlalchemy import table
from sqlalchemy import Unicode
from sqlalchemy.dialects import postgresql
from sqlalchemy.sql.visitors import traverse

tbl = table(
    "our_table",
    column(
        "col_a",
        Unicode(),
    ),
    column("col_b", Unicode()),
)

query = select(tbl.columns.col_a).where(
    tbl.columns.col_b.in_(bindparam("input_values"))
)

def reset_expanding(stmt):
    def unset_expanding(element, **kw):
        element.expanding = False

    traverse(stmt, {}, {"bindparam": unset_expanding})

    return stmt

query = reset_expanding(query)

query_text = query.compile(
    dialect=postgresql.dialect()
).string
print(query_text)

CaselIT commented 1 year ago

it's not something that's supported or even tested

Works for me. Works like a charm.

Since an issue was opened, not really, I guess...

Also, in most cases the output of the compilation will be dialect-specific, and many dialects do not use :name as the bind-parameter format.

I've used SQLAlchemy's machinery to generate all the query texts in my project:

PG_DIALECT = postgresql.dialect(paramstyle="named")

Is executing the query directly not an option?

Another thing to consider for this particular case is that most DBAPIs don't support the form col IN :param; they need the individual parameters rendered, as in col IN (:param1, :param2, :paramn). The outlier here is psycopg2.

Meanwhile psycopg2 follows PostgreSQL's way of handling named parameters. For example, in DataGrip (which has nothing to do with Python), a query

SELECT * FROM my_table WHERE id IN :ids_param

executes with no issues, provided I set ids_param to a value like ('a', 'b').

That's a particularity of psycopg2, which does parameter binding client-side. PostgreSQL will reject such a query if server-side binding is used. If you try psycopg 3 or asyncpg, they will both error (in psycopg's case, the error is reported from PostgreSQL):

with e.connect() as c:
    print(c.execute(sa.text('select 2 where 2 in :x'), {'x': (1,2,3)}).all())
# asyncpg
 <class 'asyncpg.exceptions.PostgresSyntaxError'>: syntax error at or near "$1"
[SQL: select 2 where 2 in %s]
[parameters: ((1, 2, 3),)]
# psycopg3 (the error comes from postgresql)
 (psycopg.errors.SyntaxError) syntax error at or near "$1"
LINE 1: select 2 where 2 in $1
                            ^
[SQL: select 2 where 2 in %(x)s]
[parameters: {'x': (1, 2, 3)}]
CaselIT commented 1 year ago

@zzzeek a better workaround is to use a custom op('IN'). The following will work as expected:

our_q = select(our_table.columns.col_a).where(our_table.columns.col_b.op('IN')(bindparam("input_value")))

Edit: bool_op is probably better; it even gets the typing right.
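A minimal sketch of that custom-operator workaround, assuming SQLAlchemy 1.4+ and reusing the names from the reproduction above:

```python
# Sketch: bool_op("IN") emits the operator text verbatim and leaves the
# bound parameter as a single, non-expanding placeholder.
from sqlalchemy import Unicode, bindparam, column, select, table
from sqlalchemy.dialects import postgresql

tbl = table("our_table", column("col_a", Unicode()), column("col_b", Unicode()))

query = select(tbl.c.col_a).where(
    tbl.c.col_b.bool_op("IN")(bindparam("input_values"))
)

query_text = query.compile(dialect=postgresql.dialect()).string
print(query_text)
```

Because the parameter never goes through the op_in() post-coercion, no POSTCOMPILE token appears in the output.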

It might work on psycopg2, but I don't know if it even works on other PostgreSQL drivers. I had looked into it as a solution we could use for this early on, and I noticed most drivers seemed not to support it.

No, PostgreSQL will reject it for any DBAPI that does server-side binding.

AndySemenov commented 1 year ago

is executing the query directly not an option?

No, it's not, since the environment is one-shot, so it would mean running statement-to-string compilation every time we need to call almost every query. Moreover, even instantiating the ORM data models on import would add to that count. And that takes extremely long compared to having a ready-to-go SQL query text compiled at artifact build and deployment time. There's no code using the ORM at runtime, though we use the ORM extensively just to ease the code's maintainability.

Anyway. Thanks. I'll try to find another way of handling the cases where the IN keyword looks like the obvious solution.

zzzeek commented 1 year ago

No, it's not since the environment is a one-shot so it would lead to executing statement-to-string compilation every time we need to call almost every query.

meaning, there's no opportunity for the statements to be cached? How many statements is the script running (which we assume are all unique statements, so no caching benefit) and what performance issues are you observing?

Moreover, even ORM data models instantiation upon imports would add into that count.

I don't see how ORM models need to be involved. You can execute with Core. The compilation overhead is not very bad, and it's also cached.

And that takes extremely long compared to having just a ready-to-go SQL query text compiled upon artifact compilation and deployment.

We have caching, so... I'm not understanding where the performance issue is being observed.

AndySemenov commented 1 year ago

meaning, there's no opportunity for the statements to be cached? How many statements is the script running (which we assume are all unique statements, so no caching benefit) and what performance issues are you observing?

I guess you are speaking of a statement compilation result being cached somewhere deep inside the properties of the statement object (like Executable) after it has first been executed and hence compiled.

Our environment starts, executes, and dies. We can't cache the objects' compilation results at runtime. We also don't want to bring in custom logic with something like Redis to cache the pickled objects (the logic would get too complex, since there's no easy way to tell whether the code for an Executable introduced a change or a fix relative to the one pickle-stored in the cache); it would just add another point of failure and another line to the bill.

We pay for the CPU time consumed. So we can tolerate importing SQLAlchemy for its huge set of features easing DB connection management and executing queries in the DB, with or without transactions, and those pretty context managers around them... But if we could lower the time spent importing ORM models, settling MetaData with them, and compiling every query to be executed, we'd prefer that. For now this works perfectly for us (except for the IN clause, which is not an easy thing in either of the languages, as I've figured out after some assessment), since we've managed to add a query-compilation step to our dev/build/deploy scripts.

So, although I don't bring in timing tests, it's still obvious that

from a import b
from c import d

will take more CPU time than

from a import b

provided a does not implicitly import c. In Python, at least in CPython, which is our case.

zzzeek commented 1 year ago

meaning, there's no opportunity for the statements to be cached? How many statements is the script running (which we assume are all unique statements, so no caching benefit) and what performance issues are you observing?

I guess you are speaking of a statement compilation result being cached somewhere deep inside the properties of the statement object (like Executable) after it has first been executed and hence compiled.

it's cached within the Engine, see https://docs.sqlalchemy.org/en/14/core/connections.html#sql-compilation-caching

Our environment starts, executes, and dies. We can't cache the objects' compilation results at runtime. We also don't want to bring in custom logic with something like Redis to cache the pickled objects (the logic would get too complex, since there's no easy way to tell whether the code for an Executable introduced a change or a fix relative to the one pickle-stored in the cache); it would just add another point of failure and another line to the bill.

We pay for the CPU time consumed. So we can tolerate importing SQLAlchemy for its huge set of features easing DB connection management and executing queries in the DB, with or without transactions, and those pretty context managers around them... But if we could lower the time spent importing ORM models, settling MetaData with them, and compiling every query to be executed, we'd prefer that.

OK so, that means your architecture spins up the entire Python interpreter, which is massively expensive compared to SQL compilation overhead, but IIUC the actual expense you want to avoid is the import of the ORM models themselves, so SQL compilation overhead is not the problem, ORM models are. Is that correct? That's a less alarming answer, then. There are other ways to do this, like using table() / column() objects, but that's likely not any easier than a small workaround for IN expressions. If/when you do change drivers, you will need to change your approach.

AndySemenov commented 1 year ago

OK so, that means your architecture spins up the entire Python interpreter, which is massively expensive compared to SQL compilation overhead

We don't pay for the interpreter spin-up, and we are kinda assured that spin-up takes very little time. But we do pay for the CPU time from the moment the interpreter reports it's up and ready to run the code. From then on, we avoid compiling the queries, importing ORM models, and even building the queries with the Core query builders.

We still don't want to use Core table() and column(), since they give IDEs no refactoring feedback, so refactoring data models becomes a pain. We generate a bunch of Python files containing something like:

from sqlalchemy import text

ThisFlowQuery = text("SELECT ...")
ThatFlowQuery = text("INSERT ...")

And since compiling the queries into text does the trick, we are good: the only thing we need to do at runtime is:


from db.engine import ENGINE_CONFIGURED
from db.queries.flow_1 import ThisFlowQuery, ThatFlowQuery

with ENGINE_CONFIGURED.begin() as tr:
  tr.execute(ThisFlowQuery, dict(param1=value1, param2=value2))
  ...

That's why Executable-to-SQL-string compilation is a good feature. And if it's not impossible, having it work 100% in SQLAlchemy would be a killer feature. It works for most of our cases, but it could provide even more benefit if we didn't need to wrap some static parameters with literal_column() to prevent :anon_param_1 or the like from spawning in the compiled text.
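The literal_column() trick mentioned here can be sketched as follows; a minimal example assuming SQLAlchemy 1.4+, with illustrative column and value names, using the same named-paramstyle dialect as earlier in the thread:

```python
# Sketch: a plain Python value compiles to an anonymous bind parameter,
# while literal_column() inlines its SQL text verbatim.
from sqlalchemy import Unicode, column, literal_column, select, table
from sqlalchemy.dialects import postgresql

PG_DIALECT = postgresql.dialect(paramstyle="named")

tbl = table("our_table", column("col_a", Unicode()), column("col_b", Unicode()))

# Plain value: renders an anonymous parameter such as :col_b_1
with_param = select(tbl.c.col_a).where(tbl.c.col_b == "fixed")
with_param_text = with_param.compile(dialect=PG_DIALECT).string
print(with_param_text)

# literal_column(): the value is rendered directly into the SQL text
inlined = select(tbl.c.col_a).where(tbl.c.col_b == literal_column("'fixed'"))
inlined_text = inlined.compile(dialect=PG_DIALECT).string
print(inlined_text)
```

The second form keeps the pre-compiled text free of anonymous parameters, at the cost of hand-quoting the literal.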

zzzeek commented 1 year ago

IMO it does work 100%. The "col IN (?)" idea does not work on any driver other than psycopg2, and psycopg2 is an obsolete driver, so all users will need the params spelled out, which means the number of IN positions must be known ahead of time.

And for that, if you hardcode the length of the array you want in your IN, you can get the params like this:

query = select(tbl.columns.col_a).where(
    tbl.columns.col_b.in_(bindparam("input_values", [None, None, None]))
)

query_text = query.compile(
    dialect=postgresql.dialect(),
    compile_kwargs={"render_postcompile": True}
).string
print(query_text)

then you get your three positions:

SELECT our_table.col_a 
FROM our_table 
WHERE our_table.col_b IN (%(input_values_1)s, %(input_values_2)s, %(input_values_3)s)

AndySemenov commented 1 year ago

Sorry for that. I've driven the discussion far beyond the scope of this ticket. I don't dispute that it's closed as won't-fix (out of scope). That's it. If you are really interested in what's going on in our project with Executable-to-SQL-query compilation, let's move somewhere else.

Thanks for pointing out the issue with my request itself.

zzzeek commented 1 year ago

No worries! I'm as unhappy with how drivers handle IN as anyone. There's just not much we can do beyond what we are doing.