tlocke / pg8000

A Pure-Python PostgreSQL Driver
BSD 3-Clause "New" or "Revised" License

SELECT query hits param limit on SQLAlchemy ('h' format limit 32767) #150

Closed sebastian-correa closed 7 months ago

sebastian-correa commented 8 months ago

Hi! First off, thanks for pg8000. We've been using it in production to connect to Cloud SQL instances and it has simplified our setup significantly.

We have many queries that SELECT from tables with a WHERE column_a IN (...) filter, where the list of valid values can be very long (the maximum, ~50k items, is rare). I know it's not ideal, so we'd be open to redesigning the queries altogether, but we don't see any viable alternatives at the moment.

I know the error is ultimately because Postgres limits the number of parameters per query, but psycopg+SQLAlchemy (which we use for local development) handles this transparently while pg8000+SQLAlchemy does not. What's the best way to handle this query with pg8000?

These are some alternatives that we'd rather avoid because they complicate our code quite a lot:

Sorry if this is really an issue with how SQLAlchemy behaves when using pg8000! Please let me know if that's the case so we can open a similar question to them.

Versions:

tlocke commented 8 months ago

Hi @sebastian-correa, glad you're enjoying pg8000! With the PostgreSQL protocol you can either set the types of the parameters explicitly or let the server infer them. To set them explicitly you use the setinputsizes method. If you set them explicitly, the PostgreSQL protocol has a limit on the maximum number of parameters, which is what you're running into. SQLAlchemy always uses setinputsizes, so you're always going to hit this limit with SQLAlchemy.
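
For illustration, here's a minimal sketch of declaring a parameter type explicitly through the DB-API setinputsizes method, which is the call SQLAlchemy's pg8000 dialect makes. The raw OID 23 (int4) is used on the assumption that pg8000 accepts type OIDs as sizes:

import pg8000.dbapi

conn = pg8000.dbapi.connect(user="postgres", password="cpsnow")
cursor = conn.cursor()
# Declare the single parameter as int4 (OID 23) rather than letting the
# server infer its type (assumes setinputsizes accepts raw OIDs).
cursor.setinputsizes(23)
cursor.execute("SELECT %s", [42])
print(cursor.fetchall())
conn.close()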

So one option would be to use pg8000 directly and not call setinputsizes. Another option is to rewrite the SQLAlchemy dialect for pg8000 so that it doesn't call setinputsizes. I think SQLAlchemy currently does it this way for historical reasons, and I'm sure altering the dialect would be doable without too much of a problem with backward compatibility; it might even fix some existing bugs in the dialect!

sebastian-correa commented 8 months ago

Hi @tlocke. So if I understand this correctly, there's nothing that can be done from the pg8000 side? Do you know how psycopg manages this?

To your dialect rewrite suggestion: is this something I could easily do myself and patch into SQLAlchemy? If so, it might be worth a shot and, maybe, the implementation could become part of pg8000 so that users who need it can patch it in manually (?). I don't know the internals of either library, but I'm happy to look into it if you give me some pointers :).

tlocke commented 8 months ago

Actually, ignore what I said about avoiding setinputsizes; even if you did avoid it there would still be a problem, apologies! So I think the only way around it would be to send the query as a simple string rather than sending a string with placeholders and parameters separately. I think psycopg just sends a simple string. Here is an example that reproduces the problem:

import pg8000.dbapi

conn = pg8000.dbapi.connect(user="postgres", password="cpsnow")
cursor = conn.cursor()
SIZE = 100000
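# Each %s placeholder becomes its own bind parameter, so this statement
# carries 100,000 parameters in a single Bind message.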
cursor.execute(
    f"SELECT 1 WHERE 1 IN ({','.join(['%s'] * SIZE)})",
    [1] * SIZE,
)
conn.close()

It'll give the error:

Traceback (most recent call last):
  File "/home/tlocke/pg8000/example.py", line 6, in <module>
    cursor.execute(
  File "/home/tlocke/pg8000/pg8000/dbapi.py", line 471, in execute
    self._context = self._c.execute_unnamed(
  File "/home/tlocke/pg8000/pg8000/core.py", line 701, in execute_unnamed
    self.send_BIND(NULL_BYTE, params)
  File "/home/tlocke/pg8000/pg8000/core.py", line 765, in send_BIND
    NULL_BYTE + statement_name_bin + h_pack(0) + h_pack(len(params))
struct.error: 'h' format requires -32768 <= number <= 32767
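
The 'h' here is Python's struct format code for a signed 16-bit integer: pg8000 packs the parameter count into the Bind message with h_pack, so any count above 32767 overflows. A quick demonstration (the 'H' variant is the unsigned code, with twice the positive range):

import struct

struct.pack("!h", 32767)  # fine: 'h' is a signed 16-bit integer
struct.pack("!H", 65535)  # fine: 'H' is unsigned, twice the positive range
struct.pack("!h", 32768)  # struct.error: 'h' format requires -32768 <= number <= 32767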

But if you create a simple string it will work:

import pg8000.dbapi
from pg8000.native import literal

conn = pg8000.dbapi.connect(user="postgres", password="cpsnow")

cursor = conn.cursor()
SIZE = 100000
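# literal() renders each value into the SQL text itself, so the query is
# sent as one plain string with no bind parameters at all.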
cursor.execute(f"SELECT 1 WHERE 1 IN ({','.join([literal(1)] * SIZE)})")
conn.close()

sebastian-correa commented 8 months ago

While it's true this would work and could be done from a "centralized" place (we have a query method where we could compile the SELECT query with the params into a string), it'd entail extra effort to correctly stringify "weird types" like UUIDs and such.
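
As a small illustration of that concern, here is a hedged sketch of quoting a UUID by hand via its string form; verifying that every such type renders correctly is exactly the extra effort in question:

import uuid

from pg8000.native import literal

u = uuid.uuid4()
# Stringify the UUID explicitly and let literal() quote and escape it;
# every other "weird type" needs a similar, manually checked rule.
print(literal(str(u)))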

SQLAlchemy also seems to recommend against this [ref]:

The reason SQLAlchemy does not support full stringification of all datatypes is threefold:

  1. This is a functionality that is already supported by the DBAPI in use when the DBAPI is used normally. (...)

  2. Stringifying with bound parameters inlined for specific databases suggests a usage that is actually passing these fully stringified statements onto the database for execution. This is unnecessary and insecure, and SQLAlchemy does not want to encourage this use in any way.

  3. The area of rendering literal values is the most likely area for security issues to be reported. SQLAlchemy tries to keep the area of safe parameter stringification an issue for the DBAPI drivers as much as possible where the specifics for each DBAPI can be handled appropriately and securely.

I'm not sure whether using Select.compile(engine) would solve these security concerns or not, though.
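
For reference, a minimal sketch of that compile route using SQLAlchemy's literal_binds flag; the table and column names are invented for illustration:

from sqlalchemy import column, select, table

items = table("items", column("id"))
stmt = select(items).where(items.c.id.in_([1, 2, 3]))
# Inline the bound parameters into the SQL string; this is the full
# stringification the SQLAlchemy docs quoted above caution against.
print(stmt.compile(compile_kwargs={"literal_binds": True}))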

As an extra comment, we asked SQLAlchemy about this a while back as well, but they didn't have a particularly good solution either [ref].

What are your thoughts? Should we stringify everything anyway, batch the query somehow, or something else?

tlocke commented 8 months ago

I've just found a bug in pg8000, and fixing it means that the number of parameters you can send is now doubled. The fix is in version 1.30.5; maybe that solves the problem for now?

sebastian-correa commented 8 months ago

@tlocke, thanks for the work! In practice this helps a lot, as we'll almost never hit the new limit in production. However, this is a critical query and we can't risk having our application stop or fail due to a parameter limit issue, so we still have to do something about this on our end.

I think we'll end up migrating away from cloud-sql-python-connector to running the proxy manually and calling it a day. This should help in our use case, though it doesn't technically handle the limit with pg8000 :(.

I'd rather not do this, as the Python connector + pg8000 solution is much easier to use, but given that we can't solve this in a sure-fire way with that combo, we have to look into the proxy :(.

tlocke commented 8 months ago

One way of doing it is to use a single parameter that's an array, and then use the unnest() function:

import pg8000.dbapi

conn = pg8000.dbapi.connect(user="postgres", password="cpsnow")
cursor = conn.cursor()
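# The whole Python list is sent as ONE array-typed parameter, so the
# Bind message's 16-bit parameter count is never exceeded.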
cursor.execute(
    "SELECT 1 WHERE 1 IN (SELECT unnest(CAST(%s AS int[])))",
    [[1] * 100000],
)
conn.close()

I've added an example to the documentation: https://github.com/tlocke/pg8000/tree/main?tab=readme-ov-file#parameter-limit

sebastian-correa commented 7 months ago

That's very clever. I'll give it a shot!

sebastian-correa commented 7 months ago

Worked! Thanks!