googleapis / python-bigquery-pandas

Google BigQuery connector for pandas
https://googleapis.dev/python/pandas-gbq/latest/index.html
BSD 3-Clause "New" or "Revised" License

Running a parameterized query via a SQLAlchemy Selectable #478

Open rjrudin opened 2 years ago

rjrudin commented 2 years ago

Is your feature request related to a problem? Please describe.

I'm back again after submitting #474 and #476, as we're trying to shift from using pd.read_sql_query to using pandas_gbq.read_gbq.

The current issue is that when we construct our SQLAlchemy Selectable, we may toss a row limit onto it based on user input, e.g.:

sql: Select = select(literal_column("*")).select_from(...)
sql = sql.limit(50)

pandas_gbq.read_gbq wants a str as its query argument, so we're trying to compile the Selectable and serialize it as a string:

query = str(sql.compile(self.engine))

However, SQLAlchemy serializes its named parameters in a way that I can't get to work with pandas_gbq (and more generally, that won't work with BigQuery):

SELECT * 
FROM `some_dataset.some_table`
LIMIT %(param_1:INT64)s

I found an example in the pandas_gbq tests of how to configure query/queryParameters in the "configuration" dict that can be passed to read_gbq, but that doesn't help on its own, because BigQuery expects named parameters written as @name, which isn't what SQLAlchemy emits.
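For reference, here is a rough sketch of the shape that configuration takes (untested here, using a WHERE-clause parameter purely to illustrate; all names below are made up):

import pandas_gbq

# The "configuration" dict mirrors the BigQuery jobs REST resource
# (query.parameterMode / query.queryParameters). Names here are illustrative.
df = pandas_gbq.read_gbq(
    "SELECT * FROM `some_dataset.some_table` WHERE id = @target_id",
    configuration={
        "query": {
            "parameterMode": "NAMED",
            "queryParameters": [
                {
                    "name": "target_id",
                    "parameterType": {"type": "INT64"},
                    "parameterValue": {"value": "42"},
                }
            ],
        }
    },
)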

Describe the solution you'd like

Generally, I need a way (hopefully it exists already and I'm just missing it!) to serialize a SQLAlchemy Selectable into a query string that can be the input to read_gbq. For now, just being able to handle the limit() parameter above would suffice. I am, of course, trying to avoid string concatenation, and I'm hoping there's some way to set up the "configuration" dict to make this work.
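One possible workaround for the LIMIT case specifically (a sketch, not verified against sqlalchemy-bigquery) is to have SQLAlchemy render bound values inline via literal_binds, so the compiled string contains no placeholders by the time it reaches read_gbq. This is only reasonable for values we trust, like an internally generated row limit:

# Sketch: render bound values (including the LIMIT) as literals so the
# compiled string contains no placeholders. Suitable only for trusted values.
query = str(sql.compile(self.engine, compile_kwargs={"literal_binds": True}))
df = pandas_gbq.read_gbq(query)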

Describe alternatives you've considered

N/A

Additional context

N/A

rjrudin commented 2 years ago

Not sure if this is relevant: with SQLAlchemy 1.4.25 and sqlalchemy-bigquery 1.3.0, when I run this:

from sqlalchemy_bigquery import BigQueryDialect
str(sql.compile(dialect=BigQueryDialect))

I get this error:

AttributeError: type object 'BigQueryDialect' has no attribute 'positional'

I tried using "setattr" to fake-fix that, but I get several other errors about attributes missing. Not sure if that would work, and so I've stuck with just doing compile(self.engine) or compile(bind=self.engine).

rjrudin commented 2 years ago

If this is really an issue for https://github.com/googleapis/python-bigquery-sqlalchemy , I'm happy to move it there too.

tswast commented 2 years ago

This is a tough one. I think it does belong here, as the SQLAlchemy connector needs to continue to generate SQL that can be understood by the google.cloud.bigquery.dbapi modules.

I would like to allow this library to take SQLAlchemy-generated SQL somehow, possibly via an optional dependency on SQLAlchemy so pandas-gbq can run the compile itself.

willsthompson commented 1 year ago

I assume it's built this way because DBAPI doesn't include a BigQuery-compatible parameter substitution style. The sqlalchemy-bigquery dialect compiles SQL into the pyformat style, then at runtime transforms the pyformatted query into a BigQuery-compatible one and builds params that can be executed against a bigquery.Client. Pretty confusing, but I pulled out the relevant pieces that do the conversion:

from sqlalchemy import select
from google.cloud import bigquery
from google.cloud.bigquery.dbapi._helpers import to_query_parameters
from google.cloud.bigquery.dbapi.cursor import _format_operation

query = select(...)  # SQLAlchemy query
compiled = query.compile(engine)  # engine bound to the BigQuery dialect, or compile(dialect=BigQueryDialect())
parameters = compiled.params
# Rewrite the pyformat-style placeholders into BigQuery's @name parameters
# and extract any declared parameter types.
bq_query, parameter_types = _format_operation(str(compiled), parameters)
job_config = bigquery.QueryJobConfig(
    query_parameters=to_query_parameters(parameters, parameter_types)
)
# client = bigquery.Client(...)
query_job = client.query(bq_query, job_config=job_config)

This is pretty boiled down and doesn't address any edge cases, so YMMV, but it works for the relatively small scope of queries we expect.
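If the end goal is a DataFrame (as it would be with read_gbq), the finished job can be pulled into pandas directly with the plain google-cloud-bigquery API, e.g.:

# Materialize the parameterized query's results as a pandas DataFrame.
df = query_job.result().to_dataframe()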