googleapis / python-bigquery-sqlalchemy

SQLAlchemy dialect for BigQuery
MIT License
435 stars 130 forks source link

GROUP BY clause will not properly rendered if there is a function applied column expression #879

Closed sgryjp closed 8 months ago

sgryjp commented 1 year ago

Environment details

Steps to reproduce

  1. Compose an SQL statement using SQLAlchemy Core or ORM which includes a GROUP BY clause with an expression timestamp_trunc(`some_column`, hour)
  2. Render it (or execute it and confirm how it was rendered by examining SQLAlchemy's log)

Code example

Below is an example script to reproduce the symptom. Please refer to the first comment block in the file for detail.

"""An example script to reproduce the bug.

Executing this script outputs (partially folded for ease of reading):

    sqlalchemy:          1.4.49
    sqlalchemy-bigquery: 1.6.1
    SQLite----------------------------------------
    SELECT timestamp_trunc(sensor.time, hour) AS time,
           max(sensor.value) AS max_1
    FROM sensor
    GROUP BY timestamp_trunc(sensor.time, hour)
    BigQuery----------------------------------------
    SELECT timestamp_trunc(`sensor`.`time`, hour) AS `time`,
           max(`sensor`.`value`) AS `max_1`
    FROM `sensor`
    GROUP BY `timestamp_trunc_1`

In the final line, there is a reference to a non-existing colunn
of which name is `timestamp_trunc_1`. The line should be rendered as:

    GROUP BY timestamp_trunc(`sensor`.`time`, hour)

...like the rendering result for SQLite, written above.
"""
import sqlalchemy as sa
import sqlalchemy.dialects.sqlite as sqlite
import sqlalchemy_bigquery as bigquery

print("sqlalchemy:         ", sa.__version__)
print("sqlalchemy-bigquery:", bigquery.__version__)

engine = sa.create_engine("sqlite://")
metadata = sa.MetaData()

sensor_table = sa.Table(
    "sensor",
    metadata,
    sa.Column("time", sa.TIMESTAMP),
    sa.Column("value", sa.String),
)

stmt = sa.select(
    sa.func.timestamp_trunc(sensor_table.c.time, sa.text("hour")).label("time"),
    sa.func.max(sensor_table.c.value),
).group_by(
    sa.func.timestamp_trunc(sensor_table.c.time, sa.text("hour")),
)

print("SQLite" + "-" * 40)
print(
    stmt.compile(
        dialect=sqlite.dialect(),
        compile_kwargs={"literal_binds": True},
    ),
)

print("BigQuery" + "-" * 40)
print(
    stmt.compile(
        dialect=bigquery.dialect(),
        compile_kwargs={"literal_binds": True},
    ),
)

Stack trace

N/A (no exception will be raised unless you send it to BigQuery)

liutmelody commented 9 months ago

Would love to know the cause of this! I'm getting the same error when using ROLLUP inside the groupby clause: Example code:

import sqlalchemy as sa
from sqlalchemy_bigquery import BigQueryDialect

        query = (
            sa.select(sql_projections)
            .select_from(model_expression)
            .where(sa.and_(*filter_expressions))
            .group_by(func.rollup(*groupby_expressions)) // <-- relevant rollup expression 
            .having(sa.and_(*having_expressions))
        )

        query_str = str(
            query.compile(dialect=BigQueryDialect())
        )

Output Query:

SELECT timestamp_trunc(`created_date`, YEAR) AS `created_date`,
       `agency` AS `agency`,
       `city` AS `city`,
       count(*) AS `row_count`,
FROM `quickstart.nyc_311`
GROUP BY `rollup_1` // <-- error here, this should be ROLLUP(1, 2, ..) as it is in other dialects 

I've been working around it by using sa.text but would appreciate some insight into why this is happening.

kiraksi commented 9 months ago

Working on implementing group by, rollup and cube clauses as part of another customer issue, which should include a fix for this

kiraksi commented 8 months ago

@sgryjp @liutmelody This fix is now a part of branch development-build-1.11.0.dev3. I will be setting up a prerelease soon on PyPI as well for this development release, the reason being that this branch is based off our SQLAlchemy 1.4-2.0+ migration work. Please try it out and let me know if it solves your problem, if it doesn't feel free to reopen this issue.

sgryjp commented 8 months ago

@kiraksi Hi, thank you for fixing this issue! I confirmed that the problem was fixed by your commit 5cfc28089baa6106cf30f9efb268231792da9251.

I also confirmed that the branch development-build-v1.11.0.dev3 does reproduce the problem. It seems that the commit bd38a5e14145ca77ec462e7cf4e9a989f2eb1fb3 somehow gets the bug back again. Could you check if there is anything suspicious? Thank you in advance.

```powershell (.venv) PS> git log --oneline development-build-v1.11.0.dev3 87a75dc (HEAD, origin/development-build-v1.11.0.dev3, development-build-v1.11.0.dev3) create development build 1.11.0.dev3 0c882b9 test commit 033d329 reformat logic 5154814 added test case bd38a5e fixed render label as label assignment 68afc39 test basic implementation of group_by_clause and visit_label ece7f1f removed unnecessary clause function changes, edited tests e82f5dd test commit to run kokooro tests 310cfa7 Merge branch 'development-build-v1.11.0.dev2' into grouping-rollup-cube 0c4cf07 (tag: v1.11.0.dev2) create development release 1.11.0.dev2 5cfc280 feat: grouping sets, rollup and cube compatibility ...(log continues)... (.venv) PS> pip install --force-reinstall git+https://github.com/googleapis/python-bigquery-sqlalchemy.git@bd38a5e14145ca77ec462e7cf4e9a989f2eb1fb3 >$null; python .\issue879.py Running command git clone --filter=blob:none --quiet https://github.com/googleapis/python-bigquery-sqlalchemy.git 'C:\Users\sgryjp\AppData\Local\Temp\pip-req-build-d9h_kbnv' Running command git rev-parse -q --verify 'sha^bd38a5e14145ca77ec462e7cf4e9a989f2eb1fb3' Running command git fetch -q https://github.com/googleapis/python-bigquery-sqlalchemy.git bd38a5e14145ca77ec462e7cf4e9a989f2eb1fb3 Running command git checkout -q bd38a5e14145ca77ec462e7cf4e9a989f2eb1fb3 sqlalchemy: 2.0.27 sqlalchemy-bigquery: 1.11.0.dev2 SQLite---------------------------------------- SELECT timestamp_trunc(sensor.time, hour) AS time, max(sensor.value) AS max_1 FROM sensor GROUP BY timestamp_trunc(sensor.time, hour) BigQuery---------------------------------------- SELECT timestamp_trunc(`sensor`.`time`, hour) AS `time`, max(`sensor`.`value`) AS `max_1` FROM `sensor` GROUP BY `timestamp_trunc_1` (.venv) PS> pip install --force-reinstall git+https://github.com/googleapis/python-bigquery-sqlalchemy.git@68afc3959d988b77a244699e7d5e4f39bd233917 >$null; python .\issue879.py Running command git clone --filter=blob:none --quiet https://github.com/googleapis/python-bigquery-sqlalchemy.git 'C:\Users\sgryjp\AppData\Local\Temp\pip-req-build-hywz32py' Running command git rev-parse -q --verify 'sha^68afc3959d988b77a244699e7d5e4f39bd233917' Running command git fetch -q https://github.com/googleapis/python-bigquery-sqlalchemy.git 68afc3959d988b77a244699e7d5e4f39bd233917 Running command git checkout -q 68afc3959d988b77a244699e7d5e4f39bd233917 sqlalchemy: 2.0.27 sqlalchemy-bigquery: 1.11.0.dev2 SQLite---------------------------------------- SELECT timestamp_trunc(sensor.time, hour) AS time, max(sensor.value) AS max_1 FROM sensor GROUP BY timestamp_trunc(sensor.time, hour) BigQuery---------------------------------------- SELECT timestamp_trunc(`sensor`.`time`, hour) AS `time`, max(`sensor`.`value`) AS `max_1` FROM `sensor` GROUP BY timestamp_trunc(`sensor`.`time`, hour) ```
kiraksi commented 8 months ago

@sgryjp This is odd, the previous commit you mentioned was a fix that introduced other issues, I created a new fix that is now on development-build-v1.11.0.dev3, I even confirmed the fix by adding various test cases. Can you confirm again whether development-build-v1.11.0.dev3 does not fix your issue with the new commits there? This may actually be an issue with the labeling from the within_group_by=True value passed in to group by for timestamp_trunc. Please let me know, otherwise its a quick fix to add TIMESTAMP_TRUNC to the list of exclusions for that label.

sgryjp commented 8 months ago

@kiraksi Thank you for response!

I confirmed that neither of tag v1.11.0.dev3 nor HEAD of the branch development-build-v1.11.0.dev3 does not fix my issue. Running the "code example" in the first comment results the same as below:

```sh $ pip install --force-reinstall git+https://github.com/googleapis/python-bigquery-sqlalchemy.git@v1.11.0.dev3; python issue879.py ...(omitted)... sqlalchemy: 2.0.27 sqlalchemy-bigquery: 1.11.0.dev3 SQLite---------------------------------------- SELECT timestamp_trunc(sensor.time, hour) AS time, max(sensor.value) AS max_1 FROM sensor GROUP BY timestamp_trunc(sensor.time, hour) BigQuery---------------------------------------- SELECT timestamp_trunc(`sensor`.`time`, hour) AS `time`, max(`sensor`.`value`) AS `max_1` FROM `sensor` GROUP BY `timestamp_trunc_1` ```