ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.2k stars 590 forks source link

bug: trino/starburst unexpected behavior with .sql #7050

Closed lostmygithubaccount closed 1 year ago

lostmygithubaccount commented 1 year ago

What happened?

this is with the Trino backend connected to a demo Starburst Galaxy account using the sample dataset they provide, it should be relatively easily reproducible. in summary for some reason passing in SQL to .sql() doesn't work as expected; the equivalent Ibis is fine; the SQL versions aren't ordered as expected

[ins] In [1]: import os # <1>
         ...: import ibis # <1>
         ...: from dotenv import load_dotenv # <1>
         ...:
         ...: # configure Ibis # <2>
         ...: ibis.options.interactive = True # <2>
         ...:
         ...: # load env variables # <3>
         ...: load_dotenv() # <3>
         ...: user = os.getenv("USERNAME") # <3>
         ...: password = os.getenv("PASSWORD") # <3>
         ...:
         ...: # variables # <4>
         ...: host = os.getenv("HOSTNAME") # <4>
         ...: port = os.getenv("PORTNUMBER") # <4>
         ...: catalog = "sample" # <4>
         ...: schema = "demo" # <4>
         ...:
         ...: # connect to database # <5>
         ...: con = ibis.trino.connect( # <5>
         ...:     user=user, password=password, host=host, port=port, dat
         ...: abase=catalog, schema=schema # <5>
         ...: ) # <5>
         ...: con
Out[1]: <ibis.backends.trino.Backend at 0x1060f5d90>

[ins] In [2]: astronauts = con.table("astronauts") # <1>
         ...: missions = con.table("missions") # <1>

[ins] In [3]: sql = f"""
         ...: SELECT
         ...:   name,
         ...:   count() AS nbr_missions
         ...: FROM
         ...:   sample.demo.astronauts
         ...: GROUP BY
         ...:   name
         ...: ORDER BY
         ...:   nbr_missions DESC;
         ...: """.strip().strip(";")
         ...: print(sql)
SELECT
  name,
  count() AS nbr_missions
FROM
  sample.demo.astronauts
GROUP BY
  name
ORDER BY
  nbr_missions DESC

[ins] In [5]: con.sql(sql)
Out[5]:
┏━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ name                   ┃ nbr_missions ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ string                 │ int64        │
├────────────────────────┼──────────────┤
│ Gagarin, Yuri          │            1 │
│ Titov, Gherman         │            1 │
│ Tereshkova, Valentina  │            1 │
│ Feoktistov, Konstantin │            1 │
│ Leonov, Aleksei        │            2 │
│ Grissom, Virgil I.     │            1 │
│ McDivitt, James A.     │            2 │
│ White, Edward H., II   │            1 │
│ Borman, Frank          │            2 │
│ Lovell, James A., Jr.  │            4 │
│ …                      │            … │
└────────────────────────┴──────────────┘

[ins] In [6]: astronauts.group_by("name").agg(nbr_missions=ibis._.count()
         ...: ).order_by(
         ...:     ibis._["nbr_missions"].desc()
         ...: )
Out[6]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ name                     ┃ nbr_missions ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ string                   │ int64        │
├──────────────────────────┼──────────────┤
│ Ross, Jerry L.           │            7 │
│ Chang-Diaz, Franklin R.  │            7 │
│ Wetherbee, James D.      │            6 │
│ Malenchenko, Yuri        │            6 │
│ Young, John W.           │            6 │
│ Musgrave, Franklin Story │            6 │
│ Brown, Curtis L., Jr.    │            6 │
│ Krikalev, Sergei         │            6 │
│ Foale, C. Michael        │            6 │
│ Ivins, Marsha S.         │            5 │
│ …                        │            … │
└──────────────────────────┴──────────────┘

[ins] In [7]: print(ibis.to_sql(astronauts.group_by("name").agg(nbr_missi
         ...: ons=ibis._.count()).order_by(
         ...:     ibis._["nbr_missions"].desc()
         ...: )))
SELECT
  t0.name,
  t0.nbr_missions
FROM (
  SELECT
    t1.name AS name,
    COUNT(*) AS nbr_missions
  FROM "sample".demo.astronauts AS t1
  GROUP BY
    1
) AS t0
ORDER BY
  t0.nbr_missions DESC

[nav] In [8]: con.sql(ibis.to_sql(astronauts.group_by("name").agg(nbr_mis
         ...: sions=ibis._.count()).order_by(
         ...:     ibis._["nbr_missions"].desc()
         ...: )))
Out[8]:
┏━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ name                    ┃ nbr_missions ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ string                  │ int64        │
├─────────────────────────┼──────────────┤
│ Glenn, John H., Jr.     │            2 │
│ Carpenter, M. Scott     │            1 │
│ Nikolayev, Andriyan     │            2 │
│ Popovich, Pavel         │            2 │
│ Schirra, Walter M., Jr. │            3 │
│ Cooper, L. Gordon, Jr.  │            2 │
│ Bykovsky, Valery        │            3 │
│ Komarov, Vladimir       │            2 │
│ Yegorov, Boris          │            1 │
│ Belyayev, Pavel         │            1 │
│ …                       │            … │
└─────────────────────────┴──────────────┘

What version of ibis are you using?

master

What backend(s) are you using, if any?

trino

Relevant log output

see above

Code of Conduct

cpcloud commented 1 year ago

I understand the motivation for the example, but it's quite long.

The long example is fine, but for the future would you mind showing a separate code block that contains only the failure? It's hard to see what's happening because of all the setup code.

cpcloud commented 1 year ago

Also you don't need to keep strip().strip(";")-ing. Are you seeing failures when you don't do that?

cpcloud commented 1 year ago

Another thing that would help cut the example down is making some variables of the expressions instead of duplicating them only to show their sql.

cpcloud commented 1 year ago

Sorry, I keep noticing things 😄

missions is unused, which adds noise to understanding the problem

cpcloud commented 1 year ago

Hm, there's a final outer projection, which will erase the ORDER BY.

It seems to only show up in the trino backend curiously enough: there's no outer projection generated for the duckdb backend.

cpcloud commented 1 year ago

Ok, it's not really a duckdb vs trino vs other issue, we're generating an extra project for all sqlalchemy backends. It just happens to be the case that DuckDB will presents the results in the same order every time, but it's definitely not something we should depend on.