ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.29k stars 596 forks source link

feat: how to access multiple table aliases with `.sql`? #10436

Open noklam opened 1 week ago

noklam commented 1 week ago

Is your feature request related to a problem?

I am trying to use the SQL expression for multiple CTE but the Table.sql only allows for a single table.

What is the motivation behind your request?

I found https://ibis-project.org/how-to/extending/sql very useful as an intermediate step to migrate existing SQL scripts to ibis. Instead of going from SQL -> ibis immediately, it is sometimes useful to breakdown a complex SQL into multiple ibis expressions (using the .sql). This requires minimal effort but already make the SQL scripts much more debuggable.

Consider this example

with a as (select * from raw_a),
with b as (select * from raw_b)
select a.*, b.* where a.id=b.id

To break it down with ibis, I start with copying the CTE as follow

a = con.sql("select * from raw_a").alias("a")
b = con.sql("select ( from raw_b").alias("b")

All follow option will fail as it doesn't understand the table alias

con.sql("select a.*, b.* from a where a.id=b.id") #doesn't know a & b
a.sql("select a.*, b.* from a where a.id=b.id") # doesn't know b
b.sql("select a.*, b.* from a where a.id=b.id") # doesn't know a

This make it possible to debug, or inspect data quickly inside a notebook. Then I want to join them together but I cannot find the correct API. Backend.sql only recognize tables that are exists already (raw_a, raw_b).

while a.sql() will only recognize a but not b, and b.sql() only recognize b but not a. Chaining ibis expression with SQL is powerful but this seems to limited to a single table, which make joins impossible.

I know this is doable if I go for full ibis dataframe code, but this is not what I want here.

Describe the solution you'd like

But some reason the documentation to extend SQL with expression is quite hidden. I have browsed the documentation a couple of times before but didn't know this feature exist.

What version of ibis are you running?

ibis-framework 9.5.0

What backend(s) are you using, if any?

No response

Code of Conduct

IndexSeek commented 6 days ago

Thank you for raising this issue!

But some reason the documentation to extend SQL with expression is quite hidden. I have browsed the documentation a couple of times before but didn't know this feature exist.

I do think it would be worthwhile for us to explain this behavior in greater detail.

I was able to repro this behavior you're seeing. It seems that, at least in the case of the DuckDB backend, the alias method wraps the expression itself in a CTE.

In [1]: import pandas as pd

In [2]: import ibis

In [3]: con = ibis.duckdb.connect()

In [4]: con.create_table("raw_a", pd.DataFrame({"id": [1, 2, 3]}))
Out[4]: 
DatabaseTable: memory.main.raw_a
  id int64

In [5]: con.create_table("raw_b", pd.DataFrame({"id": [2, 3, 4]}))
Out[5]: 
DatabaseTable: memory.main.raw_b
  id int64

In [6]: a = con.sql("SELECT * FROM raw_a").alias("a")

In [7]: b = con.sql("SELECT * FROM raw_b").alias("b")

In [8]: a.sql("SELECT * FROM a").compile()
Out[8]: 'WITH "a" AS (SELECT * FROM raw_a) SELECT * FROM a'

In [9]: b.sql("SELECT * FROM b").compile()
Out[9]: 'WITH "b" AS (SELECT * FROM raw_b) SELECT * FROM b'

Extending the expression alias with .sql would be possible to refer to persisted objects, but I don't think this is what you want to achieve, is it?

In [10]: b.sql("select a.*, b.* from raw_a as a, b where a.id=b.id")
Out[10]: 
r0 := SQLQueryResult
  query:
    SELECT * FROM raw_b
  schema:
    id int64

r1 := View: b
  id int64

SQLStringView[r1]
  query:
    select a.*, b.* from raw_a as a, b where a.id=b.id
  schema:
    id int64
noklam commented 6 days ago

@IndexSeek Thanks for the responss. b.sql("select a.*, b.* from raw_a as a, b where a.id=b.id"), here you can only refer to b directly but not a, you kind of redo the alias here (raw_a as a), but imagine this is actually a full CTE, or multiple one. The it essentially become putting the entire queries in one and that's not what I want.

noklam commented 6 days ago

I actually expect it to be wrapped as a CTE, this kind of provide the flexibility to breakdown a complex query into small one. If this is done in SQL you have to keep one big query, but in ibis things are lazy, so I can either run only a single CTE (for debugging), but also compose multiple CTE to do complex query. (Think dbt ref{{ table}}, which is more like string templating)

IndexSeek commented 6 days ago

You're welcome! I discovered that it's possible to chain multiple .sql methods and continue to alias within the same expression, allowing you to reference those aliases as they are created as CTEs. However, I’m still not sure if this will provide what you’re looking for.

In [14]: con.sql("SELECT * FROM raw_a").alias("a").sql("SELECT * FROM raw_b").alias("b").sql(
    ...:     "SELECT a.id, b.id AS id_2 FROM a, b WHERE a.id=b.id"
    ...: )
Out[14]: 
┏━━━━━━━┳━━━━━━━┓
┃ id    ┃ id_2  ┃
┡━━━━━━━╇━━━━━━━┩
│ int64 │ int64 │
├───────┼───────┤
│     2 │     2 │
│     3 │     3 │
└───────┴───────┘

In [15]: con.sql("SELECT * FROM raw_a").alias("a").sql("SELECT * FROM raw_b").alias("b").sql(
    ...:     "SELECT a.id, b.id AS id_2 FROM a, b WHERE a.id=b.id"
    ...: ).compile()
Out[15]: 'WITH "a" AS (SELECT * FROM raw_a), "b" AS (SELECT * FROM raw_b) SELECT a.id, b.id AS id_2 FROM a, b WHERE a.id = b.id'

This might allow for running the individual expressions as part of .sql independently for debugging and would allow for appending them if the results are as expected, but I'm hoping there might be another way to achieve this.

noklam commented 5 days ago

@IndexSeek This is getting closer and I can somewhat do what I want.

image

Though I think it's weird when I build table b I need to use a.sql where they are completely separate. This will become a problem when I have one CTE but want to shared across different complex queries.

I do not know ibis internal enough, but it seems that the information is only stored in the ibis.expr.types.relations.Table. If there is a way to merge two ibis.expr.types.relations.Table into one then we may be able to get rid of this constraint.