ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0
5.07k stars 586 forks source link

feat: support unnest of struct with trino backend #9998

Open Ezibenroc opened 1 week ago

Ezibenroc commented 1 week ago

Is your feature request related to a problem?

Disclaimer: I am quite new to ibis, so perhaps I missed something.

Let's say I have a table with a column that contains arrays of struct. For example, a struct with two fields:

some_col my_nested_column
1 [{foo: 10, bar: 20}, {foo:11, bar: 21}]
2 [{foo: 12, bar: 22}]

I have this SQL query that works well on my trino DB to unnest this column:

SELECT
    t.foo,
    t.bar
FROM my_db.my_table
CROSS JOIN UNNEST(my_nested_column) AS t (foo, bar)
LIMIT 10

I am trying to do the same with ibis with this code:

t = con.table(name="my_table", database="my_db")
tmp = t.unnest(t.my_nested_column).head(10)
tmp.to_pandas()

Unfortunately, I get the following exception:

TrinoUserError: TrinoUserError(type=USER_ERROR, name=MISMATCHED_COLUMN_ALIASES, message="line 1:391: Column alias list has 1 entries but '"ibis_table_unnest_kelyvy3bprd4dimwx7rk4ia5hi"' has 2 columns available", query_id=20240902_151758_26511_qgqzg)

My guess is that it is because the code produced by the trino backend from ibis does not list the fields of the struct. Calling ibis.to_sql(tmp) produces this query:

SELECT
  "ibis_table_unnest_column_wt6chn4jxra6dbfrnunq7tplom" AS "my_nested_column"
FROM "my_db.my_table" AS "t0"
CROSS JOIN UNNEST("t0"."my_nested_column") AS "ibis_table_unnest_bcm5qmv32bgjtlum4uclrclvpq"("ibis_table_unnest_column_wt6chn4jxra6dbfrnunq7tplom")
LIMIT 10

I found this related SO question, writing explicitly all the fields of the struct seems to be needed.

Note that I tried a similar code on another column that has simple lists of integers instead of lists of structs and it works well, so the struct seem to be the issue.

What is the motivation behind your request?

No response

Describe the solution you'd like

I guess a possible solution would be for the user to give the list of the fields in the unnest function. For instance:

tmp = t.unnest(t.my_nested_column, fields=["foo", "bar"]).head(10)

But I am not sure what the other back-ends should do when this argument is provided. Perhaps just ignore it?

What version of ibis are you running?

9.3.0

What backend(s) are you using, if any?

Trino

Code of Conduct

cpcloud commented 1 week ago

Thanks for the issue.

It seems like we're generating incorrect code (technically sqlglot is).

It also seems like it's not possible to tell Trino not to flatten structs, which is incredibly annoying because it means sqlglot can't do the UNNEST rewrite it's doing without precise column information.

I really don't want to introduce a new API here just to allow Trino to do everything the other UNNEST-supporting backends support. Instead, I'd like to try and work with the various upstream projects to see if we can't get this addressed in some other way.

First, I'm going to explore whether giving sqlglot more information helps with it's transformation and then report a feature request/bug upstream if that doesn't help.

Ezibenroc commented 1 week ago

I just found this issue. I was not aware of this syntax, but apparently the struct flattening can be done without knowing the names of the struct fields

I am not sure if it helps, since the output would still be different from other backends: with trino you would have one column per struct field, while with other backends you would have a single column containing the structs.

cpcloud commented 1 week ago

I am not sure if it helps, since the output would still be different from other backends: with trino you would have one column per struct field, while with other backends you would have a single column containing the structs.

I think this is probably workable, but there's another issue that would defautl

Seems like the additional column added by WITH ORDINALITY doesn't get picked up when not specifying columns 😮‍💨

BUT there's a wretched workaround:

select *
from array_of_structs t0
cross join unnest(
  t0.my_nested_column,
  transform(
    sequence(0, cardinality(t0.my_nested_column) - 1),
    x -> cast(row(x) as row(idx int))
  )
) things

which produces

 some_col |           my_nested_column           | foo | bar | idx
----------+--------------------------------------+-----+-----+-----
        1 | [{foo=10, bar=20}, {foo=11, bar=21}] |  10 |  20 |   0
        1 | [{foo=10, bar=20}, {foo=11, bar=21}] |  11 |  21 |   1
        2 | [{foo=12, bar=22}]                   |  12 |  22 |   0
(3 rows)

on the trino CLI.

I will get the discussion started over on the sqlglot side later today or tomorrow.

cpcloud commented 1 week ago

The sequence function is also kneecapped at 10k elements so there's really no free lunch here.