Closed dimitri-b closed 1 year ago
@dimitri-b Thanks for the issue!
I think this may be because polars doesn't support more than one explode/unnest in a single select statement.
I'm not entirely sure how we want to address this, ideally a fix upstream is possible if indeed this isn't supported in polars.
@ritchie46 Is it possible to have multiple explode/unnest expressions in a single select?
@cpcloud could you give a MWE of what you translate to in polars atm and what you want to do?
@ritchie46 I would like the following code to work:
In [112]: import polars as pl
In [113]: df = pl.read_csv(
...: "https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz"
...: )
In [114]: df.select(pl.concat_list([pl.lit("2001"), pl.lit("2001")]).explode(), pl.list(["2000", "2001"]).explode())
Right now, it raises an exception:
ComputeError: series length 2 doesn't match the dataframe height of 1056
Ideally, the semantics are equivalent to this duckdb SQL:
SELECT
UNNEST(['2000', '2001']) AS year,
UNNEST([t0."2000", t0."2001"]) AS value
FROM read_csv_auto('https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz', header=true)
the `UNNEST`s (`explode` in polars-land) have equal cardinalities and are unnested pairwise. For each row of the source table in the `FROM` clause, and for each element position of the paired `UNNEST` calls, a new row is produced containing the corresponding array elements.
The simpler case of only unnesting literals has unexpected (to me) behavior:
In [129]: df.select(
...: pl.concat_list([pl.lit("2000"), pl.lit("2001")]).arr.explode(),
...: )
Out[129]:
shape: (2, 1)
┌─────────┐
│ literal │
│ --- │
│ str │
╞═════════╡
│ 2000 │
│ 2001 │
└─────────┘
I would expect to see `2 * len(df)` rows here, since I'm selecting from a table with `len(df)` rows.
~Ok, it looks like `with_column` sort of does the trick.~
~Ideally `df.select(x=...)` and `df.with_column(....alias('x'))` are equivalent, but maybe there's a reason they aren't!~
`with_columns` still raises an exception:
In [6]: df.with_columns(
...: year=pl.concat_list([pl.lit("2001"), pl.lit("2001")]).arr.explode(),
...: )
ShapeError: unable to add a column of length 2 to a dataframe of height 1056
Trying to construct the list first and then explode also doesn't work:
In [8]: df.with_columns(year=pl.concat_list([pl.lit("2001"), pl.lit("2001")])).with_columns(
...: pl.col("year").arr.explode()
...: )
ShapeError: unable to add a column of length 2112 to a dataframe of height 1056
Hm, but somehow `select` works?
In [10]: df.with_columns(year=pl.concat_list([pl.lit("2001"), pl.lit("2001")])).select(
...: pl.col("year").arr.explode()
...: )
Out[10]:
shape: (2112, 1)
┌──────┐
│ year │
│ --- │
│ str │
╞══════╡
│ 2001 │
│ 2001 │
│ 2001 │
│ 2001 │
│ … │
│ 2001 │
│ 2001 │
│ 2001 │
│ 2001 │
└──────┘
The single `select` works because it is the only selected column: it flattens the array, producing a column of 2x `df.height` length. So if you selected any other column alongside it, it would produce an error. If you want to explode all the data, you should explode on the `DataFrame` itself:
(df.with_columns(
year=pl.concat_list([pl.lit("2001"), pl.lit("2001")]))
.explode("year")
)
This is because polars expressions are not allowed to depend on any other state but their input.
I can't seem to get anything to work 😅
In [7]: df.with_columns(
...: year=pl.concat_list([pl.lit("2000"), pl.lit("2001")]),
...: metric=pl.list(["2000", "2001"]),
...: ).explode(["year", "metric"])
ComputeError: The name: 'metric' passed to `LazyFrame.with_columns` is duplicate
Error originated just after this operation:
DF ["country", "indicator", "2000", "2001"]; PROJECT */20 COLUMNS; SELECTION: "None"
Not sure where the duplicate `metric` error is coming from:
In [8]: df.columns
Out[8]:
['country',
'indicator',
'2000',
'2001',
'2002',
'2003',
'2004',
'2005',
'2006',
'2007',
'2008',
'2009',
'2010',
'2011',
'2012',
'2013',
'2014',
'2015',
'2016',
'2017']
The error occurs without the subsequent `explode` call as well.
Try `concat_list`. I agree that this is a bit confusing. But `pl.list()` is syntactic sugar for `pl.col(..).list()`, and we apply expression-expansion logic when multiple columns are passed, so your expression `pl.list(["2000", "2001"])` expands to:
[
pl.col("2000").alias("metric"),
pl.col("2001").alias("metric"),
]
That selects two columns that don't exist and renames them both to `"metric"`, hence the duplication error.
@ritchie46 Circling back, is there a way to do what I'm asking for here (pairwise unnest)?
I'd be okay with a solution that requires calling `DataFrame.explode`, but as of now it seems like there just isn't a way to express simultaneous unnesting of array columns with the semantics of the SQL above.
I cannot get your duckdb example to work. Do you mean something like this?
(df.with_columns(
year = pl.Series([["2000", "2001"]]),
metric = pl.Series([["2000", "2001"]])
).explode("year", "metric")
.head()
)
shape: (5, 22)
┌─────────┬─────────────┬──────────┬──────────┬───┬──────────┬──────────┬──────┬────────┐
│ country ┆ indicator ┆ 2000 ┆ 2001 ┆ … ┆ 2016 ┆ 2017 ┆ year ┆ metric │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ str ┆ str │
╞═════════╪═════════════╪══════════╪══════════╪═══╪══════════╪══════════╪══════╪════════╡
│ ABW ┆ SP.URB.TOTL ┆ 42444.0 ┆ 43048.0 ┆ … ┆ 45275.0 ┆ 45572.0 ┆ 2000 ┆ 2000 │
│ ABW ┆ SP.URB.TOTL ┆ 42444.0 ┆ 43048.0 ┆ … ┆ 45275.0 ┆ 45572.0 ┆ 2001 ┆ 2001 │
│ ABW ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2000 ┆ 2000 │
│ ABW ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2001 ┆ 2001 │
│ ABW ┆ SP.POP.TOTL ┆ 90853.0 ┆ 92898.0 ┆ … ┆ 104822.0 ┆ 105264.0 ┆ 2000 ┆ 2000 │
└─────────┴─────────────┴──────────┴──────────┴───┴──────────┴──────────┴──────┴────────┘
> I cannot get your duckdb example to work.
Apologies, this works:
SELECT
UNNEST(['2000', '2001']) AS year,
UNNEST([t0."2000", t0."2001"]) AS value
FROM read_csv_auto('https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz', header=true) t0
I was missing the `t0` alias on the `FROM` expression.
> Do you mean something like this?
No, I want to unnest the column names and the contents of those named columns.
Right:
df.melt(
value_vars=["2000", "2001"],
variable_name="year",
value_name="value"
)
Ok, so it's not possible without `melt` then?
We implement `pivot_longer` (a superset of `melt`) using ibis primitives instead of backend-specific functionality, so that the implementation works across as many backends as possible.
Polars is the only backend that supports arrays and unnesting for which `pivot_longer` doesn't work.
Is there any chance N-ary unnesting can be implemented upstream?
> Polars is the only backend that supports arrays and unnesting for which `pivot_longer` doesn't work.
I don't fully understand what you mean here. But we cannot support unnesting by expressions in `select`; that would not follow our expression rules.
We could support a `pivot_longer` method, but it is not really clear to me what the benefit of that would be over `melt`.
Right, going back to my SQL example:
SELECT
UNNEST(['2000', '2001']) AS year,
UNNEST([t0."2000", t0."2001"]) AS value
FROM read_csv_auto('https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz', header=true) t0
My goal is to understand the equivalent polars code for that, using some combination of `select`/`with_columns`/`explode`.
If there is no such code, can that be implemented upstream in a way that follows the polars expression rules?
I have to think about this a bit. I can see how a `pl.unnest([columns])` would work, but I don't see how a second `pl.unnest([names])` would match the output length of the unnested columns.
This is something that would not be allowed in our `select`/`with_columns` contexts, so it would need a `melt` or `pivot_longer` context in such a case.
> I have to think about this a bit. I can see how a `pl.unnest([columns])` would work. But I don't see how a second `pl.unnest([names])` would match the output length of the unnested columns.
The behavior is a pairwise unnest. This is well defined in both postgres and duckdb:
postgres, `SELECT` position:
ibis_testing=# select unnest(array[1, 2, 3]) x, unnest(array['a', 'b']) y;
x | y
---+---
1 | a
2 | b
3 |
(3 rows)
`FROM` position:
ibis_testing=# select * from unnest(array[1, 2, 3], array['a', 'b']) as _(x, y);
x | y
---+---
1 | a
2 | b
3 |
(3 rows)
duckdb (which only allows multiple unnests in `SELECT` position):
D select unnest([1, 2, 3]) x, unnest(['a', 'b']) y;
┌───────┬─────────┐
│ x │ y │
│ int32 │ varchar │
├───────┼─────────┤
│ 1 │ a │
│ 2 │ b │
│ 3 │ │
└───────┴─────────┘
Those states are not allowed in polars. We only allow columns with the same length, with the exception of unit lengths, which are broadcast to the result length.
What we could do is a single `pl.unnest` expression that would accept inner `unnest` expressions.
That would then return a `struct`, as polars expressions are only allowed to produce a single column. That struct could then be turned into a DataFrame later?
It doesn't map entirely onto the duckdb and postgres examples. We can get the same result, but not in a single `select`.
@ritchie46 Coming back to this. Is there a way I can construct an array from another expression?
Something like `df.with_columns([pl.col("my_column")])` or `df.with_columns(pl.col(["my_column"]))`?
What do you mean? Converting an integer column to a `List<i64>`?
Yep!
pl.DataFrame({
"a": [1, 2],
}).select(
list_1 = pl.col("a").reshape((-1, 1)),
list_2_broadcasted_to_df_height = pl.col("a").reshape((1, -1)),
)
shape: (2, 2)
┌───────────┬─────────────────────────────────┐
│ list_1 ┆ list_2_broadcasted_to_df_height │
│ --- ┆ --- │
│ list[i64] ┆ list[i64] │
╞═══════════╪═════════════════════════════════╡
│ [1] ┆ [1, 2] │
│ [2] ┆ [1, 2] │
└───────────┴─────────────────────────────────┘
Ok, awesome! I think this gives me enough to finally be able to address this issue:
In [21]: import polars as pl
In [22]: df = pl.read_csv('/home/cloud/.cache/ibis-framework/6.0.0/world_bank_pop.csv.gz')
In [23]: df.with_columns(year=pl.Series([["2000", "2001"]]), metric=pl.col("x2000").reshape((-1, 1)).list.concat(pl.col("x2001").reshape((-1, 1)))).explode("year", "metric")
Out[23]:
shape: (2_112, 22)
┌─────────┬─────────────┬─────────────┬─────────────┬───┬─────────────┬─────────────┬──────┬─────────────┐
│ country ┆ indicator ┆ x2000 ┆ x2001 ┆ … ┆ x2016 ┆ x2017 ┆ year ┆ metric │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ str ┆ f64 │
╞═════════╪═════════════╪═════════════╪═════════════╪═══╪═════════════╪═════════════╪══════╪═════════════╡
│ ABW ┆ SP.URB.TOTL ┆ 42444.0 ┆ 43048.0 ┆ … ┆ 45275.0 ┆ 45572.0 ┆ 2000 ┆ 42444.0 │
│ ABW ┆ SP.URB.TOTL ┆ 42444.0 ┆ 43048.0 ┆ … ┆ 45275.0 ┆ 45572.0 ┆ 2001 ┆ 43048.0 │
│ ABW ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2000 ┆ 1.182632 │
│ ABW ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2001 ┆ 1.413021 │
│ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … ┆ … │
│ ZWE ┆ SP.POP.TOTL ┆ 1.2222251e7 ┆ 1.2366165e7 ┆ … ┆ 1.6150362e7 ┆ 1.6529904e7 ┆ 2000 ┆ 1.2222251e7 │
│ ZWE ┆ SP.POP.TOTL ┆ 1.2222251e7 ┆ 1.2366165e7 ┆ … ┆ 1.6150362e7 ┆ 1.6529904e7 ┆ 2001 ┆ 1.2366165e7 │
│ ZWE ┆ SP.POP.GROW ┆ 1.298782 ┆ 1.170597 ┆ … ┆ 2.33607 ┆ 2.322864 ┆ 2000 ┆ 1.298782 │
│ ZWE ┆ SP.POP.GROW ┆ 1.298782 ┆ 1.170597 ┆ … ┆ 2.33607 ┆ 2.322864 ┆ 2001 ┆ 1.170597 │
└─────────┴─────────────┴─────────────┴─────────────┴───┴─────────────┴─────────────┴──────┴─────────────┘
Will report back if I still need some more help!
One thing that could be useful is something like a `pl.make_list(x)` or `pl.list(x)` that is effectively `pl.col(x).reshape((-1, 1))`. "Reshape" is just not a verb that comes to mind when constructing a list from a column 🙂
There is also `implode` as the counterpart of `explode`, which makes a list. But then you cannot control the shape.
What happened?
When attempting to pivot a table with more than one value in the 'retained' (category) column, the error below is thrown when using the Polars backend, but the same code works fine with the DuckDB backend. It also works fine with Polars when there is only one value in the retained column.
Code to reproduce:
What version of ibis are you using?
5.0.0
What backend(s) are you using, if any?
DuckDB, Polars