ibis-project / ibis

the portable Python dataframe library
https://ibis-project.org
Apache License 2.0

bug: error on `pivot_longer` of more than one category, using Polars backend #5795

Closed dimitri-b closed 1 year ago

dimitri-b commented 1 year ago

What happened?

When pivoting a table that has more than one value in the 'retained' (category) column, the Polars backend throws the error below, while the DuckDB backend works fine. The same code also works with the Polars backend if the retained column contains only one value.

Code to reproduce:

"""Polars backend throws an error if more than one category is pivoted."""

import ibis
from ibis import _
import ibis.expr.selectors as s
import pandas as pd

data = {
    "product": ["apples", "oranges"],
    "price": [4, 7],
    "qty": [42, 84]
}

df = pd.DataFrame(data)

conn_duck = ibis.duckdb.connect()
conn_polars = ibis.polars.connect()

conn_duck.register(df, table_name="fruit")
conn_polars.register(df, table_name="fruit")

def pivot_duckdb():
    """Works fine for two categories."""
    t = conn_duck.table("fruit")
    return t.pivot_longer(~s.c("product"))

def pivot_polars_error():
    """ERROR !!!"""
    t = conn_polars.table("fruit")
    return t.pivot_longer(~s.c("product"))

def pivot_polars_ok():
    """Works fine if limited to a single category."""
    t = conn_polars.table("fruit").filter(_["product"] == "apples")
    return t.pivot_longer(~s.c("product"))

if __name__ == '__main__':
    print(pivot_duckdb().execute())  # works OK, two categories
    print(pivot_polars_ok().execute())  # works OK, single category
    print(pivot_polars_error().execute())  # throws error for more than one category

What version of ibis are you using?

5.0.0

What backend(s) are you using, if any?

DuckDB, Polars

Relevant log output

Traceback (most recent call last):
  File "/Users/redacted/PycharmProjects/ibis-demo/bugs/polars_pivot.py", line 41, in <module>
    print(pivot_polars_error().execute())
  File "/Users/redacted/.pyenv/versions/ibis-experiments/lib/python3.10/site-packages/ibis/expr/types/core.py", line 303, in execute
    return self._find_backend(use_default=True).execute(
  File "/Users/redacted/.pyenv/versions/ibis-experiments/lib/python3.10/site-packages/ibis/backends/polars/__init__.py", line 328, in execute
    df = lf.collect()
  File "/Users/redacted/.pyenv/versions/ibis-experiments/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 1443, in collect
    return pli.wrap_df(ldf.collect())
exceptions.ComputeError: series length 2 doesn't match the dataframe height of 4


cpcloud commented 1 year ago

@dimitri-b Thanks for the issue!

I think this may be because polars doesn't support more than one explode/unnest in a single select statement.

I'm not entirely sure how we want to address this; ideally a fix upstream is possible, if this indeed isn't supported in polars.

@ritchie46 Is it possible to have multiple explode/unnest expressions in a single select?

ritchie46 commented 1 year ago

@cpcloud could you give a MWE of what you translate to in polars atm and what you want to do?

cpcloud commented 1 year ago

@ritchie46 I would like the following code to work:

In [112]: import polars as pl

In [113]: df = pl.read_csv(
     ...:     "https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz"
     ...: )

In [114]: df.select(pl.concat_list([pl.lit("2001"), pl.lit("2001")]).explode(), pl.list(["2000", "2001"]).explode())

Right now, it raises an exception:

ComputeError: series length 2 doesn't match the dataframe height of 1056

Ideally, the semantics are equivalent to this duckdb SQL:

SELECT
  UNNEST(['2000', '2001']) AS year,
  UNNEST([t0."2000", t0."2001"]) AS value
FROM read_csv_auto('https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz', header=true)

the unnests (explode in polars-land) have equal cardinalities and are unnested pairwise.

For each row of the source table in the FROM clause, a new output row is produced for each element position of the pair of unnest calls, with the corresponding array elements as values.
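Those pairwise semantics can be sketched in plain Python (a hypothetical helper, not part of polars or duckdb), including the NULL padding that shows up in the postgres/duckdb examples later in the thread:

```python
from itertools import zip_longest

def pairwise_unnest(*lists):
    # Unnest element-by-element across all lists at once; shorter
    # lists are padded with None, mirroring SQL's pairwise UNNEST.
    return list(zip_longest(*lists, fillvalue=None))

pairwise_unnest([1, 2, 3], ["a", "b"])
# → [(1, 'a'), (2, 'b'), (3, None)]
```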

The simpler case of only unnesting literals has unexpected (to me) behavior:

In [129]: df.select(
     ...:     pl.concat_list([pl.lit("2000"), pl.lit("2001")]).arr.explode(),
     ...: )
Out[129]:
shape: (2, 1)
┌─────────┐
│ literal │
│ ---     │
│ str     │
╞═════════╡
│ 2000    │
│ 2001    │
└─────────┘

I would expect to see 2 * len(df) here since I'm selecting from a table with len(df) rows.

cpcloud commented 1 year ago

~Ok, it looks like with_column sort of does the trick.~

~Ideally df.select(x=...) and df.with_column(....alias('x')) are equivalent, but maybe there's a reason they aren't!~

with_columns still raises an exception:

In [6]: df.with_columns(
   ...:     year=pl.concat_list([pl.lit("2001"), pl.lit("2001")]).arr.explode(),
   ...: )
ShapeError: unable to add a column of length 2 to a dataframe of height 1056
cpcloud commented 1 year ago

Trying to construct the list first and then explode also doesn't work:

In [8]: df.with_columns(year=pl.concat_list([pl.lit("2001"), pl.lit("2001")])).with_columns(
   ...:     pl.col("year").arr.explode()
   ...: )
ShapeError: unable to add a column of length 2112 to a dataframe of height 1056
cpcloud commented 1 year ago

Hm, but somehow select works?

In [10]: df.with_columns(year=pl.concat_list([pl.lit("2001"), pl.lit("2001")])).select(
    ...:     pl.col("year").arr.explode()
    ...: )
Out[10]:
shape: (2112, 1)
┌──────┐
│ year │
│ ---  │
│ str  │
╞══════╡
│ 2001 │
│ 2001 │
│ 2001 │
│ 2001 │
│ …    │
│ 2001 │
│ 2001 │
│ 2001 │
│ 2001 │
└──────┘
ritchie46 commented 1 year ago

The single select works because it is the only selected column. It flattens the array, but the result has 2x df.height rows, so selecting any other column alongside it would produce an error. If you want to explode all the data, you should explode on the DataFrame itself:

(df.with_columns(
    year=pl.concat_list([pl.lit("2001"), pl.lit("2001")]))
    .explode("year")
)

This is because polars expressions are not allowed to depend on any other state but their input.

cpcloud commented 1 year ago

I can't seem to be able to get anything to work 😅

In [7]: df.with_columns(
   ...:     year=pl.concat_list([pl.lit("2000"), pl.lit("2001")]),
   ...:     metric=pl.list(["2000", "2001"]),
   ...: ).explode(["year", "metric"])
ComputeError: The name: 'metric' passed to `LazyFrame.with_columns` is duplicate

Error originated just after this operation:
DF ["country", "indicator", "2000", "2001"]; PROJECT */20 COLUMNS; SELECTION: "None"

Not sure where the duplicate metric error is coming from:

In [8]: df.columns
Out[8]:
['country',
 'indicator',
 '2000',
 '2001',
 '2002',
 '2003',
 '2004',
 '2005',
 '2006',
 '2007',
 '2008',
 '2009',
 '2010',
 '2011',
 '2012',
 '2013',
 '2014',
 '2015',
 '2016',
 '2017']

The error occurs without the subsequent explode call as well.

ritchie46 commented 1 year ago

Try concat_list. I agree that this is a bit confusing. But pl.list() is syntactic sugar for pl.col(..).list(), and we have expression-expansion logic when multiple columns are passed, so your expression pl.list(["2000", "2001"]) expands to:

[
    pl.col("2000").alias("metric"),
    pl.col("2001").alias("metric"),
]

That selects the two columns and renames both to "metric", hence the duplicate-name error.

cpcloud commented 1 year ago

@ritchie46 Circling back, is there a way to do what I'm asking for here (pairwise unnest)?

I'd be okay with a solution that requires calling DataFrame.explode, but as of now it seems like there just isn't a way to express simultaneous unnesting of array columns with the semantics of the SQL above.

ritchie46 commented 1 year ago

I cannot get your duckdb example to work. Do you mean something like this?

(df.with_columns(
    year = pl.Series([["2000", "2001"]]),
    metric = pl.Series([["2000", "2001"]])
).explode("year", "metric")
 .head()
)
shape: (5, 22)
┌─────────┬─────────────┬──────────┬──────────┬───┬──────────┬──────────┬──────┬────────┐
│ country ┆ indicator   ┆ 2000     ┆ 2001     ┆ … ┆ 2016     ┆ 2017     ┆ year ┆ metric │
│ ---     ┆ ---         ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---      ┆ ---  ┆ ---    │
│ str     ┆ str         ┆ f64      ┆ f64      ┆   ┆ f64      ┆ f64      ┆ str  ┆ str    │
╞═════════╪═════════════╪══════════╪══════════╪═══╪══════════╪══════════╪══════╪════════╡
│ ABW     ┆ SP.URB.TOTL ┆ 42444.0  ┆ 43048.0  ┆ … ┆ 45275.0  ┆ 45572.0  ┆ 2000 ┆ 2000   │
│ ABW     ┆ SP.URB.TOTL ┆ 42444.0  ┆ 43048.0  ┆ … ┆ 45275.0  ┆ 45572.0  ┆ 2001 ┆ 2001   │
│ ABW     ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2000 ┆ 2000   │
│ ABW     ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2001 ┆ 2001   │
│ ABW     ┆ SP.POP.TOTL ┆ 90853.0  ┆ 92898.0  ┆ … ┆ 104822.0 ┆ 105264.0 ┆ 2000 ┆ 2000   │
└─────────┴─────────────┴──────────┴──────────┴───┴──────────┴──────────┴──────┴────────┘
cpcloud commented 1 year ago

I cannot get your duckdb example to work.

Apologies, this works:

SELECT
  UNNEST(['2000', '2001']) AS year,
  UNNEST([t0."2000", t0."2001"]) AS value
FROM read_csv_auto('https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz', header=true) t0

I was missing the t0 alias on the FROM expression.

Do you mean something like this?

No, I want to unnest the column names and the contents of those named columns.

(df.with_columns(
    year = pl.Series([["2000", "2001"]]),
    metric = pl.Series([["2000", "2001"]])
).explode("year", "metric")
 .head()
)
shape: (5, 22)
┌─────────┬─────────────┬──────────┬──────────┬───┬──────────┬──────────┬──────┬────────┐
│ country ┆ indicator   ┆ 2000     ┆ 2001     ┆ … ┆ 2016     ┆ 2017     ┆ year ┆ metric │
│ ---     ┆ ---         ┆ ---      ┆ ---      ┆   ┆ ---      ┆ ---      ┆ ---  ┆ ---    │
│ str     ┆ str         ┆ f64      ┆ f64      ┆   ┆ f64      ┆ f64      ┆ str  ┆ str    │
╞═════════╪═════════════╪══════════╪══════════╪═══╪══════════╪══════════╪══════╪════════╡
│ ABW     ┆ SP.URB.TOTL ┆ 42444.0  ┆ 43048.0  ┆ … ┆ 45275.0  ┆ 45572.0  ┆ 2000 ┆ 2000   │
│ ABW     ┆ SP.URB.TOTL ┆ 42444.0  ┆ 43048.0  ┆ … ┆ 45275.0  ┆ 45572.0  ┆ 2001 ┆ 2001   │
│ ABW     ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2000 ┆ 2000   │
│ ABW     ┆ SP.URB.GROW ┆ 1.182632 ┆ 1.413021 ┆ … ┆ 0.655929 ┆ 0.653849 ┆ 2001 ┆ 2001   │
│ ABW     ┆ SP.POP.TOTL ┆ 90853.0  ┆ 92898.0  ┆ … ┆ 104822.0 ┆ 105264.0 ┆ 2000 ┆ 2000   │
└─────────┴─────────────┴──────────┴──────────┴───┴──────────┴──────────┴──────┴────────┘
ritchie46 commented 1 year ago

Right:

df.melt(
    value_vars=["2000", "2001"],
    variable_name="year",
    value_name="value"
)
cpcloud commented 1 year ago

Ok, so it's not possible without melt then?

We implement pivot_longer (a superset of melt) using ibis primitives instead of backend-specific functionality so that the implementation works across as many backends as possible.

Polars is the only backend that supports arrays and unnesting that doesn't work with pivot_longer.

Is there any chance N-ary unnesting can be implemented upstream?

ritchie46 commented 1 year ago

Polars is the only backend that supports arrays and unnesting that doesn't work with pivot_longer.

I don't fully understand what you mean here. But we cannot support unnesting by expressions in select. That would not follow our expression rules.

We could support a pivot_longer method, but it is not really clear to me what the benefit of that would be over melt.

cpcloud commented 1 year ago

Right, going back to my SQL example:

SELECT
  UNNEST(['2000', '2001']) AS year,
  UNNEST([t0."2000", t0."2001"]) AS value
FROM read_csv_auto('https://storage.googleapis.com/ibis-examples/data/world_bank_pop_raw.csv.gz', header=true) t0

My goal is to understand what the equivalent polars code for that is, using some combination of select/with_columns/explode?

If there is no such code, can that be implemented upstream in a way that follows the polars expression rules?

ritchie46 commented 1 year ago

I have to think about this a bit. I can see how a pl.unnest([columns]) would work. But I don't see how a second pl.unnest([names]) would match the output length of the unnested columns.

This is something that would not be allowed in our select / with_columns contexts. So that would need a melt or pivot_longer context in such a case.

cpcloud commented 1 year ago

I have to think about this a bit. I can see how a pl.unnest([columns]) would work. But I don't see how a second pl.unnest([names]) would match the output length of the unnested columns.

The behavior is a pairwise unnest. This is well defined in both postgres and duckdb:

postgres:

SELECT position:

ibis_testing=# select unnest(array[1, 2, 3]) x, unnest(array['a', 'b']) y;
 x | y
---+---
 1 | a
 2 | b
 3 |
(3 rows)

FROM position:

ibis_testing=# select * from unnest(array[1, 2, 3], array['a', 'b']) as _(x, y);
 x | y
---+---
 1 | a
 2 | b
 3 |
(3 rows)

duckdb (only allows multiple unnests in SELECT position):

D select unnest([1, 2, 3]) x, unnest(['a', 'b']) y;
┌───────┬─────────┐
│   x   │    y    │
│ int32 │ varchar │
├───────┼─────────┤
│     1 │ a       │
│     2 │ b       │
│     3 │         │
└───────┴─────────┘
ritchie46 commented 1 year ago

Those shapes are not allowed in polars. We only allow columns of the same length, with the exception of unit-length columns, which are broadcast to the result length.

What we could do is a single pl.unnest expression that accepts inner unnest expressions.

That would then return a struct, as polars expressions are only allowed to produce a single column. That struct could then be turned into a DataFrame later?

It doesn't map exactly onto the duckdb and postgres examples. We can get the same result, but not in a single select.

cpcloud commented 1 year ago

@ritchie46 Coming back to this. Is there a way I can construct an array from another expression?

Something like

df.with_columns([pl.col("my_column")])
df.with_columns(pl.col(["my_column"]))

?

ritchie46 commented 1 year ago

What do you mean? Converting an integer column to a List<i64>?

cpcloud commented 1 year ago

Yep!

ritchie46 commented 1 year ago
pl.DataFrame({
    "a": [1, 2],
}).select(
    list_1 = pl.col("a").reshape((-1, 1)),
    list_2_broadcasted_to_df_height = pl.col("a").reshape((1, -1)),
)
shape: (2, 2)
┌───────────┬─────────────────────────────────┐
│ list_1    ┆ list_2_broadcasted_to_df_height │
│ ---       ┆ ---                             │
│ list[i64] ┆ list[i64]                       │
╞═══════════╪═════════════════════════════════╡
│ [1]       ┆ [1, 2]                          │
│ [2]       ┆ [1, 2]                          │
└───────────┴─────────────────────────────────┘
cpcloud commented 1 year ago

Ok, awesome! I think this gives me enough to finally be able to address this issue:

In [21]: import polars as pl

In [22]: df = pl.read_csv('/home/cloud/.cache/ibis-framework/6.0.0/world_bank_pop.csv.gz')

In [23]: df.with_columns(year=pl.Series([["2000", "2001"]]), metric=pl.col("x2000").reshape((-1, 1)).list.concat(pl.col("x2001").reshape((-1, 1)))).explode("year", "metric")
Out[23]:
shape: (2_112, 22)
┌─────────┬─────────────┬─────────────┬─────────────┬───┬─────────────┬─────────────┬──────┬─────────────┐
│ country ┆ indicator   ┆ x2000       ┆ x2001       ┆ … ┆ x2016       ┆ x2017       ┆ year ┆ metric      │
│ ---     ┆ ---         ┆ ---         ┆ ---         ┆   ┆ ---         ┆ ---         ┆ ---  ┆ ---         │
│ str     ┆ str         ┆ f64         ┆ f64         ┆   ┆ f64         ┆ f64         ┆ str  ┆ f64         │
╞═════════╪═════════════╪═════════════╪═════════════╪═══╪═════════════╪═════════════╪══════╪═════════════╡
│ ABW     ┆ SP.URB.TOTL ┆ 42444.0     ┆ 43048.0     ┆ … ┆ 45275.0     ┆ 45572.0     ┆ 2000 ┆ 42444.0     │
│ ABW     ┆ SP.URB.TOTL ┆ 42444.0     ┆ 43048.0     ┆ … ┆ 45275.0     ┆ 45572.0     ┆ 2001 ┆ 43048.0     │
│ ABW     ┆ SP.URB.GROW ┆ 1.182632    ┆ 1.413021    ┆ … ┆ 0.655929    ┆ 0.653849    ┆ 2000 ┆ 1.182632    │
│ ABW     ┆ SP.URB.GROW ┆ 1.182632    ┆ 1.413021    ┆ … ┆ 0.655929    ┆ 0.653849    ┆ 2001 ┆ 1.413021    │
│ …       ┆ …           ┆ …           ┆ …           ┆ … ┆ …           ┆ …           ┆ …    ┆ …           │
│ ZWE     ┆ SP.POP.TOTL ┆ 1.2222251e7 ┆ 1.2366165e7 ┆ … ┆ 1.6150362e7 ┆ 1.6529904e7 ┆ 2000 ┆ 1.2222251e7 │
│ ZWE     ┆ SP.POP.TOTL ┆ 1.2222251e7 ┆ 1.2366165e7 ┆ … ┆ 1.6150362e7 ┆ 1.6529904e7 ┆ 2001 ┆ 1.2366165e7 │
│ ZWE     ┆ SP.POP.GROW ┆ 1.298782    ┆ 1.170597    ┆ … ┆ 2.33607     ┆ 2.322864    ┆ 2000 ┆ 1.298782    │
│ ZWE     ┆ SP.POP.GROW ┆ 1.298782    ┆ 1.170597    ┆ … ┆ 2.33607     ┆ 2.322864    ┆ 2001 ┆ 1.170597    │
└─────────┴─────────────┴─────────────┴─────────────┴───┴─────────────┴─────────────┴──────┴─────────────┘

Will report back if I still need some more help!

cpcloud commented 1 year ago

One thing that could be useful is something like a pl.make_list(x) or pl.list(x) that is effectively pl.col(x).reshape((-1, 1)). "Reshape" is just not a verb that comes to mind when constructing a list from a column 🙂

ritchie46 commented 1 year ago

There is also implode as the counterpart of explode, which makes a list. But then you cannot control the shape.