pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Feature request: `explode` option/alternative that does not map empty lists to `null` #14381

Open Gattocrucco opened 7 months ago

Gattocrucco commented 7 months ago

Reproducible example

import polars as pl

df = pl.select(pl.lit([]).explode())
print(len(df))    # -> 1, should be 0
print(df.item())  # -> None (null value)

df2 = pl.select(pl.lit([]).alias('a')).explode('a')
print(len(df2))   # -> 1, should be 0
print(df2.item()) # -> None (null value)
1
None
1
None

Log output

No response

Issue description

Expr.explode() and DataFrame.explode() convert empty lists to a list containing a single null value. This changes the length of the sequence.

(Is this related to #6723?)

Expected behavior

I strongly expect explode to preserve the length of the input list in the output series.
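The expected invariant can be sketched in plain Python (no polars; the function name is hypothetical):

```python
# Length-preserving explode: the output length equals the sum of the input
# list lengths, so an empty list contributes zero rows instead of one null row.
def explode_preserving(lists):
    return [x for sub in lists for x in sub]

data = [[1, 2], [3], []]
assert explode_preserving(data) == [1, 2, 3]
assert len(explode_preserving(data)) == sum(len(sub) for sub in data)
```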

Installed versions

```
--------Version info---------
Polars:               0.20.7
Index type:           UInt32
Platform:             macOS-14.2.1-arm64-arm-64bit
Python:               3.12.1 | packaged by conda-forge | (main, Dec 23 2023, 08:01:35) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
hvplot:
matplotlib:           3.8.2
numpy:                1.26.4
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```
ritchie46 commented 7 months ago

pl.lit([]) produces a list with a single [] element. Empty list elements explode to null. This is expected behavior.

Gattocrucco commented 7 months ago

I find this very counterintuitive, because the length changes. Could you add a red warning box to the documentation of explode explaining this?

ritchie46 commented 7 months ago

The length doesn't change:

pl.select(
    pl.lit([]).alias("literal"),
    pl.lit([]).explode().alias("exploded"),
)
shape: (1, 2)
┌────────────┬──────────┐
│ literal    ┆ exploded │
│ ---        ┆ ---      │
│ list[null] ┆ null     │
╞════════════╪══════════╡
│ []         ┆ null     │
└────────────┴──────────┘
mcrumiller commented 7 months ago

@ritchie46 but it does: we're talking about the total number of elements in the lists. In your example you went from 0 total elements to 1. See here:

# 5 total elements
>>> s = pl.Series([[1, 1], [1, 1, 1], []])
>>> s.list.len()
shape: (3,)
Series: '' [u32]
[
        2
        3       <-- there are five items
        0
]
>>> s.explode()
shape: (6,)      <-- there are six items
Series: '' [i64]
[
        1
        1
        1
        1
        1
        null
]
Gattocrucco commented 7 months ago

(EDIT: replying to @ritchie46, not @mcrumiller)

When I say that "the length changes", I'm not referring to the length of the output series compared across the two cases where I explode or not. I'm referring to the length of the lists that are exploded. For all other list lengths but 0, the length of the output series is equal to the sum of the lengths of the lists in the input series.

Consider this example:

pl.DataFrame({
    'a': [[1], [2, 3]],
    'b': [[4, 5], [6]]
}).select(pl.all().explode())
shape: (3, 2)
┌─────┬─────┐
│ a   ┆ b   │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1   ┆ 4   │
│ 2   ┆ 5   │
│ 3   ┆ 6   │
└─────┴─────┘

Now, consider this instead:

pl.DataFrame({
    'a': [[1], [2, 3]],
    'b': [[4, 5, 6], []]
}).select(pl.all().explode())
ComputeError: Series length 3 doesn't match the DataFrame height of 4

In this second case, the explosion of [[4,5,6],[]] has produced [4,5,6,null], which does not match the length of the other series, leading to an error.

This behavior breaks the general invariant len(output) == sum(map(len, input)).

Of course it is not a law of physics that len(output) == sum(map(len, input)); it is a possible convention. But this convention plays nicely with loops over sequential indices, and with implode-operate-explode patterns.
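A pure-Python sketch (no polars; helper names are hypothetical) of how the length-preserving convention makes implode-operate-explode round-trip, including on empty input:

```python
def implode(seq):
    # Wrap the whole series into a single list row, as polars' implode does.
    return [list(seq)]

def explode_preserving(lists):
    # Length-preserving explode: empty lists contribute zero rows.
    return [x for sub in lists for x in sub]

# explode(implode(X)) == X holds even when X is empty; under the
# explode-empty-to-null convention the empty case would yield [None] instead.
assert explode_preserving(implode([])) == []
assert explode_preserving(implode([1, 2])) == [1, 2]
```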

I've found the following clearer example of the inconsistency. Consider these two alternative ways of producing the same dataframe:

WAY 1:

pl.DataFrame({'a':[]}).select(pl.all().implode())
shape: (1, 1)
┌────────────┐
│ a          │
│ ---        │
│ list[null] │
╞════════════╡
│ []         │
└────────────┘

WAY 2:

pl.select(pl.lit([]).alias('a'))
shape: (1, 1)
┌────────────┐
│ a          │
│ ---        │
│ list[null] │
╞════════════╡
│ []         │
└────────────┘

Now I explode both dataframes:

WAY 1:

pl.DataFrame({'a':[]}).select(pl.all().implode()).explode('a')
shape: (0, 1)
┌──────┐
│ a    │
│ ---  │
│ null │
╞══════╡
└──────┘

WAY 2:

pl.select(pl.lit([]).alias('a')).explode('a')
shape: (1, 1)
┌──────┐
│ a    │
│ ---  │
│ null │
╞══════╡
│ null │
└──────┘

The inputs look identical, and I have invoked the same operation on both; yet the outputs are different!

Now, you could argue (style not to scale)

Ah! You have only but shown that there is some hidden trickery in how implode operates; some hidden state is being propagated, that makes the series "remember" that it ought to explode back to an initial empty list it originated from, rather than following the usual convention of exploding empty lists to null values. The solution to the inconsistency would be to remove such rigging to restore purity of the operations, rather than to change the convention that empty explodes to null.

But I think that the invariant explode(implode(X)) = X takes precedence over explode(empty) = null. Indeed, the fact that some shadowy mechanism has been put in place to satisfy the former in the face of the latter goes to show that the intent was already there.

mcrumiller commented 7 months ago

Here is an even better example: the empty list [] and the list [null] are not the same thing. An empty list exploded should produce 0 items, not a single null item.

>>> s = pl.Series([[], [None]], dtype=pl.List(pl.UInt8))
>>> s.list.len()
shape: (2,)
Series: '' [u32]
[
        0
        1
]

or alternatively, two different series produce the same exploded output:

>>> pl.Series([[]]).explode()
shape: (1,)
Series: '' [null]
[
        null
]
>>> pl.Series([[None]]).explode()
shape: (1,)
Series: '' [null]
[
        null
]
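The information loss above can be modeled in plain Python (no polars; the helper name is hypothetical):

```python
def explode_current(lists):
    # Models the current convention: an empty list yields one null row.
    return [x for sub in lists for x in (sub if sub else [None])]

# [] and [None] collapse to the same output, so the distinction is lost
# and explode is not invertible on these inputs.
assert explode_current([[]]) == [None]
assert explode_current([[None]]) == [None]
```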
nameexhaustion commented 7 months ago

This should probably be labelled as an enhancement request "Empty lists should be filtered out during explode rather than becoming NULL rows" instead of a bug. The current behavior is aligned with what pandas does - though they also have an open issue requesting the same change in behavior https://github.com/pandas-dev/pandas/issues/39915.

Interestingly, DuckDB's unnest does filter out empty lists instead of creating NULL rows, and it also filters out NULL input rows:

>>> df = pl.DataFrame({"x": [[1], [None], [2, 3], [], None]}).with_row_index()
>>> df
shape: (5, 2)
┌───────┬───────────┐
│ index ┆ x         │
│ ---   ┆ ---       │
│ u32   ┆ list[i64] │
╞═══════╪═══════════╡
│ 0     ┆ [1]       │
│ 1     ┆ [null]    │
│ 2     ┆ [2, 3]    │
│ 3     ┆ []        │
│ 4     ┆ null      │
└───────┴───────────┘
>>> df.explode("x")
shape: (6, 2)
┌───────┬──────┐
│ index ┆ x    │
│ ---   ┆ ---  │
│ u32   ┆ i64  │
╞═══════╪══════╡
│ 0     ┆ 1    │
│ 1     ┆ null │
│ 2     ┆ 2    │
│ 2     ┆ 3    │
│ 3     ┆ null │
│ 4     ┆ null │
└───────┴──────┘
>>> duckdb.sql("select index, unnest(x) from df")
┌────────┬───────────┐
│ index  │ unnest(x) │
│ uint32 │   int64   │
├────────┼───────────┤
│      0 │         1 │
│      1 │      NULL │
│      2 │         2 │
│      2 │         3 │
└────────┴───────────┘

Currently, users can get that behavior by applying filter(list.len() > 0) before exploding. Making it the default would be a breaking API change, and would add to the list of "bewares" for users coming to polars from a pandas background. But there is definitely a good point about wanting to maintain list.len().sum() == explode().height.
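The filter-then-explode workaround can be modeled in plain Python (no polars; helper names are hypothetical):

```python
def explode_with_nulls(lists):
    # Current behavior: an empty list becomes a single null row.
    return [x for sub in lists for x in (sub if sub else [None])]

def filter_then_explode(lists):
    # The workaround: drop empty lists first, so no spurious null rows appear.
    return [x for sub in lists if len(sub) > 0 for x in sub]

data = [[1], [None], [2, 3], []]
assert explode_with_nulls(data) == [1, None, 2, 3, None]
assert filter_then_explode(data) == [1, None, 2, 3]
```

Note that genuine null elements inside lists (like [None] above) survive the filter; only empty lists are dropped.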

Gattocrucco commented 7 months ago

Shall I open a separate issue for the definitely-a-bug that empty lists created by imploding empty series then explode back to empty series?

Gattocrucco commented 7 months ago

Making this the default would cause a breaking change in the API

Maybe a new method .catlists() with the new behavior, and .explode() left as-is for pandasers?

mcrumiller commented 7 months ago

Here is how I understand lists to be stored internally: as a single flat array of values, plus offsets:

>>> pl.Series([
    ["a", "b", "c"],            # list1
    ["d", "e", "f"],            # list2
    [],                         # list3
    ["g", "h", "i", "j", "k"],  # list4
])
offset | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
-------|---|---|---|---|---|---|---|---|---|---|---|
value  | a | b | c | d | e | f | g | h | i | j | k |
         ^           ^           ^
      list1        list2       list3
                               list4

list1 starts at offset 0 and has length 3. list2 starts at offset 3 and has length 3. list3 starts at offset 6 and has length 0. list4 also starts at offset 6 and has length 5.

An explode() op should simply emit no items for an empty list, which is functionally the same as filtering, but an explicit filter seems unnecessary. I have not looked at the implementation, though.
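The offsets layout above can be sketched in plain Python (no polars; a rough model of Arrow-style list storage, function name hypothetical):

```python
# Flat values buffer plus offsets: list i spans values[offsets[i]:offsets[i+1]].
values = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k"]
offsets = [0, 3, 6, 6, 11]  # list3 is empty: offsets[2] == offsets[3]

def explode_from_offsets(values, offsets):
    # Walking the offsets naturally yields zero items for an empty list,
    # with no explicit filtering step needed.
    return [values[j] for i in range(len(offsets) - 1)
            for j in range(offsets[i], offsets[i + 1])]

assert explode_from_offsets(values, offsets) == values
```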

avimallu commented 7 months ago

I completely understand the consistency requirement for list.len().sum() == explode().height. I did face something like this before, and I was glad that the behavior is the way it is now. For example, let's take the case:

df = pl.DataFrame({
    "id": [5, 6, 7],
    "name": ["Jerry", "Kramer", "George"],
    "professions": [
        ["Comedian"], [], ["Car Parker", "Realtor", "Sales Man", "Architect"]]
})

If I want to expand all the professions that these people have had, I'd do:

df.explode("professions")

What if, in a downstream operation, I needed to count the number of people who are there in the DataFrame?

df.explode("professions").select(pl.col("name").n_unique())

Now this is simply a set of characters in a TV show. What if this was a critical DataFrame that I needed to explode the rows for and use it in conjunction with another column where the Null associated with Kramer meant something?

I think it's harder to recover the information associated with Kramer than it would be to filter it out, and I would need to know beforehand that Polars filters out these cases, which is unintuitive in this scenario. It does, however, make perfect sense in the scenario this issue was opened with.

mcrumiller commented 7 months ago

@avimallu had to print your dfs to see what was going on here:

>>> df
shape: (3, 3)
┌─────┬────────┬─────────────────────────────────────────────────────┐
│ id  ┆ name   ┆ professions                                         │
│ --- ┆ ---    ┆ ---                                                 │
│ i64 ┆ str    ┆ list[str]                                           │
╞═════╪════════╪═════════════════════════════════════════════════════╡
│ 5   ┆ Jerry  ┆ ["Comedian"]                                        │
│ 6   ┆ Kramer ┆ []                                                  │
│ 7   ┆ George ┆ ["Car Parker", "Realtor", "Sales Man", "Architect"] │
└─────┴────────┴─────────────────────────────────────────────────────┘
>>> df.explode("professions")
shape: (6, 3)
┌─────┬────────┬─────────────┐
│ id  ┆ name   ┆ professions │
│ --- ┆ ---    ┆ ---         │
│ i64 ┆ str    ┆ str         │
╞═════╪════════╪═════════════╡
│ 5   ┆ Jerry  ┆ Comedian    │
│ 6   ┆ Kramer ┆ null        │
│ 7   ┆ George ┆ Car Parker  │
│ 7   ┆ George ┆ Realtor     │
│ 7   ┆ George ┆ Sales Man   │
│ 7   ┆ George ┆ Architect   │
└─────┴────────┴─────────────┘

This makes sense to me, and doesn't require two ways of explode functioning. The reason is that df.explode("professions") is basically shorthand for left-joining the dataframe (minus the exploded column) with the exploded column, keyed on row number. And because row 2 (your id = 6) does not join with any list elements, it gets assigned a null.
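The left-join framing can be sketched in plain Python (no polars; names are hypothetical):

```python
# Model explode("professions") as a left join on row number: rows whose list
# is empty match nothing on the right side and therefore pick up a null.
rows = [("Jerry", ["Comedian"]), ("Kramer", []), ("George", ["Realtor", "Architect"])]

# Right side of the join: (row_number, element) pairs from the exploded column.
exploded = [(i, x) for i, (_, lst) in enumerate(rows) for x in lst]

joined = []
for i, (name, _) in enumerate(rows):
    matches = [x for j, x in exploded if j == i]
    for x in (matches or [None]):   # unmatched left rows keep a null
        joined.append((name, x))

assert joined == [("Jerry", "Comedian"), ("Kramer", None),
                  ("George", "Realtor"), ("George", "Architect")]
```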

ritchie46 commented 7 months ago

Can we rephrase this title as a feature request for an explode option that doesn't include nulls for empty lists?

anlagbr commented 6 months ago

I personally use df["column"].explode().drop_nulls()