pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.91k stars 1.93k forks source link

Incorrect result when using `map_elements` with a list of list return type #18703

Open heshamdar opened 1 month ago

heshamdar commented 1 month ago

Checks

Reproducible example

import polars as pl

def single_list(b):

    if b[0] == 0:
        return []

    return [0,1,2]

def double_list(b):

    if b[0] == 0:
        return [[]]

    return [[0,1,2]]

test = pl.DataFrame({
                     'b':[[0,0,0],[1, 256, 255]]}, 
                    schema={'b':pl.List(pl.Int64)})

result = test.with_columns(

    pl.col('b').map_elements(
        lambda x: single_list(x), 
        return_dtype=pl.List(pl.Int64), 
        skip_nulls=False).alias('single_list'),

    pl.col('b').map_elements(
        lambda x: single_list(x), 
        return_dtype=pl.List(pl.Int64), 
        skip_nulls=True).alias('single_list_skip_null'),

    pl.col('b').map_elements(
        lambda x: double_list(x), 
        return_dtype=pl.List(pl.List(pl.Int64)), 
        skip_nulls=False).alias('double_list'),

    pl.col('b').map_elements(
        lambda x: double_list(x), 
        return_dtype=pl.List(pl.List(pl.Int64)), 
        skip_nulls=True).alias('double_list_skip_null'),
)

print(result)

Log output

shape: (2, 5)
┌───────────────┬─────────────┬──────────────────────┬──────────────────────┬──────────────────────┐
│ b             ┆ single_list ┆ single_list_skip_nul ┆ double_list          ┆ double_list_skip_nul │
│ ---           ┆ ---         ┆ l                    ┆ ---                  ┆ l                    │
│ list[i64]     ┆ list[i64]   ┆ ---                  ┆ list[list[i64]]      ┆ ---                  │
│               ┆             ┆ list[i64]            ┆                      ┆ list[list[i64]]      │
╞═══════════════╪═════════════╪══════════════════════╪══════════════════════╪══════════════════════╡
│ [0, 0, 0]     ┆ []          ┆ []                   ┆ [[]]                 ┆ [[]]                 │
│ [1, 256, 255] ┆ [0, 1, 2]   ┆ [0, 1, 2]            ┆ [[null, null, null]] ┆ [[0, 1, 2]]          │
└───────────────┴─────────────┴──────────────────────┴──────────────────────┴──────────────────────┘

Issue description

When a map_elements operation returns a list of list and the skip_nulls argument is set to False the type of the first returned element seems to be used. In the case where this is an empty list, all subsequent values will have null values. This isn't the case if either

The latter is a bit confusing, since neither the input or the output should be null.

The example shows the the behaviour under the settings described above, with the only failing case shown in the column double_list

I think it's perhaps similar to this issue related to this fix, but is still an issue for list of lists (and probably deeper nesting) - https://github.com/pola-rs/polars/pull/18567

Expected behavior

Similar to the single list case, the type for the nested list should be the same as defined in the return_dtype argument, and not inferred based on the first result. Also it shouldn't matter whether the skip_nulls argument is True or False (maybe?)

Installed versions

``` --------Version info--------- Polars: 1.7.0 Index type: UInt32 Platform: macOS-14.2.1-arm64-arm-64bit Python: 3.10.4 (main, May 28 2022, 12:23:44) [Clang 13.1.6 (clang-1316.0.21.2.5)] ----Optional dependencies---- adbc_driver_manager altair cloudpickle connectorx deltalake fastexcel 0.10.4 fsspec gevent great_tables 0.10.0 matplotlib 3.9.1.post1 nest_asyncio 1.6.0 numpy 1.26.4 openpyxl pandas 2.2.2 pyarrow 16.1.0 pydantic pyiceberg sqlalchemy torch xlsx2csv xlsxwriter ```
elipinska commented 4 weeks ago

I've run into similar behaviour with a list of structs:

Reproducible example

import polars as pl

original_df = pl.DataFrame(
    [
        {
            "list_of_dicts": [
                {"foo": "bar", "qux": "quz"},
                {"foo": "baz", "qux": "quy"},
            ],
        },
        {
            "list_of_dicts": [
                {"foo": "bar", "qux": "quz"},
                {"foo": "baz", "qux": "quy"},
            ],
        },
        {
            "list_of_dicts": [
                {"foo": "bar", "qux": "quz"},
                {"foo": "baz", "qux": "quy"},
            ],
        },
    ]
)

transformed_df = original_df.with_columns(
    pl.col("list_of_dicts").map_elements(
        lambda structs: structs, return_dtype=pl.List(pl.Struct), skip_nulls=True
    )
)

print(transformed_df)

Log output

┌─────────────────┐
│ list_of_dicts   │
│ ---             │
│ list[struct[0]] │
╞═════════════════╡
│ []              │
│ []              │
│ []              │
└─────────────────┘

Versions of polars before 1.7.0 returned the list of structs correctly:

┌────────────────────────────────┐
│ list_of_dicts                  │
│ ---                            │
│ list[struct[2]]                │
╞════════════════════════════════╡
│ [{"bar","quz"}, {"baz","quy"}] │
│ [{"bar","quz"}, {"baz","quy"}] │
│ [{"bar","quz"}, {"baz","quy"}] │
└────────────────────────────────┘

Similarly to @heshamdar, I've noticed that setting skip_nulls=False when running the example with polars >=1.7.0 does return the lists of structs I'm expecting (though it's somewhat confusing why).

cmdlineluser commented 3 weeks ago

@elipinska That example also works without the return_dtype=

original_df.with_columns(
    pl.col("list_of_dicts").map_elements(
        lambda structs: structs
    )
)
# ┌────────────────────────────────┐
# │ list_of_dicts                  │
# │ ---                            │
# │ list[struct[2]]                │
# ╞════════════════════════════════╡
# │ [{"bar","quz"}, {"baz","quy"}] │
# │ [{"bar","quz"}, {"baz","quy"}] │
# │ [{"bar","quz"}, {"baz","quy"}] │
# └────────────────────────────────┘

It seems pl.Struct is casting to a struct of "0 fields".

(It's unclear to me if an empty pl.Struct should be allowed at all?)

cmdlineluser commented 3 weeks ago

I've traced the double_list example to here:

https://github.com/pola-rs/polars/blob/71a8b0543362aadb771c2a446ac93756ea36a124/crates/polars-python/src/series/map.rs#L89-L90

[crates/polars-python/src/series/map.rs:94:17] avs.clone() = [
    List(
        shape: (1,)
        Series: '' [list[null]]
        [
            []
        ],
    ),
    List(
        shape: (1,)
        Series: '' [list[i64]]
        [
            [0, 1, 2]
        ],
    ),
]

So it seems Series::new on Line 90 takes type of the first element and that's where the nulls are introduced.

I'm not sure if a different Series constructor is supposed to be used with return_dtype given instead?

Or explicitly cast the first element to return_dtype?

It seems 2 completely different codepaths are taken depending on the value of skip_nulls - it's a bit hard to follow - but that seems to be why there is such odd behaviour going on when it is set.