pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Wrong output when using map_elements with list of struct type #16003

Open DaiZack opened 5 months ago

DaiZack commented 5 months ago

Reproducible example

import polars as pl

def search_all(state, country):
    # Return list of dictionaries, each with a single "match" field
    return [
        {"match": {"country_code": country}},
        {"match": {"region_code": f"US-{state}"}},
    ]

df = pl.DataFrame(
    [
        {"state": "CA", "country": "USA"},
        {"state": "TX", "country": "USA"},
    ]
)

# Note: no return_dtype is passed here, so Polars infers the dtype from the returned values
df = df.with_columns(
    [
        pl.struct(["state", "country"])
        .map_elements(lambda s: search_all(**s))
        .alias("query"),
    ]
)

print(df)

Log output

shape: (2, 3)
┌───────┬─────────┬───────────────────────┐
│ state ┆ country ┆ query                 │
│ ---   ┆ ---     ┆ ---                   │
│ str   ┆ str     ┆ list[struct[1]]       │
╞═══════╪═════════╪═══════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA"}}, {{null}}] │
│ TX    ┆ USA     ┆ [{{"USA"}}, {{null}}] │
└───────┴─────────┴───────────────────────┘

Issue description

I was trying to generate an Elasticsearch query from a dataframe. The required format has a list in which the same "match" mapping holds different keys in each element. I also tried adding a return type with the nested struct, but the result was still wrong.

Expected behavior

import pandas as pd

def search_all(state, country):
    # Return list of dictionaries, each with a single "match" field
    return [
        {"match": {"country_code": country}},
        {"match": {"region_code": f"US-{state}"}},
    ]

dfp = pd.DataFrame(
    [
        {
            "state": "CA",
            "country": "USA",
        },
        {
            "state": "TX",
            "country": "USA",
        },
    ]
)

dfp.apply(lambda s: search_all(**s), axis=1)[0]
[{'match': {'country_code': 'USA'}}, {'match': {'region_code': 'US-CA'}}]

Installed versions

0.20.21

cmdlineluser commented 5 months ago

There seems to be a difference with how Series/DataFrames handle nested data - which I think causes this?

data = [
    {"match": {"country_code": "A"}},
    {"match": {"region_code":  "B"}}
]

pl.Series(data)
# shape: (2,)
# Series: '' [struct[1]]
# [
#   {{"A"}}
#   {{null}}
# ]

pl.DataFrame(data)
# shape: (2, 1)
# ┌────────────┐
# │ match      │
# │ ---        │
# │ struct[2]  │
# ╞════════════╡
# │ {"A",null} │
# │ {null,"B"} │
# └────────────┘

So I think the data is "gone" before return_dtype= comes into play - which is why it appears to do nothing in this case?

Perhaps pl.DataFrame().to_struct() could be a workaround.

def search_all(state, country):
    # Return list of dictionaries, each with a single "match" field
    return pl.DataFrame([
        {"match": {"country_code": country}},
        {"match": {"region_code": f"US-{state}"}},
    ]).to_struct()
DaiZack commented 5 months ago

The result is still the same

import polars as pl

def search_all(state, country):
    # Return list of dictionaries, each with a single "match" field
    return pl.DataFrame(
        [
            {"match": {"country_code": country}},
            {"match": {"region_code": f"US-{state}"}},
        ]
    ).to_struct()

df = pl.DataFrame(
    [
        {"state": "CA", "country": "USA"},
        {"state": "TX", "country": "USA"},
    ]
)

# Note: no return_dtype is passed here, so Polars infers the dtype from the returned values
df = df.with_columns(
    [
        pl.struct(["state", "country"])
        .map_elements(lambda s: search_all(**s))
        .alias("query"),
    ]
)

output:

shape: (2, 3)
┌───────┬─────────┬───────────────────────────────────┐
│ state ┆ country ┆ query                             │
│ ---   ┆ ---     ┆ ---                               │
│ str   ┆ str     ┆ list[struct[1]]                   │
╞═══════╪═════════╪═══════════════════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-CA"}… │
│ TX    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-TX"}… │
└───────┴─────────┴───────────────────────────────────┘

:1: MapWithoutReturnDtypeWarning: Calling `map_elements` without specifying `return_dtype` can lead to unpredictable results. Specify `return_dtype` to silence this warning.
  df = df.with_columns(

cmdlineluser commented 5 months ago

The result is still the same

The initial output is:

shape: (2, 3)
┌───────┬─────────┬───────────────────────┐
│ state ┆ country ┆ query                 │
│ ---   ┆ ---     ┆ ---                   │
│ str   ┆ str     ┆ list[struct[1]]       │
╞═══════╪═════════╪═══════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA"}}, {{null}}] │
│ TX    ┆ USA     ┆ [{{"USA"}}, {{null}}] │
└───────┴─────────┴───────────────────────┘

Wrapping it in a frame produces:

shape: (2, 3)
┌───────┬─────────┬────────────────────────────────────┐
│ state ┆ country ┆ query                              │
│ ---   ┆ ---     ┆ ---                                │
│ str   ┆ str     ┆ list[struct[1]]                    │
╞═══════╪═════════╪════════════════════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-CA"}}] │
│ TX    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-TX"}}] │
└───────┴─────────┴────────────────────────────────────┘
DaiZack commented 5 months ago

Sorry about the previous comment. The outputs are different, but still not what I expected (the pandas output):

[{'match': {'country_code': 'USA'}}, {'match': {'region_code': 'US-CA'}}]

It forces both country_code and region_code into every "match" and sets null for the one that does not exist.

I can work around it by converting the result to a JSON string, but the behaviour is still unexpected and may cause surprising results.

In this example, the format is defined by the Elasticsearch query. If nulls are added, the query will fail.

cmdlineluser commented 5 months ago

Oh right - Yeah, Polars differs from Pandas in that regard.

If you allow Polars to process the data as lists/structs then all values must have the same dtype:

>>> df.schema["query"]
List(Struct({'match': Struct({'country_code': String, 'region_code': String})}))
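
To illustrate why the nulls appear, here is a plain-Python sketch of what "same dtype" implies for struct fields. This is only a conceptual model, not how Polars implements it internally, and `unify_struct_fields` is a made-up helper name: every key seen in any element becomes a field, and elements missing that key get a null.

```python
def unify_struct_fields(rows):
    """Collect every key seen across rows, then fill missing keys with None."""
    fields = []
    for row in rows:
        for key in row:
            if key not in fields:
                fields.append(key)
    # Every row now has the same set of fields, in first-seen order
    return [{key: row.get(key) for key in fields} for row in rows]


matches = [{"country_code": "USA"}, {"region_code": "US-CA"}]
unified = unify_struct_fields(matches)
print(unified)
# [{'country_code': 'USA', 'region_code': None}, {'country_code': None, 'region_code': 'US-CA'}]
```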

You cannot have the Pandas output (unless you use pl.Object)

DaiZack commented 5 months ago

OK, but that seems painful: sometimes you just forget the rule, expect the same output your function returns, and then go nuts. 🥜