Open DaiZack opened 5 months ago
There seems to be a difference with how Series/DataFrames handle nested data - which I think causes this?
data = [
    {"match": {"country_code": "A"}},
    {"match": {"region_code": "B"}}
]
pl.Series(data)
# shape: (2,)
# Series: '' [struct[1]]
# [
# {{"A"}}
# {{null}}
# ]
pl.DataFrame(data)
# shape: (2, 1)
# ┌────────────┐
# │ match      │
# │ ---        │
# │ struct[2]  │
# ╞════════════╡
# │ {"A",null} │
# │ {null,"B"} │
# └────────────┘
So I think the data is "gone" before return_dtype= comes into play - which is why it appears to do nothing in this case?
Perhaps pl.DataFrame().to_struct() could be a workaround.
def search_all(state, country):
    # Return list of dictionaries, each with a single "match" field
    return pl.DataFrame([
        {"match": {"country_code": country}},
        {"match": {"region_code": f"US-{state}"}},
    ]).to_struct()
The result is still the same
import polars as pl

def search_all(state, country):
    # Return list of dictionaries, each with a single "match" field
    return pl.DataFrame(
        [
            {"match": {"country_code": country}},
            {"match": {"region_code": f"US-{state}"}},
        ]
    ).to_struct()

df = pl.DataFrame(
    [
        {"state": "CA", "country": "USA"},
        {"state": "TX", "country": "USA"},
    ]
)
# Adjust the return_dtype to accurately reflect the output structure of search_all
df = df.with_columns(
    [
        pl.struct(["state", "country"])
        .map_elements(lambda s: search_all(**s))
        .alias("query"),
    ]
)
output:
shape: (2, 3)
┌───────┬─────────┬───────────────────────────────────┐
│ state ┆ country ┆ query                             │
│ ---   ┆ ---     ┆ ---                               │
│ str   ┆ str     ┆ list[struct[1]]                   │
╞═══════╪═════════╪═══════════════════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-CA"}… │
│ TX    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-TX"}… │
└───────┴─────────┴───────────────────────────────────┘
The result is still the same
The initial output is:
shape: (2, 3)
┌───────┬─────────┬───────────────────────┐
│ state ┆ country ┆ query                 │
│ ---   ┆ ---     ┆ ---                   │
│ str   ┆ str     ┆ list[struct[1]]       │
╞═══════╪═════════╪═══════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA"}}, {{null}}] │
│ TX    ┆ USA     ┆ [{{"USA"}}, {{null}}] │
└───────┴─────────┴───────────────────────┘
Wrapping it in a frame produces:
shape: (2, 3)
┌───────┬─────────┬────────────────────────────────────┐
│ state ┆ country ┆ query                              │
│ ---   ┆ ---     ┆ ---                                │
│ str   ┆ str     ┆ list[struct[1]]                    │
╞═══════╪═════════╪════════════════════════════════════╡
│ CA    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-CA"}}] │
│ TX    ┆ USA     ┆ [{{"USA",null}}, {{null,"US-TX"}}] │
└───────┴─────────┴────────────────────────────────────┘
Sorry about the previous comment. The outputs are different, but still not what I expected (which is the pandas-style output):
[{'match': {'country_code': 'USA'}}, {'match': {'region_code': 'US-CA'}}]
Polars forces both country_code and region_code into every "match" and sets the one that does not exist to null.
I can work around this by converting the result to a JSON string, but the behaviour is still unexpected and may cause surprising results.
In this example, the format is dictated by the Elasticsearch query. If nulls are added, the query will fail.
Oh right - Yeah, Polars differs from Pandas in that regard.
If you allow Polars to process the data as lists/structs then all values must have the same dtype:
>>> df.schema["query"]
List(Struct({'match': Struct({'country_code': String, 'region_code': String})}))
You cannot have the Pandas output (unless you use pl.Object)
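If keeping the struct dtype is preferable, another option is to drop the null fields after pulling the rows back into Python. The helper below is only a sketch: strip_nulls is a hypothetical name, not a Polars function, and the sample row is written by hand to mirror the unified schema shown above.

```python
def strip_nulls(obj):
    """Recursively drop dict keys whose value is None (hypothetical helper)."""
    if isinstance(obj, dict):
        return {k: strip_nulls(v) for k, v in obj.items() if v is not None}
    if isinstance(obj, list):
        return [strip_nulls(v) for v in obj]
    return obj


# One row of the "query" column as plain Python data,
# with the nulls that the unified struct schema introduces
row = [
    {"match": {"country_code": "USA", "region_code": None}},
    {"match": {"country_code": None, "region_code": "US-CA"}},
]

cleaned = strip_nulls(row)
print(cleaned)
```

Note that this removes every null-valued key, so it only fits cases where a null never carries meaning in the target query.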
OK. But that seems painful, since sometimes you just forget the rule, expect the same output as your function returns, and then go nuts. 🥜
Issue description
I was trying to generate an Elasticsearch query from a DataFrame. The required format has inconsistent keys under the same "match" mapping within a list. I tried adding a return type with the nested struct, but the result is still not right.