Open heshamdar opened 1 month ago
I've run into similar behaviour with a list of structs:
import polars as pl

original_df = pl.DataFrame(
    [
        {
            "list_of_dicts": [
                {"foo": "bar", "qux": "quz"},
                {"foo": "baz", "qux": "quy"},
            ],
        },
        {
            "list_of_dicts": [
                {"foo": "bar", "qux": "quz"},
                {"foo": "baz", "qux": "quy"},
            ],
        },
        {
            "list_of_dicts": [
                {"foo": "bar", "qux": "quz"},
                {"foo": "baz", "qux": "quy"},
            ],
        },
    ]
)

transformed_df = original_df.with_columns(
    pl.col("list_of_dicts").map_elements(
        lambda structs: structs, return_dtype=pl.List(pl.Struct), skip_nulls=True
    )
)
print(transformed_df)
┌─────────────────┐
│ list_of_dicts │
│ --- │
│ list[struct[0]] │
╞═════════════════╡
│ [] │
│ [] │
│ [] │
└─────────────────┘
Versions of polars before 1.7.0 returned the list of structs correctly:
┌────────────────────────────────┐
│ list_of_dicts │
│ --- │
│ list[struct[2]] │
╞════════════════════════════════╡
│ [{"bar","quz"}, {"baz","quy"}] │
│ [{"bar","quz"}, {"baz","quy"}] │
│ [{"bar","quz"}, {"baz","quy"}] │
└────────────────────────────────┘
Similarly to @heshamdar, I've noticed that setting skip_nulls=False when running the example with polars >=1.7.0 does return the lists of structs I'm expecting (though it's not clear to me why).
@elipinska That example also works without the return_dtype= argument:
original_df.with_columns(
    pl.col("list_of_dicts").map_elements(
        lambda structs: structs
    )
)
# ┌────────────────────────────────┐
# │ list_of_dicts │
# │ --- │
# │ list[struct[2]] │
# ╞════════════════════════════════╡
# │ [{"bar","quz"}, {"baz","quy"}] │
# │ [{"bar","quz"}, {"baz","quy"}] │
# │ [{"bar","quz"}, {"baz","quy"}] │
# └────────────────────────────────┘
It seems pl.Struct is casting to a struct of "0 fields".
(It's unclear to me if an empty pl.Struct should be allowed at all?)
I've traced the double_list example to here:
[crates/polars-python/src/series/map.rs:94:17] avs.clone() = [
List(
shape: (1,)
Series: '' [list[null]]
[
[]
],
),
List(
shape: (1,)
Series: '' [list[i64]]
[
[0, 1, 2]
],
),
]
So it seems Series::new on line 90 takes the type of the first element, and that's where the nulls are introduced.
I'm not sure if a different Series constructor is supposed to be used when return_dtype is given?
Or should the first element be explicitly cast to return_dtype?
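The first-element inference described above can be mimicked with a toy model in plain Python. This is only a sketch of the suspected mechanism, not the polars implementation: build_column, infer_inner_type and the declared_inner parameter are hypothetical names, and a single level of nesting stands in for the list-of-lists case.

```python
from typing import Optional


def infer_inner_type(value: list) -> Optional[type]:
    """Return the type of the first item, or None for an empty list."""
    return type(value[0]) if value else None


def build_column(values: list, declared_inner: Optional[type] = None) -> list:
    """Toy column builder: items that do not fit the inner type become None."""
    if declared_inner is not None:
        # A declared type wins over anything inferred from the data.
        inner = declared_inner
    else:
        # Otherwise the first value alone decides the inner type.
        inner = infer_inner_type(values[0]) if values else None

    out = []
    for value in values:
        if inner is None:
            # Analogue of list[null]: no item can be represented.
            out.append([None for _ in value])
        else:
            out.append([item if isinstance(item, inner) else None for item in value])
    return out


values = [[], [0, 1, 2]]
print(build_column(values))       # [[], [None, None, None]]
print(build_column(values, int))  # [[], [0, 1, 2]]
```

In this model the empty first value fixes the inner type as "unknown", so every later item is nulled out unless a declared type overrides the inference, which is the behaviour one would expect return_dtype to provide.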
It seems two completely different code paths are taken depending on the value of skip_nulls. It's a bit hard to follow, but that seems to be why there is such odd behaviour when it is set.
Issue description
When a map_elements operation returns a list of lists and the skip_nulls argument is set to False, the type of the first returned element seems to be used. In the case where this is an empty list, all subsequent values will have null values. This isn't the case if skip_nulls is set to True.
That is a bit confusing, since neither the input nor the output should be null.
The example shows the behaviour under the settings described above, with the only failing case shown in the column double_list.
I think it's perhaps similar to the issue addressed by this fix, but it is still a problem for lists of lists (and probably deeper nesting): https://github.com/pola-rs/polars/pull/18567
Expected behavior
Similar to the single list case, the type for the nested list should be the one defined in the return_dtype argument, and not inferred from the first result. Also, it shouldn't matter whether the skip_nulls argument is True or False (maybe?)