pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`struct.rename_fields` enhancements: correct name count & dict input #10777

Open Julian-J-S opened 1 year ago

Julian-J-S commented 1 year ago

Problem description

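Adjusting struct field names with rename_fields currently behaves in surprising ways: extra names are silently ignored, and passing too few names silently drops the remaining fields.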

Length of names parameter

import polars as pl

rating_Series = pl.Series(
    "ratings",
    [
        {"Movie": "Cars", "Theatre": "NE", "Avg_Rating": 4.5},
        {"Movie": "Toy Story", "Theatre": "ME", "Avg_Rating": 4.9},
    ],
)

# Start
rating_Series.struct.unnest()
┌───────────┬─────────┬────────────┐
│ Movie     ┆ Theatre ┆ Avg_Rating │
│ ---       ┆ ---     ┆ ---        │
│ str       ┆ str     ┆ f64        │
╞═══════════╪═════════╪════════════╡
│ Cars      ┆ NE      ┆ 4.5        │
│ Toy Story ┆ ME      ┆ 4.9        │
└───────────┴─────────┴────────────┘

# Too many names
(
    rating_Series
    .struct.rename_fields(names=['Film', 'State', 'Value', 'hello', 'world'])
    .struct.unnest()
)
┌───────────┬───────┬───────┐
│ Film      ┆ State ┆ Value │
│ ---       ┆ ---   ┆ ---   │
│ str       ┆ str   ┆ f64   │
╞═══════════╪═══════╪═══════╡
│ Cars      ┆ NE    ┆ 4.5   │
│ Toy Story ┆ ME    ┆ 4.9   │
└───────────┴───────┴───────┘

# Too few
(
    rating_Series
    .struct.rename_fields(names=['Film'])
    .struct.unnest()
)
┌───────────┐
│ Film      │
│ ---       │
│ str       │
╞═══════════╡
│ Cars      │
│ Toy Story │
└───────────┘

To discuss:

Comparison to DataFrame columns:

Add option to provide a mapping to adjust only selected names

Example: rename_fields({'Movie': 'Film', 'Theatre': 'State'})
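
Until rename_fields accepts a mapping, the dict form can be emulated by expanding the mapping into the full positional list that rename_fields takes today. A minimal sketch, assuming a hypothetical helper expand_rename_mapping (not part of Polars) and that the current field names are available as a plain list, for example from Series.struct.fields:

```python
def expand_rename_mapping(current: list[str], mapping: dict[str, str]) -> list[str]:
    """Return the full list of new names, renaming only the mapped fields."""
    unknown = [name for name in mapping if name not in current]
    if unknown:
        # Assumption: unknown keys are treated as an error rather than ignored.
        raise KeyError(f"fields not found in struct: {unknown}")
    # Fields without an entry in the mapping keep their current name.
    return [mapping.get(name, name) for name in current]


fields = ["Movie", "Theatre", "Avg_Rating"]
new_names = expand_rename_mapping(fields, {"Movie": "Film", "Theatre": "State"})
print(new_names)  # ['Film', 'State', 'Avg_Rating']
```

The resulting list could then be passed on as rating_Series.struct.rename_fields(new_names). Whether unknown keys should raise or be ignored is a design choice this issue does not settle.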

cmdlineluser commented 1 year ago

Too few should error: https://github.com/pola-rs/polars/issues/9052#issuecomment-1564253746

Too few names dropping missing columns is not intended: https://github.com/pola-rs/polars/issues/9052#issuecomment-1564253746

ion-elgreco commented 1 year ago

Too few should error: #9052 (comment)

Why though? A normal rename can do partial renames. Shouldn't struct.rename_fields behave similarly and keep the other fields, unrenamed, when no new name has been passed for them?

deanm0000 commented 1 year ago

Too few should error: #9052 (comment)

Why though? A normal rename can do partial renames. Shouldn't struct.rename_fields behave similarly and keep the other fields, unrenamed, when no new name has been passed for them?

It seems the balance is between there being a use case for wanting to rename the first n fields positionally vs simply accidentally feeding too few arguments to the rename.

I know I'm much more likely to be in the latter camp than the former. Additionally, if you are in the former camp and get an error here, you'll know how to address it.
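
The strict behaviour argued for here amounts to a length check before the positional rename. A rough sketch, assuming a hypothetical helper check_name_count (not a Polars function; the error type and message are illustrative only):

```python
def check_name_count(current: list[str], names: list[str]) -> list[str]:
    """Return names unchanged if the count matches the struct's fields, else raise."""
    if len(names) != len(current):
        # Both "too many" and "too few" are rejected instead of being
        # silently truncated or silently dropping fields.
        raise ValueError(
            f"expected {len(current)} field names, got {len(names)}"
        )
    return list(names)


fields = ["Movie", "Theatre", "Avg_Rating"]
ok = check_name_count(fields, ["Film", "State", "Value"])
print(ok)  # ['Film', 'State', 'Value']

try:
    check_name_count(fields, ["Film"])
except ValueError as exc:
    print(exc)  # expected 3 field names, got 1
```

Someone who really does want to rename only the first n fields positionally can still do so by padding the list with the remaining current names before calling the helper.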

DGolubets commented 6 months ago

Would be great to have rename_fields accept a dict.

cmdlineluser commented 6 months ago

@DGolubets .name.map_fields() has since been added, which can help if you're using frames.

df = rating_Series.to_frame()

df.schema["ratings"]
# Struct({'Movie': String, 'Theatre': String, 'Avg_Rating': Float64})

df.with_columns(
   pl.col("ratings").name.map_fields(lambda f:
       {"Movie": "Film", "Theatre": "State"}.get(f, f)
   )
).schema["ratings"]
# Struct({'Film': String, 'State': String, 'Avg_Rating': Float64})

DGolubets commented 6 months ago

@cmdlineluser Great!

DeflateAwning commented 6 months ago

+1 on .rename_fields() supporting a dict argument