pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

unnest: allow renaming, make applicable to lists, and allow *all #19110

Open petercordia opened 6 days ago

petercordia commented 6 days ago

Description

Note: I am willing to implement this myself if there is interest.

Allow renaming:

Currently, if you have a column called 'A' with dtype Struct({'X': Int64, 'Y': Int64, 'Z': Int64}), unnesting it gives you columns called 'X', 'Y' and 'Z' instead of 'A'. Sometimes, however, you will want different names, for example to preserve information or to prevent conflicting column names and a DuplicateError. I propose that you should be able to provide a renaming pattern or a renaming function, such that for example

df = pl.DataFrame({'A':[{'X':1, 'Y':2}], 'B':[{'Y':20, 'Z':30}]})
[then] df.unnest(columns=['A', 'B'], renaming_pattern = '{col}_{field}')
  [is] df.unnest(columns=['A', 'B'], renaming_function = lambda col, field: f'{col}_{field}')
  [is] pl.DataFrame({'A_X':[1], 'A_Y':[2], 'B_Y':[20], 'B_Z':[30]})
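To make the intended equivalence concrete, here is a minimal pure-Python sketch of how the output names could be resolved. The helper `output_names` and its arguments are hypothetical and only model the naming step; they are not part of polars. It also shows why the default naming collides for this example.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable


def output_names(
    struct_fields: dict[str, list[str]],
    rename: Callable[[str, str], str] = lambda col, field: field,
) -> list[str]:
    """Hypothetical helper: compute the column names an unnest would produce.

    The default `rename` mimics today's behaviour (keep the field name).
    """
    names = [
        rename(col, field)
        for col, fields in struct_fields.items()
        for field in fields
    ]
    # Report every name that would appear more than once.
    dupes = sorted(name for name, count in Counter(names).items() if count > 1)
    if dupes:
        raise ValueError(f"duplicate output column names: {dupes}")
    return names


fields = {"A": ["X", "Y"], "B": ["Y", "Z"]}
pattern = "{col}_{field}"

# A pattern is just a renaming function built from str.format.
by_pattern = output_names(fields, lambda col, field: pattern.format(col=col, field=field))
by_function = output_names(fields, lambda col, field: f"{col}_{field}")
```

Calling `output_names(fields)` with the default naming raises, because both structs contain a field called 'Y'.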

Make applicable to lists:

Unpacking lists isn't conceptually too different from unpacking structs. Currently you first have to call the to_struct method, and then you can call unnest. I propose that providing a list column name to unnest should just work. The 'field name' should be either the index, or the index+1. Because unpacking lists is somewhat different from unpacking structs, it would seem reasonable that it should be possible to provide a struct_renaming_function and a list_renaming_function separately. Or to provide a struct_renaming_pattern and a list_renaming_pattern. (Or a pattern and a function if you really want to.) A list_renaming_pattern could look like, for example, '{col}_{index}'. It could be reasonable to allow list_renaming_pattern='{col}_{index1}' as a shorthand for list_renaming_function = lambda col, index: f'{col}_{index+1}'. Example:

df = pl.DataFrame({'A':[[1,2]]})
[then] df.unnest(columns=['A'], renaming_pattern = '{col}_{index1}')
  [is] pl.DataFrame({'A_1':[1], 'A_2':[2]})

Allow *all:

As mentioned in https://github.com/pola-rs/polars/issues/18936, sometimes you want to unnest everything. Quite frequently actually. There should be a shorthand for doing so. I propose changing the signature from columns: 'ColumnNameOrSelector | Collection[ColumnNameOrSelector]' to columns: Literal['all', 'structs', 'lists'] | 'ColumnNameOrSelector | Collection[ColumnNameOrSelector]' where it should be pretty obvious what the 3 added options do. This would mean you no longer have to go through the schema selecting all columns with the Struct dtype.

Backwards compatibility:

The current signature of unnest is

df.unnest(
    columns: 'ColumnNameOrSelector | Collection[ColumnNameOrSelector]',
    *more_columns: 'ColumnNameOrSelector',
) -> 'DataFrame'

I would add 3 literal options to the first argument, and a couple of keyword-only arguments. I believe that this way all currently working calls to .unnest() would keep doing what they do. I can think of try-except constructions that would be broken by this enhancement proposal, but I don't expect those to be common.

Proposed implementation details:

I would rename the current .unnest to ._unnest. I would then define .unnest as a pure python function that does some pre-processing before calling ._unnest.

A detail I am unsure of is how easy or hard it would be to rename .unnest to ._unnest. I have written Rust before, but I have never worked on a Python-Rust interface.

Final notes:

As I said at the start, I am willing to implement this. If you have any opinions on the details I proposed, I would like to hear them. If there are any conditions that I should satisfy to gain support for this change, I will try to comply. If there is no chance whatsoever of this making it into main, I would love to hear, because then I'll stop wasting my time on this ;)

cmdlineluser commented 6 days ago

There may be some relevant discussion here:

Since then, the .name.* namespace has gained some struct methods.

df.select(
    pl.col("A").name.prefix_fields("A_"),
    pl.col("B").name.prefix_fields("B_")
).unnest("A", "B")

# shape: (1, 4)
# ┌─────┬─────┬─────┬─────┐
# │ A_X ┆ A_Y ┆ B_Y ┆ B_Z │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 20  ┆ 30  │
# └─────┴─────┴─────┴─────┘

It requires passing in literal strings though, so it cannot work with multiple columns at once.

I'm not sure if it would be possible to allow the prefix to be inferred from the root name if not given, like:

pl.col("A", "B").name.prefix_fields() # (prefix=None, separator="_")

But that would only deal with 1-level of nesting, so may be the wrong idea.

I think another issue is that one cannot do pl.col(pl.Struct("*")) yet.


petercordia commented 4 days ago

Thanks for pointing this out.

I must've used the wrong search terms to not find these discussions myself (?)

https://github.com/pola-rs/polars/issues/7078#issuecomment-2258225305 in particular lists code that's in the direction I was thinking of.

I see this is quite an active topic, but at the same time I'm not sure how much has actually been implemented in the last year. (Some of these requests are over a year old.)

My proposal is slightly different from the others I've seen. I like my own better of course :relaxed: . But unless other people let me know they do too, it's not likely that this particular version is getting into main, so I don't think I'll be putting much work into this.

cmdlineluser commented 3 days ago

> I must've used the wrong search terms

For me, searching "unnest all", the issue is halfway down page 2 of the results. So it seems it is not so easy to find.

> I'm not sure how much has actually been implemented

I don't think anything has been, it seems to be a topic of low priority for the devs. (understandably)

.struct.field("*") wildcard unpacking is another recent addition.

df.with_columns(
    pl.col("A").name.prefix_fields("A."),
    pl.col("B").name.prefix_fields("B.")
).select(pl.all().struct.field("*"))

# shape: (1, 4)
# ┌─────┬─────┬─────┬─────┐
# │ A.X ┆ A.Y ┆ B.Y ┆ B.Z │
# │ --- ┆ --- ┆ --- ┆ --- │
# │ i64 ┆ i64 ┆ i64 ┆ i64 │
# ╞═════╪═════╪═════╪═════╡
# │ 1   ┆ 2   ┆ 20  ┆ 30  │
# └─────┴─────┴─────┴─────┘

So perhaps it would be useful to have the functionality available generally instead of putting it inside of just unnest().