pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Select nested struct member #3123

Open cjermain opened 2 years ago

cjermain commented 2 years ago

While nested structs are supported in a DataFrame, it is difficult to select nested columns. This feature request describes a few ideas for making this easier. I'm happy to explore implementing these suggestions.

I would like to improve upon this current approach:

import polars as pl

# Nest two series into a struct column
df = pl.DataFrame({'b': [1, 2, 3], 'c': [0, 0, 0]}).to_struct('a').to_frame()

# Select 'a' then 'b'
df.select(pl.col('a').struct.field('b'))

In PySpark, dot notation can be used to select a nested column. This is extremely useful for deeply nested fields. For the example above, I suggest supporting this PySpark-like syntax:

# Select 'b' within 'a'
df.select(pl.col('a.b'))

The column name would retain the a. prefix unless an alias is applied, so this selection would produce a column named a.b.

Unnesting all fields from the a struct could be done with pl.col('a.*'), which would be equivalent to df.select(pl.col('a')).unnest('a') except that the field names again retain the a. prefix.
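For comparison, here is roughly how those two proposed selections map onto what the current API already allows (a sketch reusing the df from above; the rename is only needed to reproduce the proposed a. prefix):

# proposed: df.select(pl.col('a.b'))
df.select(pl.col('a').struct.field('b').alias('a.b'))

# proposed: df.select(pl.col('a.*'))
df.select(pl.col('a')).unnest('a').rename({'b': 'a.b', 'c': 'a.c'})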

These suggestions are similar to the JSONPath notation supported by json_path_match on string Series. I can see value in some of the more complex syntax that JSONPath supports, but I think the two examples above would cover a large share of cases.

alexander-beedie commented 2 years ago

Nice idea; I'd suggest not overloading col for this purpose though, as there are plenty of valid column names that may contain a ".". If it happens, perhaps something like pl.path would be appropriate, alluding to the JSONPath similarity?

ritchie46 commented 2 years ago

I understand the rationale for this feature, but I don't want this under the col expression. As @alexander-beedie says, this can have strange side effects with column names. I lean more toward a new function, or toward overloading the pl.struct function.

cjermain commented 2 years ago

Thanks for the feedback. I understand the concern of shadowing, but I think with the right logic this would not introduce confusion.

Here is example logic:

if the column name exists, use the column
elif the name is dot separated, split the name and look for the column as nested
else, raise a not-found error

With a bias toward the explicit columns, this is only enacted when an exception would otherwise be raised. Even if someone chooses to name a column a.*, the logic would pick that column over the expansion. When selecting a.b from the nested struct in the example, the new DataFrame would carry the explicit name a.b, so col would consistently find the right field.
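A minimal sketch of that lookup logic as an external helper (hypothetical; resolve is not a polars function), reusing the df from the first example:

import polars as pl

def resolve(df: pl.DataFrame, name: str) -> pl.Expr:
    # 1. An explicit column of that exact name always wins.
    if name in df.columns:
        return pl.col(name)
    # 2. Otherwise, try to read a dot-separated name as a nested struct field.
    root, *fields = name.split('.')
    if fields and root in df.columns:
        expr = pl.col(root)
        for field in fields:
            expr = expr.struct.field(field)
        return expr.alias(name)
    # 3. Nothing matched.
    raise pl.exceptions.ColumnNotFoundError(name)

df = pl.DataFrame({'b': [1, 2, 3], 'c': [0, 0, 0]}).to_struct('a').to_frame()
df.select(resolve(df, 'a.b'))  # field 'b' of struct 'a', kept under the name 'a.b'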

This is currently a major pain point for using Polars with nested parquet, and people coming from PySpark will try to use this syntax. I think col is the most obvious place to discover this feature.

ritchie46 commented 2 years ago

if the column name exists, use the column
elif the name is dot separated, split the name and look for the column as nested
raise not found error

Hmm... I don't like it. This means that a query depends on the state of the DataFrame, which is not visible from reading the code. That is against polars' philosophy. It should be clear from reading the code what polars will select, and that should not change if we add a similarly named column.

Given that we are creating a new query language for nested data, I think we should build this out in a separate selector.

Some names that may fit:

Besides field access, we might also index List-typed arrays: a[0]. Or, if field b of a is a list: a.b[0].
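For reference, that access pattern can already be written with the existing expression namespaces (a sketch, assuming a struct column a whose field b is a list):

# what a.b[0] would mean, spelled with the current API
df.select(pl.col('a').struct.field('b').list.get(0))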

Edit

To help users coming from pyspark, we could improve error information when they select col("a.b") with an error hinting at the name of the function.

alexander-beedie commented 2 years ago

Just to note: even with a bias in one direction, unresolvable ambiguity would remain. For example:

The DataFrame has a struct column x with fields y and z, and also a column literally named x.y. If you rely only on col and bias in one direction, you would be unable to reference the y field of the struct (you would get column x.y); bias in the other direction and you would be unable to reference column x.y (you would get the y field of struct x).
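A concrete version of that ambiguity (a sketch; the column names here are hypothetical):

import polars as pl

df = pl.DataFrame({
    'x': [{'y': 1, 'z': 2}],    # struct column with fields y and z
    'x.y': ['plain column'],    # a real column whose name contains a dot
})

# pl.col('x.y') can only resolve to one of the two targets, so whichever
# bias is chosen, the other becomes unreachable through col alone.
df.select(pl.col('x.y'))                   # today: the literal 'x.y' column
df.select(pl.col('x').struct.field('y'))   # the struct field, stated explicitly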

If I recall correctly, PySpark requires column names containing dots to be quoted with backticks (e.g. `x.y`) to avoid exactly this kind of problem, but that isn't an option in Polars, so mapping the dot syntax 1:1 onto col can't really work reliably. Just my 2¢... ;)

I'm all in favour of a nice/lean selection syntax though, as I suspect struct support is something we'll be looking to take advantage of in the not-too-distant future 👍

cjermain commented 2 years ago

These are good points -- I see what you are both saying. I definitely agree on the error message improvement helping to direct people toward this feature. I like the idea of pl.unnest, since it is explicit about the action being taken (nested columns within a struct are being unnested/extracted) -- a more verbose name could be pl.extract_col, but perhaps that is too long. I also like pl.path, but I think it is less clear as to how the path is being used. With this being a different function I can see value in trying to support the broader features of JSONPath.

ophiry commented 9 months ago

While it would be nice to have this as a native feature, it's pretty simple to implement as an external function. For example: https://gist.github.com/ophiry/78e6e04a8fde01e58ee289febf3bc4cc

theelderbeever commented 3 months ago

For anyone who lands here, the following works in 0.20.31 (probably earlier, but that's the version I am on):

df.select(pl.col('a').struct['*'])