pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`.unnest_all()` #12353

Open cmdlineluser opened 11 months ago

cmdlineluser commented 11 months ago

Description

Requests for this functionality (or a subset of it) have come up across quite a few issues (and several Stack Overflow questions).

unnest_all has cropped up a few times, so I've just chosen that name as a placeholder.

The basic use case is to allow:

df = pl.DataFrame({
    "x": [{"foo":{"a": 1, "b": 2}}, {"foo":{"a": 3, "b": 4}}],
    "y": [{"bar":{"a": 5, "b": 6}}, {"bar":{"a": 7, "b": 8}}]
})

df.pipe(unnest_all)

# shape: (2, 4)
# ┌─────────┬─────────┬─────────┬─────────┐
# │ x.foo.a ┆ x.foo.b ┆ y.bar.a ┆ y.bar.b │
# │ ---     ┆ ---     ┆ ---     ┆ ---     │
# │ i64     ┆ i64     ┆ i64     ┆ i64     │
# ╞═════════╪═════════╪═════════╪═════════╡
# │ 1       ┆ 2       ┆ 5       ┆ 6       │
# │ 3       ┆ 4       ┆ 7       ┆ 8       │
# └─────────┴─────────┴─────────┴─────────┘

My latest attempt at a Python helper for this is to walk the schema to build the expressions:

def _unnest_all(schema, separator):
    def _unnest(schema, path=()):
        # Walk the nested schema, yielding the full path to each
        # non-struct leaf together with its dtype.
        for name, dtype in schema.items():
            base_type = dtype.base_type()

            if base_type == pl.Struct:
                yield from _unnest(dtype.to_schema(), (*path, name))
            else:
                yield (*path, name), dtype

    for (col, *fields), dtype in _unnest(schema):
        expr = pl.col(col)

        for field in fields:
            expr = expr.struct[field]

        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col, *fields])

        yield expr.alias(name)

def unnest_all(df, separator="."):
    return df.select(_unnest_all(df.schema, separator))

However, I think the real benefit of this functionality (and the reason for this issue) is that it allows Polars to be used for interactively exploring nested data.

an interesting example, polars expressions:

pl.debug = lambda self: pl.select(pl.lit(self.meta.write_json()).str.json_extract().alias("")).pipe(unnest_all)

pl.debug((pl.col("a") + pl.col("b")).over("c", "d"))

# shape: (1, 5)
# ┌────────────────────────────────────────┬───────────────────────────────┬─────────────────────────────────────────┬─────────────────────┬─────────────────────┐
# │ Window.function.BinaryExpr.left.Column ┆ Window.function.BinaryExpr.op ┆ Window.function.BinaryExpr.right.Column ┆ Window.partition_by ┆ Window.options.Over │
# │ ---                                    ┆ ---                           ┆ ---                                     ┆ ---                 ┆ ---                 │
# │ str                                    ┆ str                           ┆ str                                     ┆ list[struct[1]]     ┆ str                 │
# ╞════════════════════════════════════════╪═══════════════════════════════╪═════════════════════════════════════════╪═════════════════════╪═════════════════════╡
# │ a                                      ┆ Plus                          ┆ b                                       ┆ [{"c"}, {"d"}]      ┆ GroupsToRows        │
# └────────────────────────────────────────┴───────────────────────────────┴─────────────────────────────────────────┴─────────────────────┴─────────────────────┘

Using Polars to load "JSON" data in the REPL and interactively explore it with .unnest_all() and .explode() is rather nice.

A proper implementation of this would be super useful.

itay747 commented 10 months ago

A high-performance port of pd.json_normalize would be really useful. No Python library does JSON normalization particularly well, including the pandas one: they all fall short on the common case of an arbitrarily deeply nested (k, v) pair where k is a str key and v is an Iterable[dict[str, str | float | int]]. In the pandas case, the "normalized" result simply emits v unchanged, which falls short of what is possible/expected.

There should be specific logic here that does heuristic inspection on v, to avoid flatten-dict's approach of using list indices as keys, which results in a crude a.0.id key in the flattened dictionary. So this could be something like:

  1. Find a .*id.* key in the list[dict[str, Any]], then
  2. verify all entries in the list have that key present, with unique and atomic values;
  3. failing that, fall back to flatten-dict's approach.
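A rough sketch of that heuristic (the function name, signature, and fallback behaviour are hypothetical, not an existing Polars API):

```python
import re


def find_record_key(records, pattern=r".*id.*"):
    # Hypothetical heuristic: among keys of the first record, look for one
    # matching `pattern` that is present in every record with unique,
    # atomic (non-container) values. Return None to signal the caller to
    # fall back to flatten-dict style integer keys.
    if not records:
        return None
    for key in records[0]:
        if not re.fullmatch(pattern, key):
            continue
        if not all(key in record for record in records):
            continue
        values = [record[key] for record in records]
        if any(isinstance(v, (list, dict, set)) for v in values):
            continue  # values are not atomic
        if len(set(values)) == len(values):  # values are unique
            return key
    return None
```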

Perhaps this behaviour could be made customizable via a parameter on the eventual pl.normalize_json function that accepts a regex for this, but let's limit it to just that one parameter. I would want to avoid the pandas approach of accepting meta/meta_prefix/record_prefix arguments: it's simply too verbose in practice for deeply nested dictionaries, and it assumes those keys even exist, in the context of a function specifically made for making deeply nested JSON objects manageable.

Oh yeah, and above all: this function should follow the standard source: str | Path | IOBase | bytes parameter signature, in line with the other polars.io.read_* ops.

Wainberg commented 9 months ago

This seems really useful! How about just df.unnest() with no arguments, rather than df.unnest_all()?

niccolopetti commented 8 months ago

Just had to use this on my project; I hope to see it merged soon.

cmdlineluser commented 8 months ago

Yeah, I just took the name to use as a placeholder.

DuckDB seems to have a recursive parameter for it.

duckdb.sql("""
from df 
select unnest(x), unnest(y)
""")

# ┌────────────────────────────┬────────────────────────────┐
# │            foo             │            bar             │
# │ struct(a bigint, b bigint) │ struct(a bigint, b bigint) │
# ├────────────────────────────┼────────────────────────────┤
# │ {'a': 1, 'b': 2}           │ {'a': 5, 'b': 6}           │
# │ {'a': 3, 'b': 4}           │ {'a': 7, 'b': 8}           │
# └────────────────────────────┴────────────────────────────┘
duckdb.sql("""
from df 
select unnest(x, recursive := true), unnest(y, recursive := true)
""")

# ┌───────┬───────┬───────┬───────┐
# │   a   │   b   │   a   │   b   │
# │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┤
# │     1 │     2 │     5 │     6 │
# │     3 │     4 │     7 │     8 │
# └───────┴───────┴───────┴───────┘

(although it doesn't seem possible to keep the "path")

Wainberg commented 8 months ago

Worth noting that DuckDB's recursive is actually a mix of polars's unnest and polars's list.explode:

-- unnesting a list of lists recursively, generating 5 rows (1, 2, 3, 4, 5)
SELECT unnest([[1, 2, 3], [4, 5]], recursive := true);
-- unnesting a list of structs recursively, generating two rows of two columns (a, b)
SELECT unnest([{'a': 42, 'b': 84}, {'a': 100, 'b': NULL}], recursive := true);
-- unnesting a struct, generating two columns (a, b)
SELECT unnest({'a': [1, 2, 3], 'b': 88}, recursive := true);

It would be nice to unnest a single level with plain unnest() with no arguments. An optional extension would be to allow some kind of recursive unnesting with recursive=True. But those are two different things.

fzyzcjy commented 6 months ago

Looking forward to seeing it merged!

P.S. Spent a minute slightly modified it to avoid some linter warnings and add type info, pasted here in case someone needs it:

```python
import polars as pl


def unnest_all(df: pl.DataFrame, separator=".") -> pl.DataFrame:
    """https://github.com/pola-rs/polars/issues/12353"""
    return df.select(_unnest_all(df.schema, separator))


def _unnest_all(schema, separator):
    for (col, *fields), dtype in _unnest(schema, []):
        expr = pl.col(col)

        for field in fields:
            expr = expr.struct[field]

        if col == "":
            name = separator.join(fields)
        else:
            name = separator.join([col] + fields)

        yield expr.alias(name)


def _unnest(schema, path):
    for name, dtype in schema.items():
        base_type = dtype.base_type()

        if base_type == pl.Struct:
            yield from _unnest(dtype.to_schema(), path + [name])
        else:
            yield path + [name], dtype
```

mallport commented 6 months ago

Thanks @cmdlineluser and @fzyzcjy for sharing, that code snippet was useful to me! Very slick.

The inverse operation of "unnest_all" would also be very useful - re-nesting normalized columns based on a separator.

MarcoGorelli commented 4 months ago

from discussion:

  1. there should be a selector for struct dtype
  2. unnest() (without arguments) should probably just use that (rather than adding unnest_all)
daviewales commented 2 months ago

It would be nice to be able to recursively unnest both lists and structs with an automatic prefix based on the column name. I've commented some functions to do this in another issue: https://github.com/pola-rs/polars/issues/7078#issuecomment-2258225305.