Open cmdlineluser opened 11 months ago
A high performance port of pd.json_normalize would be really useful. There isn't any Python library that actually does JSON normalization particularly well, including the pandas one, because it falls short of handling the common case of arbitrarily deeply nested (k, v) pairs, with k being a str key and v an Iterable[dict[str, str | float | int]] type. In the pandas case, the "normalized" result will just emit v, which really falls short of what is possible/expected.
There should be specific logic here that does heuristic inspection on v to avoid flatten-dict's approach of using integers as keys, which results in a crude a.0.id key in the flattened dictionary. So this could be something like:
1) Find a .*id.* key in the list[dict[str, Any]], and then
2) Verify all entries in the list have that key present, with unique and atomic values.
3) Failing that, default back to flatten-dict's approach.
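The three steps above can be sketched in plain Python. This is only an illustration of the heuristic, not an existing Polars or flatten-dict API: the names find_id_key and flatten, the default pattern, and the choice to keep the id field in the output are all assumptions made for the example.

```python
import re
from typing import Any, Optional

# Values that count as "atomic" for the purposes of this sketch (an assumption).
ATOMIC = (str, int, float, bool)


def find_id_key(items: list, pattern: str = r".*id.*") -> Optional[str]:
    """Steps 1 and 2: find a key matching `pattern` that every entry has,
    with unique and atomic values. Returns None if no key qualifies."""
    if not items or not all(isinstance(item, dict) for item in items):
        return None
    candidates = [k for k in items[0] if re.fullmatch(pattern, k, flags=re.IGNORECASE)]
    for key in candidates:
        if not all(key in item for item in items):
            continue
        values = [item[key] for item in items]
        if all(isinstance(v, ATOMIC) for v in values) and len(set(values)) == len(values):
            return key
    return None


def flatten(obj: Any, prefix: str = "", sep: str = ".", id_pattern: str = r".*id.*") -> dict:
    """Flatten nested dicts/lists into a single dict keyed by joined paths."""
    out: dict = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            out.update(flatten(v, f"{prefix}{sep}{k}" if prefix else str(k), sep, id_pattern))
    elif isinstance(obj, list):
        id_key = find_id_key(obj, id_pattern)
        for i, item in enumerate(obj):
            # Step 3: fall back to the integer index when no id key qualifies.
            label = str(item[id_key]) if id_key is not None else str(i)
            out.update(flatten(item, f"{prefix}{sep}{label}" if prefix else label, sep, id_pattern))
    else:
        out[prefix] = obj
    return out


data = {
    "a": [{"id": "x", "v": 1}, {"id": "y", "v": 2}],  # has a usable id key
    "b": [{"v": 1}, {"v": 2}],                        # no id key: index fallback
}
flat = flatten(data)
```

Here the entries of a are keyed by their id values (a.x.v, a.y.v), while b falls back to positional keys (b.0.v, b.1.v). Duplicate or non-atomic ids also trigger the fallback.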
Perhaps this behaviour could be customizable with a parameter to the eventual pl.normalize_json function that accepts a regex for dealing with this, but let's limit it to just this parameter. I would want to avoid the pandas approach of accepting a meta/meta_prefix/record_prefix argument - it's simply too verbose in practice for deeply nested dictionaries, not to mention that it makes the assumption that these keys even exist, in the context of a function that is specifically made for making deeply nested JSON objects manageable.
Oh yeah, and above all: this function should follow the standard source: str | Path | IOBase | bytes function ParamSpec, in line with other polars.io.read_* ops.
This seems really useful! How about just df.unnest() with no arguments, rather than df.unnest_all()?
Just had to use this on my project, I hope to see it merged soon
Yeah, I just took the name to use as a placeholder.
DuckDB seems to have a recursive parameter for it.
duckdb.sql("""
from df
select unnest(x), unnest(y)
""")
# ┌────────────────────────────┬────────────────────────────┐
# │ foo │ bar │
# │ struct(a bigint, b bigint) │ struct(a bigint, b bigint) │
# ├────────────────────────────┼────────────────────────────┤
# │ {'a': 1, 'b': 2} │ {'a': 5, 'b': 6} │
# │ {'a': 3, 'b': 4} │ {'a': 7, 'b': 8} │
# └────────────────────────────┴────────────────────────────┘
duckdb.sql("""
from df
select unnest(x, recursive := true), unnest(y, recursive := true)
""")
# ┌───────┬───────┬───────┬───────┐
# │ a │ b │ a │ b │
# │ int64 │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┼───────┤
# │ 1 │ 2 │ 5 │ 6 │
# │ 3 │ 4 │ 7 │ 8 │
# └───────┴───────┴───────┴───────┘
(although it doesn't seem possible to keep the "path")
Worth noting that DuckDB's recursive is actually a mix of polars's unnest and polars's list.explode:
-- unnesting a list of lists recursively, generating 5 rows (1, 2, 3, 4, 5)
SELECT unnest([[1, 2, 3], [4, 5]], recursive := true);
-- unnesting a list of structs recursively, generating two rows of two columns (a, b)
SELECT unnest([{'a': 42, 'b': 84}, {'a': 100, 'b': NULL}], recursive := true);
-- unnesting a struct, generating two columns (a, b)
SELECT unnest({'a': [1, 2, 3], 'b': 88}, recursive := true);
It would be nice to unnest a single level with plain unnest() with no arguments. An optional extension would be to allow some kind of recursive unnesting with recursive=True. But those are two different things.
Looking forward to seeing it merged!
P.S. I spent a minute slightly modifying it to avoid some linter warnings and add type info; pasted here in case someone needs it:
Thanks @cmdlineluser and @fzyzcjy for sharing, that code snippet was useful to me! Very slick.
The inverse operation of "unnest_all" would also be very useful - re-nesting normalized columns based on a separator.
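As a rough sketch of what that inverse could look like, here is a plain-Python version that rebuilds nested dicts from flattened names. The function name nest and the "." separator are assumptions for illustration; in Polars the equivalent step would presumably group columns into structs (for example with pl.struct) rather than build dicts.

```python
from typing import Any


def nest(flat: dict, sep: str = ".") -> dict:
    """Rebuild a nested dict from flattened keys such as 'a.b.c'."""
    nested: dict = {}
    for name, value in flat.items():
        *parents, leaf = name.split(sep)
        node: Any = nested
        for part in parents:
            node = node.setdefault(part, {})
            if not isinstance(node, dict):
                # e.g. both "a" and "a.b" exist as flat names
                raise ValueError(f"{name!r} conflicts with an existing leaf")
        if leaf in node:
            raise ValueError(f"{name!r} conflicts with an existing entry")
        node[leaf] = value
    return nested


renested = nest({"a.b": 1, "a.c.d": 2, "e": 3})
```

Names that clash (a leaf and a parent sharing the same path) raise an error instead of silently overwriting data, which seems like the safer default for a round trip.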
From discussion: unnest() (without arguments) should probably just use that (rather than adding unnest_all).
It would be nice to be able to recursively unnest both lists and structs with an automatic prefix based on the column name. I've commented some functions to do this in another issue: https://github.com/pola-rs/polars/issues/7078#issuecomment-2258225305.
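To make the intended behaviour concrete, here is a small plain-Python model of it (not the functions from the linked comment, and not Polars code): dicts stand in for structs and are expanded into prefixed keys, lists are exploded into multiple rows. The name unnest_row is hypothetical, and two simplifications are assumed: several list fields in one row produce a cross product, and an empty list yields a single row with None.

```python
from typing import Any


def unnest_row(row: dict, sep: str = ".") -> list:
    """Recursively expand dict values into prefixed keys and list values into rows."""
    for key, value in row.items():
        if isinstance(value, dict):
            # "Unnest": replace the struct-like field with prefixed child fields,
            # keeping the original field position.
            expanded: dict = {}
            for k, v in row.items():
                if k == key:
                    expanded.update({f"{key}{sep}{ck}": cv for ck, cv in value.items()})
                else:
                    expanded[k] = v
            return unnest_row(expanded, sep)
        if isinstance(value, list):
            # "Explode": one output row per list element.
            rows: list = []
            for item in value or [None]:
                rows.extend(unnest_row({**row, key: item}, sep))
            return rows
    return [row]


rows = unnest_row({"id": 1, "user": {"name": "a", "tags": ["x", "y"]}})
```

The nested user struct becomes user.name and user.tags, and the two tags become two rows that repeat the other fields.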
Description
Requests for this functionality (or a subset of) exist across quite a few issues (and several Stack Overflow questions):
unnest_all has cropped up a few times, so I've just chosen that name as a placeholder.
The basic use case is to allow:
My latest attempt at a Python helper for this is to walk the schema to build the expressions:
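The helper itself is not reproduced here. Purely to illustrate the schema-walking idea (this is not the author's code), the sketch below models a schema as nested dicts, where a dict stands for a struct and a string for a leaf dtype, and yields the path to every leaf together with a flattened output name. The name walk_schema and the schema representation are assumptions; with Polars, each (path, name) pair could then be turned into an expression along the lines of pl.col(path[0]).struct.field(path[1]) ... .alias(name).

```python
from typing import Any, Iterator


def walk_schema(schema: dict, path: tuple = (), sep: str = ".") -> Iterator[tuple]:
    """Yield (path, flattened_name) for every leaf field of a nested schema."""
    for name, dtype in schema.items():
        field_path = path + (name,)
        if isinstance(dtype, dict):
            # Struct-like field: descend and carry the path along as a prefix.
            yield from walk_schema(dtype, field_path, sep)
        else:
            yield field_path, sep.join(field_path)


schema: dict[str, Any] = {
    "id": "Int64",
    "user": {"name": "String", "address": {"city": "String"}},
}
leaves = list(walk_schema(schema))
```

Because the walk only needs the schema and never touches the data, all expressions can be built up front and applied in a single select.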
However, I think the real benefit of this functionality (and the reason for this issue) is that it allows Polars to be used for interactively exploring nested data.
an interesting example, polars expressions:
Using Polars to load "JSON" data in the REPL and interactively explore it with .unnest_all() and .explode() is rather nice.
A proper implementation of this would be super useful.