pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
30.58k stars 1.98k forks source link

Add `json_encode` to `pl.List` #14029

Open david-waterworth opened 10 months ago

david-waterworth commented 10 months ago

Description

pl.Struct contains a json_encode / json_decode which amongst other things is useful if you want to dump a table to csv but it contains nested fields.

It would be useful if pl.List also contains at least json_encode (json_decode might be trickier as it would have to check that the json was a list).

At the moment my workaround is

df = pl.DataFrame(
    {
        "a": [1, 2, 3, 1],
        "b": [["a"], ["b"], ["c"], ["c"]],
    }
)

df.with_columns(pl.col("b").map_elements(lambda e: json.dumps(list(e))))

I considered using to_struct but that doesn't preserve the list structure.

Also the error from my original attempt

df.with_columns(pl.col("b").map_elements(lambda e: json.dumps(e)))

ComputeError: TypeError: Object of type Series is not JSON serializable

I guess internally a list is a series but that was surprising, Ideally the engine would internally cast to a python list before invoking the function being used by map_elements?

cmdlineluser commented 10 months ago

Yeah, I think I mentioned it at the time but may have gotten lost in translation.

Essentially the equivalent of:

df = pl.DataFrame({
    "a": [["1", "2"], ["3", "4"]],
    "b": [[dict(B=1)], [dict(B=2)]],
    "c": [dict(C=3), dict(C=4)],
    "d": [datetime.date.today(), None],
    "e": [5, 6]
})

# shape: (2, 5)
# ┌────────────┬─────────────────┬───────────┬────────────┬─────┐
# │ a          ┆ b               ┆ c         ┆ d          ┆ e   │
# │ ---        ┆ ---             ┆ ---       ┆ ---        ┆ --- │
# │ list[str]  ┆ list[struct[1]] ┆ struct[1] ┆ date       ┆ i64 │
# ╞════════════╪═════════════════╪═══════════╪════════════╪═════╡
# │ ["1", "2"] ┆ [{1}]           ┆ {3}       ┆ 2024-01-27 ┆ 5   │
# │ ["3", "4"] ┆ [{2}]           ┆ {4}       ┆ null       ┆ 6   │
# └────────────┴─────────────────┴───────────┴────────────┴─────┘
duckdb.sql("""
from df 
select a::json, b::json, c::json, d::json, e::json
""").pl()

# shape: (2, 5)
# ┌───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┐
# │ CAST(a AS "json") ┆ CAST(b AS "json") ┆ CAST(c AS "json") ┆ CAST(d AS "json") ┆ CAST(e AS "json") │
# │ ---               ┆ ---               ┆ ---               ┆ ---               ┆ ---               │
# │ str               ┆ str               ┆ str               ┆ str               ┆ str               │
# ╞═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╡
# │ ["1","2"]         ┆ [{"B":1}]         ┆ {"C":3}           ┆ "2024-01-27"      ┆ 5                 │
# │ ["3","4"]         ┆ [{"B":2}]         ┆ {"C":4}           ┆ null              ┆ 6                 │
# └───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┘

(decoding is .str.json_decode() - it's not struct specific)

(I'm assuming .map_elements requires explicit conversion to list by the user for performance reasons.)

david-waterworth commented 10 months ago

Yeah after I posted the example I realised I also need to cast/encode a list[struct[*]] as well.

deanm0000 commented 10 months ago

map_elements doesn't work b/c to make that into json it needs the whole column at once and map_elements works on each item at a time.

Instead use map_batches like this:

df.select(pl.col('b').map_batches(lambda x: json.dumps(x.to_list())))
david-waterworth commented 10 months ago

@deanm0000 thanks for the suggestion but I actually want it to process one item at a time, I don't want a single valid json str for the entire column, I specifically want to convert each element to a json fragment. Both are valid use cases as I often work with jsonlines and in this case I just want to dump the entire frame to csv.

cmdlineluser commented 10 months ago

As far as I can tell, all that is needed is to add this to list.rs

https://github.com/pola-rs/polars/blob/2c5f4f336059a1876dda9ebad75f9f955b17ae5b/crates/polars-plan/src/dsl/function_expr/struct_.rs#L123-L134

and change let ca = s.struct_()?; to let ca = s.list()?;

But perhaps someone can answer if this should be implemented as Expr.json_encode() instead?

And not limited to lists/structs?

DeflateAwning commented 6 months ago

Seems to be a duplicate of #8482.

Seems like a pretty easy solution. Would be awesome to be able to use pl.col('some_list_col').list.json_encode() (and the same for the arr namespace). Can this please be implemented?