david-waterworth opened 10 months ago
Yeah, I think I mentioned it at the time, but it may have gotten lost in translation.
Essentially the equivalent of:
import datetime

import duckdb
import polars as pl

df = pl.DataFrame({
    "a": [["1", "2"], ["3", "4"]],
    "b": [[dict(B=1)], [dict(B=2)]],
    "c": [dict(C=3), dict(C=4)],
    "d": [datetime.date.today(), None],
    "e": [5, 6],
})
# shape: (2, 5)
# ┌────────────┬─────────────────┬───────────┬────────────┬─────┐
# │ a ┆ b ┆ c ┆ d ┆ e │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ list[str] ┆ list[struct[1]] ┆ struct[1] ┆ date ┆ i64 │
# ╞════════════╪═════════════════╪═══════════╪════════════╪═════╡
# │ ["1", "2"] ┆ [{1}] ┆ {3} ┆ 2024-01-27 ┆ 5 │
# │ ["3", "4"] ┆ [{2}] ┆ {4} ┆ null ┆ 6 │
# └────────────┴─────────────────┴───────────┴────────────┴─────┘
duckdb.sql("""
    from df
    select a::json, b::json, c::json, d::json, e::json
""").pl()
# shape: (2, 5)
# ┌───────────────────┬───────────────────┬───────────────────┬───────────────────┬───────────────────┐
# │ CAST(a AS "json") ┆ CAST(b AS "json") ┆ CAST(c AS "json") ┆ CAST(d AS "json") ┆ CAST(e AS "json") │
# │ --- ┆ --- ┆ --- ┆ --- ┆ --- │
# │ str ┆ str ┆ str ┆ str ┆ str │
# ╞═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╪═══════════════════╡
# │ ["1","2"] ┆ [{"B":1}] ┆ {"C":3} ┆ "2024-01-27" ┆ 5 │
# │ ["3","4"] ┆ [{"B":2}] ┆ {"C":4} ┆ null ┆ 6 │
# └───────────────────┴───────────────────┴───────────────────┴───────────────────┴───────────────────┘
(Decoding is .str.json_decode() - it's not struct-specific.)
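For example, both encodings round-trip back through .str.json_decode() (a minimal sketch, relying on default dtype inference):

import polars as pl

pl.Series(['{"C":3}', '{"C":4}']).str.json_decode()      # -> struct[1]
pl.Series(['["1","2"]', '["3","4"]']).str.json_decode()  # -> list[str]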
(I'm assuming .map_elements requires an explicit conversion to a list by the user, for performance reasons.)
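For instance, on a toy list column (a minimal sketch of that behaviour):

import json
import polars as pl

s = pl.Series([[1, 2], [3]])
# s.map_elements(json.dumps)  # TypeError: Object of type Series is not JSON serializable
s.map_elements(lambda el: json.dumps(el.to_list()))  # ['[1, 2]', '[3]']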
Yeah, after I posted the example I realised I also need to cast/encode a list[struct[*]] as well.
map_elements doesn't work because turning that into JSON needs the whole column at once, and map_elements works on one item at a time. Instead, use map_batches like this:

df.select(pl.col('b').map_batches(lambda x: json.dumps(x.to_list())))
@deanm0000 thanks for the suggestion, but I actually want it to process one item at a time. I don't want a single valid JSON string for the entire column; I specifically want to convert each element to a JSON fragment. Both are valid use cases: I often work with jsonlines, and in this case I just want to dump the entire frame to CSV.
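For example, the CSV dump already works for the plain struct column via the existing struct encoder; it's the list columns that have no equivalent (a sketch with the frame from the example above, assuming struct.json_encode):

df.with_columns(pl.col('c').struct.json_encode()).select('c', 'd', 'e').write_csv()
# returns CSV text where column c holds JSON strings like {"C":3}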
As far as I can tell, all that is needed is to add this to list.rs and change let ca = s.struct_()?; to let ca = s.list()?;

But perhaps someone can answer whether this should be implemented as Expr.json_encode() instead, and not limited to lists/structs?
Seems to be a duplicate of #8482.
Seems like a pretty easy solution. It would be awesome to be able to use pl.col('some_list_col').list.json_encode() (and the same for the arr namespace). Can this please be implemented?
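For concreteness, the requested usage (hypothetical; this API does not exist in Polars today) would look something like:

df.select(
    pl.col('a').list.json_encode(),  # hypothetical: would yield str values like '["1","2"]'
    pl.col('b').list.json_encode(),  # hypothetical: would yield str values like '[{"B":1}]'
)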
Description
pl.Struct has a json_encode / json_decode which, amongst other things, is useful if you want to dump a table to CSV when it contains nested fields. It would be useful if pl.List also had at least json_encode (json_decode might be trickier, as it would have to check that the JSON was a list).

At the moment my workaround is:
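(A sketch of the general shape, assuming map_elements with an explicit to_list() before json.dumps on the list[struct] column b from the example above:)

import json

df.select(
    pl.col('b').map_elements(
        lambda el: json.dumps(el.to_list()),  # each element arrives as a Series, so convert it first
        return_dtype=pl.String,
    )
)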
I considered using to_struct but that doesn't preserve the list structure. Also, the error from my original attempt:
ComputeError: TypeError: Object of type Series is not JSON serializable
I guess internally a list is a Series, but that was surprising. Ideally the engine would internally cast to a Python list before invoking the function being used by map_elements?