pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support empty structs #9216

Open stinodego opened 1 year ago

stinodego commented 1 year ago

Problem description

Although perhaps not extremely useful, we should allow structs without any fields for the sake of consistency.

In the current behaviour, Polars conjures up a single unnamed field of type Null:

>>> pl.Series(dtype=pl.Struct())
shape: (1,)
Series: '' [struct[1]]
[
        {null}
]

Trying to create an empty struct through the struct expression results in a PanicException:

>>> pl.select(pl.struct())
thread '<unnamed>' panicked at 'index out of bounds: the len is 0 but the index is 0', /home/stijn/code/polars/polars/polars-lazy/polars-plan/src/dsl/functions.rs:1296:48
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/stijn/code/polars/py-polars/polars/functions/lazy.py", line 2391, in select
    return pl.DataFrame().select(exprs, *more_exprs, **named_exprs)
  File "/home/stijn/code/polars/py-polars/polars/dataframe/frame.py", line 7117, in select
    self.lazy()
  File "/home/stijn/code/polars/py-polars/polars/lazyframe/frame.py", line 2040, in select
    return self._from_pyldf(self._ldf.select(exprs))
pyo3_runtime.PanicException: index out of bounds: the len is 0 but the index is 0

Desired behaviour would be:

>>> pl.Series(dtype=pl.Struct())
shape: (0,)
Series: '' [struct[0]]
[
]
>>> pl.select(pl.struct())
shape: (1, 1)
┌───────────┐
│ struct    │
│ ---       │
│ struct[0] │
╞═══════════╡
│ {}        │
└───────────┘

sibarras commented 7 months ago

Hi, does the team have a plan to support this? In many cases, parsing empty JSON columns read from a database makes the function panic.

stinodego commented 7 months ago

@sibarras Could you give a reproducible example of that panic?

sibarras commented 7 months ago

Sure. With SQLite, a JSON column is read into Polars as a str. Trying to decode that string into a struct then panics.

from sqlite3 import connect
import polars as pl

def main():
    with connect(":memory:") as con:
        df = pl.read_database(
            "SELECT JSON('{}') as json_col;", con
        )  # it works fine, but it's parsed as a string
        print(df)
        df.select(pl.col("json_col").str.json_decode())  # panics here

if __name__ == "__main__":
    main()

This is the output using Python 3.9.18 on WSL2.

shape: (1, 1)
┌──────────┐
│ json_col │
│ ---      │
│ str      │
╞══════════╡
│ {}       │
└──────────┘
thread 'python' panicked at crates/polars-arrow/src/array/struct_/mod.rs:117:52:
called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("a StructArray must contain at least one field"))
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/example.py", line 15, in <module>
    main()
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/example.py", line 11, in main
    df.select(pl.col("json_col").str.json_decode())  # panics here
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/.venv/lib/python3.9/site-packages/polars/dataframe/frame.py", line 8124, in select
    return self.lazy().select(*exprs, **named_exprs).collect(_eager=True)
  File "/home/samuel_e_ibarra/coding/python/targeting_lib/.venv/lib/python3.9/site-packages/polars/lazyframe/frame.py", line 1943, in collect
    return wrap_df(ldf.collect())
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(ErrString("a StructArray must contain at least one field"))

stinodego commented 6 months ago

I looked into this and empty structs just don't really make much sense. An empty struct column would have to behave somewhat like a Null column as it doesn't contain any Series/values.

We should probably first address https://github.com/pola-rs/polars/issues/3462 before implementing this.

str.json_decode should either error or return a Null column here. I will make a separate issue for that.
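
Until that is settled, a caller-side guard is one possible workaround. The sketch below uses only the Python standard library and is an assumption, not anything Polars provides: the helper name `null_if_empty_object` is hypothetical, and replacing `{}` with null is just one of the two behaviours suggested above.

```python
import json
from typing import Optional


def null_if_empty_object(raw: Optional[str]) -> Optional[str]:
    """Return None when `raw` is JSON for an empty object, else return `raw` unchanged.

    Hypothetical helper, not part of Polars.
    """
    if raw is None:
        return None
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Leave malformed input untouched so the real decoder reports the error.
        return raw
    if isinstance(parsed, dict) and not parsed:
        return None
    return raw


rows = ['{}', '{"a": 1}', ' { } ', None, '[]']
cleaned = [null_if_empty_object(r) for r in rows]
print(cleaned)  # [None, '{"a": 1}', None, None, '[]']
```

The cleaned strings could then be passed on to str.json_decode. Note that this parses every value twice, so it is only a stopgap for small inputs.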

jcmuel commented 5 months ago

Empty structs also cause problems in read_ndjson and json_decode.

Polars already handles empty structs, but inconsistently, and that inconsistency leads to panic exceptions in more complex situations:

import io
import polars as pl

frame = pl.read_ndjson(io.StringIO('{"id": 1, "empty_struct": {}, "list_of_empty_struct": [{}]}'))
print(frame)

for col_name, col_type in frame.schema.items():
    print(f'{col_name:>20}   {col_type}')

Output:

shape: (1, 3)
┌─────┬──────────────┬──────────────────────┐
│ id  ┆ empty_struct ┆ list_of_empty_struct │
│ --- ┆ ---          ┆ ---                  │
│ i64 ┆ struct[1]    ┆ list[struct[0]]      │
╞═════╪══════════════╪══════════════════════╡
│ 1   ┆ {null}       ┆ []                   │
└─────┴──────────────┴──────────────────────┘
                  id   Int64
        empty_struct   Struct({'': Null})
list_of_empty_struct   List(Struct({}))

The expected type of the "empty_struct" column would be pl.Struct({}), but it is pl.Struct({pl.Field('', pl.Null)}).
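
To see in advance which fields of a record are affected, the input can be scanned for empty objects before it reaches Polars. This is a minimal sketch using only the standard library; the function name `empty_object_paths` and the JSONPath-like path format are assumptions made for illustration.

```python
import json
from typing import Any, Iterator


def empty_object_paths(value: Any, path: str = "$") -> Iterator[str]:
    """Yield a JSONPath-like location for every empty object ({}) inside `value`.

    Hypothetical helper, not part of Polars.
    """
    if isinstance(value, dict):
        if not value:
            yield path
        for key, child in value.items():
            yield from empty_object_paths(child, f"{path}.{key}")
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from empty_object_paths(child, f"{path}[{index}]")


line = '{"id": 1, "empty_struct": {}, "list_of_empty_struct": [{}]}'
paths = list(empty_object_paths(json.loads(line)))
print(paths)  # ['$.empty_struct', '$.list_of_empty_struct[0]']
```

Both columns from the example above are reported, which matches the two places where the inferred schema differs from the expected pl.Struct({}).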

SampatPenugonda commented 2 months ago

I have a requirement to create an empty struct in a DataFrame and later add or rename its fields using struct.with_fields.

However, I was not able to create an empty struct: when I try pl.struct([]), it behaves like an empty literal.

Are there any recent approaches?

cmdlineluser commented 2 months ago

After https://github.com/pola-rs/polars/pull/18249 we now get:

>>> pl.Series(dtype=pl.Struct)
shape: (0,)
Series: '' [struct[1]]
[
]

However, it produces struct[1] instead of struct[0], and I'm not sure that is intended.

>>> pl.Series(dtype=pl.Struct).dtype
Struct({'': Null})