pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Introduce a validity buffer for `Struct` columns #3462

Open cbilot opened 2 years ago

cbilot commented 2 years ago

polars 0.13.37+ (compiled through edc4dbd)
python 3.10.4
Linux Mint 20.3

Describe your bug.

I would argue that the following two lists are not the same:

1. [{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]
2. [{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

I would also argue that a Series based on these two lists should not be the same. However, Polars currently conflates these two ideas. Either list produces the same Series.

None as an element in a Series

Perhaps the easiest way to see this is to perform a round trip: from List of dictionaries --> Series of struct --> List of dictionaries.

import polars as pl
_values = [
    {"a": 1, "b": 10},
    None,
    {"a": 3, "b": 30},
]
>>> _values
[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

>>> pl.Series(values=_values)
shape: (3,)
Series: '' [struct[2]]
[
        {1,10}
        {null,null}
        {3,30}
]

>>> pl.Series(values=_values).to_list()
[{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

I would argue that the round-trip should have produced:

>>> pl.Series(values=_values)
shape: (3,)
Series: '' [struct[2]]
[
        {1,10}
        null
        {3,30}
]

>>> pl.Series(values=_values).to_list()
[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

Structs with all fields set to None

By contrast, a round-trip for a dictionary with all values set to None preserves the values.

_values = [
    {"a": None, "b": None},
]
>>> _values
[{'a': None, 'b': None}]

>>> pl.Series(values=_values)
shape: (1,)
Series: '' [struct[2]]
[
        {null,null}
]

>>> pl.Series(values=_values).to_list()
[{'a': None, 'b': None}]

is_null returns true for a struct element with all fields set to None

Polars appears to treat a struct with all fields set to None as a None element in a Series.

_values = [
    {"a": None, "b": None},
]
>>> pl.Series(values=_values).is_null()
shape: (1,)
Series: 'a' [bool]
[
        true
]

I would argue that the result should have been False: a struct with all fields set to None is not the same as None as an element of a Series.

Setting a Series element to None

Polars currently does not allow setting a Series element to None for a Series[struct].

s = pl.Series(values=[{"a": 1, "b": 10}])
>>> s
shape: (1,)
Series: '' [struct[2]]
[
        {1,10}
]

>>> s.set_at_idx(0, None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/projects/polars/py-polars/polars/internals/series.py", line 2287, in set_at_idx
    raise ValueError(
ValueError: could not find the FFI function needed to set at idx for series <builtins.PySeries object at 0x7f27fa753780>

Other Notes

The above may also be applicable to List[struct] columns (as opposed to Series[struct]).

cbilot commented 2 years ago

I should probably add ... I myself am not convinced that these two Series elements are (necessarily) different in all problem domains:

  1. None
  2. {'survival': None, 'recovery_percent': None}

But might they be in some problem domains or data models? For example, the first may represent that an experiment was not performed (or not applicable); the second may represent that an experiment was performed, but that the information is not yet available.

Regardless, if the goal is to support highly recursive structures from various sources and various problem domains, Polars may need the ability to maintain the distinction between the two concepts: a Series element of None (as is done with Series of other data types) versus a struct with all fields set to None.
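Conceptually, the Arrow way of maintaining this distinction is a validity buffer: a per-element mask stored alongside the field values that records whether the struct itself is null. A toy sketch of the idea (hypothetical names, not Polars internals):

```python
from dataclasses import dataclass

@dataclass
class StructColumn:
    """Toy model: field values plus a top-level validity mask."""
    fields: dict     # field name -> list of values (may contain None)
    validity: list   # validity[i] is False when element i is null overall

    def to_list(self):
        out = []
        for i, valid in enumerate(self.validity):
            if not valid:
                out.append(None)  # top-level null, regardless of field values
            else:
                out.append({k: v[i] for k, v in self.fields.items()})
        return out

# A top-level None is distinguishable from a struct with all-None fields.
col = StructColumn(
    fields={"a": [1, None, 3], "b": [10, None, 30]},
    validity=[True, False, True],
)
print(col.to_list())  # [{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

col2 = StructColumn(fields={"a": [None], "b": [None]}, validity=[True])
print(col2.to_list())  # [{'a': None, 'b': None}]
```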

ritchie46 commented 2 years ago

I need to think about this one a bit. This will need some refactoring of the way we have designed the Struct logical type. When I designed it, I decided that we would simply use the fields' null values and not set a validity on the Struct itself.

I think we must enforce that idea. That means this

[{'a': 1, 'b': 10}, None, {'a': 3, 'b': 30}]

should fail as it should be constructed as follows:

[{'a': 1, 'b': 10}, {'a': None, 'b': None}, {'a': 3, 'b': 30}]

Then we could later decide if we need validity on structs as well.

0x26res commented 1 year ago

Hey, I have similar issues.

I would like to point out that the behaviour of polars diverges from arrow on this point, and @cbilot's suggestion would make the two converge.

So in arrow:

```python
right = pa.StructArray.from_arrays(
    [pa.array(['Valid', None])],
    names=['field1'],
    mask=pa.array([False, True])
)
assert left == right
```

- This may feel weird, but it is useful when you take nullability into consideration. You may have a nullable top-level struct whose children are not nullable. In that case you'd want to set the non-nullable child fields of a null top-level struct to default values, and the nullability constraint would still hold.
```python
with_nullable = pa.StructArray.from_arrays(
    [pa.array(['Valid', ''])],
    fields=[pa.field('field1', pa.string(), nullable=False)],
    mask=pa.array([False, True])
)
```

Vs this, which is not valid because field1 is not nullable:

```python
with_nullable = pa.StructArray.from_arrays(
    [pa.array(['Valid', None])],
    fields=[pa.field('field1', pa.string(), nullable=False)],
    mask=pa.array([False, True])
)
```

So when going from Arrow to Polars and back, some information is lost because of these semantic differences. Here's an example illustrating it:

import pyarrow as pa
import pyarrow.compute as pc
import polars as pl

field1_values = pa.array(['1', ''])
field2_values = pa.array([1, 0])

struct_array = pa.StructArray.from_arrays(
    [field1_values, field2_values],
    names=['field1', 'field2'],
    mask=pa.array([False, True])
)

table = pa.table({"col1": struct_array})

assert table['col1'].combine_chunks().field(0).to_pylist() == ['1', ''], "raw nested value is preserved, ignoring mask"
assert table['col1'].combine_chunks().field(1).to_pylist() == [1, 0], "raw nested value is preserved, ignoring mask"
assert pc.struct_field(table['col1'], 0).to_pylist() == ['1', None], "struct_field overlays the mask on the raw value"
assert pc.struct_field(table['col1'], 1).to_pylist() == [1, None], "struct_field overlays the mask on the raw value"

df = pl.from_arrow(table)

assert df['col1'].is_null().to_list() == [False, True] == table['col1'].is_null().to_pylist(), "As expected"

table_from_polars = df.to_arrow()

assert table_from_polars['col1'].combine_chunks().field(0).to_pylist() == ['1', None], "Should be ['1', '']"
assert table_from_polars['col1'].combine_chunks().field(1).to_pylist() == [1, None], "Should be [1, 0]"
assert table_from_polars['col1'].is_null().to_pylist() == [False, False], "Should be [True, False]"

stinodego commented 5 months ago

We could solve this by adding a validity buffer to Struct columns. However, this is not a simple change at all, and we're not sure yet that it is the way to go.

It currently does not have priority for us to implement this.

NickCrews commented 3 months ago

I just ran into this when implementing a PR in Ibis.

Not just Arrow: other backends such as DuckDB, Flink, and BigQuery all have nullable structs. Polars is definitely the odd one out here. This change would make Polars consistent with the rest of the ecosystem. I probably can't do the brunt of the work here, but I'm happy to give a review! Thank you!

coastalwhite commented 3 months ago

Just as a note for later: if we end up implementing this, we should test the Parquet writer/reader to make sure that this round-trips as well.