pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Selecting null values from an object column makes the process unresponsive #15070

Open Sentewolf opened 7 months ago

Sentewolf commented 7 months ago

Reproducible example

import polars as pl

df = pl.from_dict(
    {
        "a": [
            [1, 2, 3],
            "b",
            None,
        ]
    }
)

print("this is fine")
sub1 = df[[0, 1]]
print(sub1)

print("this is also fine")
sub2 = df["a"][2]
print(sub2)

print("this never terminates")
sub3 = df[[2]]
print(sub3)

Log output

this is fine
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ object    │
╞═══════════╡
│ [1, 2, 3] │
│ b         │
└───────────┘
this is also fine
None
this never terminates

Afterwards there is no further response from the python process.

Issue description

Selecting rows from a dataframe with null values in an object column results in an unresponsive process. Selecting the same row from the same column as a series works fine.

Null values in (for example) a string column work fine.

This problem manifests itself in many places, for example when iterating over a group_by.

The problem seems to happen somewhere around here in frame.py:

    def _take_with_series(self, s: Series) -> DataFrame:
        return self._from_pydf(self._df.take_with_series(s._s))

Expected behavior

Preferably it just works, similar to selecting a null value from a Series with objects.

If null values in an object column are an issue, then this should at least raise an error or make it crash in a horrible way, rather than idle until the end of time.
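Until the hang itself is fixed, one way to get an error instead of an indefinite wait is to run the suspect code in a child interpreter with a timeout. The following is a minimal sketch using only the standard library; the helper name `run_snippet_with_timeout` is hypothetical and not part of polars.

```python
import subprocess
import sys


def run_snippet_with_timeout(snippet: str, timeout_s: float):
    """Run `snippet` in a fresh Python interpreter (hypothetical helper).

    Returns (finished, stdout). `finished` is False if the child had to be
    killed because it ran longer than `timeout_s` seconds.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing lingers.
        return False, ""
    return True, result.stdout
```

Passing the reproducible example as the `snippet` string (this assumes polars is installed in the child interpreter) lets a test suite report the hang as a failure rather than stalling; the parent process stays responsive either way.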

Installed versions

```
--------Version info---------
Polars:               0.20.15
Index type:           UInt32
Platform:             Linux-5.15.0-1045-azure-x86_64-with-glibc2.29
Python:               3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]

----Optional dependencies----
adbc_driver_manager:
cloudpickle:
connectorx:
deltalake:
fastexcel:
fsspec:               2024.2.0
gevent:
hvplot:
matplotlib:           3.7.1
numpy:                1.24.4
openpyxl:
pandas:               2.0.3
pyarrow:              12.0.1
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

cmdlineluser commented 7 months ago

Can reproduce.

It seems specific to the usage of ._df.take_with_series as regular methods appear to work:

df = pl.DataFrame({ "a": [ [1, 2, 3], "b", None ] }, schema = {"a": pl.Object})

df.select(pl.all().gather([2]))
# shape: (1, 1)
# ┌────────┐
# │ a      │
# │ ---    │
# │ object │
# ╞════════╡
# │ null   │
# └────────┘

This is also strange:

df[2]
# shape: (1, 1)
# ┌────────┐
# │ a      │
# │ ---    │
# │ object │
# ╞════════╡
# │ None   │ # <- What is happening here?
# └────────┘

deanm0000 commented 7 months ago

I see the not-fine examples, but what do we need to do to trigger the unresponsive behavior?

I can do df.filter(pl.col('a').is_null()) and that works.

@cmdlineluser I think what's happening is an artifact of how Arrow deals with nulls. It stores an arbitrary value, and a separate flag marks that value as really being a null. In this case it seems to have the Python None stored while also setting its own null flag. For some reason, when you slice the df, it switches from showing the null to showing the Python None.

That said, I'd recommend refactoring so that you're not using Objects.
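The "stored value plus separate null flag" idea described above can be illustrated with a toy model. This is only a sketch of the hypothesis in this comment, not polars or Arrow internals; the class `MaskedObjectColumn` and its methods are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, List


@dataclass
class MaskedObjectColumn:
    """Toy model: a values buffer plus a separate validity mask."""

    values: List[Any]     # slot contents; arbitrary where the mask says "null"
    validity: List[bool]  # True = valid, False = null

    @classmethod
    def from_pylist(cls, items):
        # Python None is kept in the slot AND the slot is flagged as null.
        return cls(list(items), [item is not None for item in items])

    def show_with_mask(self, i: int) -> str:
        # Display path that consults the validity mask first.
        return repr(self.values[i]) if self.validity[i] else "null"

    def show_raw(self, i: int) -> str:
        # Display path that ignores the mask and prints the stored object.
        return repr(self.values[i])


col = MaskedObjectColumn.from_pylist([[1, 2, 3], "b", None])
print(col.show_with_mask(2))  # null
print(col.show_raw(2))        # None
```

If two code paths disagree about whether to consult the mask, the same row can render as `null` in one place and `None` in another, which would match the difference between the `gather` output and the `df[2]` output shown earlier.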

Sentewolf commented 7 months ago

@deanm0000 not sure what your question is. For me and others, the MRE I supplied results in a process that never terminates and has to be killed manually. The last output I see is "this never terminates", and then I can wait indefinitely. Are you saying that the supplied MRE runs properly for you and that you see the print(sub3) output?

Refactoring so that I'm not using Objects sounds nice in theory, but polars would be much more useful if it could be used with object columns as well. I fixed my actual issue by ensuring there are no null values in the data, so I can safely iterate over group_by groups. I'm still happily using object columns (with no meaningful option to refactor the object columns).

Even if it would be the policy of polars not to support Object columns, it would be more graceful if you receive an actual error, rather than a never-terminating process.

For clarity the output I do see:

this is fine
shape: (2, 1)
┌───────────┐
│ a         │
│ ---       │
│ object    │
╞═══════════╡
│ [1, 2, 3] │
│ b         │
└───────────┘
this is also fine
None
this never terminates

cmdlineluser commented 7 months ago

@deanm0000 The unresponsive behaviour:

pl.DataFrame([None], schema={"a": pl.Object})[[0]]