pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

`melt` panic with categories #10775

Open Wouittone opened 10 months ago

Wouittone commented 10 months ago

Reproducible example

import polars as pl
import polars.selectors as cs
df = pl.from_records(
    [
        {"race":"road","sex":"man","2008":"Alessandro Ballan","2009":"Cadel Evans"},
        {"race":"itt","sex":"man","2008":"Bert Grabsch","2009":"Fabian Cancellara"},
        {"race":"road","sex":"woman","2008":"Nicole Cooke","2009":"Tatiana Guderzo"},
        {"race":"itt","sex":"woman","2008":"Amber Neben","2009":"Kristin Armstrong"},
    ]
)

>>> shape: (4, 4)
┌──────┬───────┬───────────────────┬───────────────────┐
│ race ┆ sex   ┆ 2008              ┆ 2009              │
│ ---  ┆ ---   ┆ ---               ┆ ---               │
│ str  ┆ str   ┆ str               ┆ str               │
╞══════╪═══════╪═══════════════════╪═══════════════════╡
│ road ┆ man   ┆ Alessandro Ballan ┆ Cadel Evans       │
│ itt  ┆ man   ┆ Bert Grabsch      ┆ Fabian Cancellara │
│ road ┆ woman ┆ Nicole Cooke      ┆ Tatiana Guderzo   │
│ itt  ┆ woman ┆ Amber Neben       ┆ Kristin Armstrong │
└──────┴───────┴───────────────────┴───────────────────┘

melt works as expected with pl.Utf8 data:

df.melt(
    id_vars=["sex", "race"],
    variable_name="year",
    value_name="winner",
)

>>> shape: (8, 4)
┌───────┬──────┬──────┬───────────────────┐
│ sex   ┆ race ┆ year ┆ winner            │
│ ---   ┆ ---  ┆ ---  ┆ ---               │
│ str   ┆ str  ┆ str  ┆ str               │
╞═══════╪══════╪══════╪═══════════════════╡
│ man   ┆ road ┆ 2008 ┆ Alessandro Ballan │
│ man   ┆ itt  ┆ 2008 ┆ Bert Grabsch      │
│ woman ┆ road ┆ 2008 ┆ Nicole Cooke      │
│ woman ┆ itt  ┆ 2008 ┆ Amber Neben       │
│ man   ┆ road ┆ 2009 ┆ Cadel Evans       │
│ man   ┆ itt  ┆ 2009 ┆ Fabian Cancellara │
│ woman ┆ road ┆ 2009 ┆ Tatiana Guderzo   │
│ woman ┆ itt  ┆ 2009 ┆ Kristin Armstrong │
└───────┴──────┴──────┴───────────────────┘

melt returns wrong values (note how 2009 values are repeated from 2008) with pl.Categorical data and the string cache disabled:

pl.enable_string_cache(False)
df.with_columns(cs.matches("\\d+").cast(pl.Categorical)) \
    .melt(
        id_vars=["sex", "race"],
        variable_name="year",
        value_name="winner",
    )
>>> shape: (8, 4)
┌───────┬──────┬──────┬───────────────────┐
│ sex   ┆ race ┆ year ┆ winner            │
│ ---   ┆ ---  ┆ ---  ┆ ---               │
│ str   ┆ str  ┆ str  ┆ cat               │
╞═══════╪══════╪══════╪═══════════════════╡
│ man   ┆ road ┆ 2008 ┆ Alessandro Ballan │
│ man   ┆ itt  ┆ 2008 ┆ Bert Grabsch      │
│ woman ┆ road ┆ 2008 ┆ Nicole Cooke      │
│ woman ┆ itt  ┆ 2008 ┆ Amber Neben       │
│ man   ┆ road ┆ 2009 ┆ Alessandro Ballan │
│ man   ┆ itt  ┆ 2009 ┆ Bert Grabsch      │
│ woman ┆ road ┆ 2009 ┆ Nicole Cooke      │
│ woman ┆ itt  ┆ 2009 ┆ Amber Neben       │
└───────┴──────┴──────┴───────────────────┘

melt panics with pl.Categorical data and the string cache enabled:

pl.enable_string_cache(True)
df.with_columns(cs.matches("\\d+").cast(pl.Categorical)) \
    .melt(
        id_vars=["sex", "race"],
        variable_name="year",
        value_name="winner",
    )
>>> thread '<unnamed>' panicked at 'called `Option::unwrap()` on a `None` value', /home/runner/work/polars/polars/crates/polars-core/src/chunked_array/logical/categorical/builder.rs:112:42

---------------------------------------------------------------------------
PanicException                            Traceback (most recent call last)
Cell In[17], line 2
      1 pl.enable_string_cache(True)
----> 2 print(
      3     df
      4     .with_columns(cs.matches("\\d+").cast(pl.Categorical))
      5     .melt(
      6         id_vars=["sex", "race"],
      7         variable_name="year",
      8         value_name="winner",
      9     )
     10 )

File ~/Notebooks/Engineering/2023-08 - CodinGame/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py:1440, in DataFrame.__str__(self)
   1439 def __str__(self) -> str:
-> 1440     return self._df.as_str()

PanicException: called `Option::unwrap()` on a `None` value

Issue description

polars.DataFrame.melt panics when used with categorical data (as shown in the example).

My understanding is that the different category dictionaries somehow clash when the method tries to combine their values, although I have not had time to investigate any further at this point.
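
To make that hypothesis concrete, here is a toy model in plain Python of two independently dictionary-encoded columns. It is not Polars internals, and `encode` and `decode` are hypothetical helpers written only for this illustration. If the integer codes of both columns are concatenated but only the first column's dictionary is kept, the second column decodes to the first column's values, which would match the wrong output shown above:

```python
def encode(values):
    """Dictionary-encode a list of strings into (codes, categories)."""
    categories = []  # position in this list is the integer code
    index = {}       # string -> code, for fast lookup
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(categories)
            categories.append(value)
        codes.append(index[value])
    return codes, categories


def decode(codes, categories):
    """Map integer codes back to strings using one dictionary."""
    return [categories[code] for code in codes]


# Each column gets its own local dictionary, so both start at code 0.
codes_2008, cats_2008 = encode(
    ["Alessandro Ballan", "Bert Grabsch", "Nicole Cooke", "Amber Neben"]
)
codes_2009, cats_2009 = encode(
    ["Cadel Evans", "Fabian Cancellara", "Tatiana Guderzo", "Kristin Armstrong"]
)

# Assumed faulty behaviour: stack the codes, keep only the first dictionary.
naive = decode(codes_2008 + codes_2009, cats_2008)
print(naive)
```

Under this assumption the last four entries of `naive` repeat the 2008 winners, because the 2009 codes are looked up in the 2008 dictionary.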

As a side note, I created this issue separately from #10075 because I did not think it was an exact duplicate of it. Anyone with more insight, please feel free to move this one as necessary :-)

Expected behavior

I would expect melt to merge categories (provided all the columns to melt are categories) into a single, categorical series:

df.melt(
    id_vars=["sex", "race"],
    variable_name="year",
    value_name="winner",
)

>>> shape: (8, 4)
┌───────┬──────┬──────┬───────────────────┐
│ sex   ┆ race ┆ year ┆ winner            │
│ ---   ┆ ---  ┆ ---  ┆ ---               │
│ str   ┆ str  ┆ str  ┆ cat               │
╞═══════╪══════╪══════╪═══════════════════╡
│ man   ┆ road ┆ 2008 ┆ Alessandro Ballan │
│ man   ┆ itt  ┆ 2008 ┆ Bert Grabsch      │
│ woman ┆ road ┆ 2008 ┆ Nicole Cooke      │
│ woman ┆ itt  ┆ 2008 ┆ Amber Neben       │
│ man   ┆ road ┆ 2009 ┆ Cadel Evans       │
│ man   ┆ itt  ┆ 2009 ┆ Fabian Cancellara │
│ woman ┆ road ┆ 2009 ┆ Tatiana Guderzo   │
│ woman ┆ itt  ┆ 2009 ┆ Kristin Armstrong │
└───────┴──────┴──────┴───────────────────┘
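
As a rough sketch of what "merging categories" could mean, the following plain-Python toy builds one combined dictionary and remaps each column's codes into it before stacking them. This is only an illustration of the expected result, not how Polars implements or should implement it; `merge_categoricals` is a hypothetical helper, and each column is modelled as a `(codes, categories)` pair:

```python
def merge_categoricals(columns):
    """Stack dictionary-encoded columns into one (codes, categories) pair.

    `columns` is a list of (codes, categories) tuples, one per column.
    """
    merged = []  # combined dictionary; position is the new code
    index = {}   # string -> new code
    out = []
    for codes, categories in columns:
        # Translate this column's local codes into codes of the merged dictionary.
        remap = []
        for category in categories:
            if category not in index:
                index[category] = len(merged)
                merged.append(category)
            remap.append(index[category])
        out.extend(remap[code] for code in codes)
    return out, merged


# Two columns with separate local dictionaries that both use codes 0..2.
foo = ([0, 1, 2], ["a", "b", "c"])
bar = ([0, 1, 2], ["d", "e", "f"])

codes, categories = merge_categoricals([foo, bar])
values = [categories[code] for code in codes]
print(values)
```

Here the stacked column decodes to the original strings of both inputs, since every code is rewritten against the shared dictionary before the columns are combined.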

Installed versions

```
--------Version info---------
Polars: 0.18.15
Index type: UInt32
Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.37
Python: 3.11.4 | packaged by conda-forge | (main, Jun 10 2023, 18:08:17) [GCC 12.2.0]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
matplotlib:
numpy: 1.25.2
pandas: 2.0.3
pyarrow: 13.0.0
pydantic:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

Wouittone commented 10 months ago

I'd be interested in looking at this issue in more detail, but I'm afraid I don't really have time for now. I'll try to come back to it in a couple of weeks if this is still relevant.

qiemem commented 7 months ago

I also ran into this in 0.19.8. The reproducible example provided here is good, but just in case it's helpful, here's another, even simpler one:

pl.DataFrame(
    {
        "foo": pl.Series(["a", "b", "c"], dtype=pl.Categorical),
        "bar": pl.Series(["d", "e", "f"], dtype=pl.Categorical),
    }
).melt(value_vars=["foo", "bar"])

gives:

shape: (6, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ ---      ┆ ---   │
│ str      ┆ cat   │
╞══════════╪═══════╡
│ foo      ┆ a     │
│ foo      ┆ b     │
│ foo      ┆ c     │
│ bar      ┆ a     │
│ bar      ┆ b     │
│ bar      ┆ c     │
└──────────┴───────┘

Note that it appears to have interpreted the numeric representations of the categorical values of bar using the string representations of foo, so that bar now also shows a, b, and c.

With pl.enable_string_cache(), you get:

thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-core/src/chunked_array/logical/categorical/builder.rs:114:42:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/bryan/mambaforge/envs/py11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 1506, in __repr__
    return self.__str__()
           ^^^^^^^^^^^^^^
  File "/home/bryan/mambaforge/envs/py11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 1503, in __str__
    return self._df.as_str()
           ^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value

That unwrap panics right on the global cache lookup: https://github.com/pola-rs/polars/blob/891b586b84c5bfbd6af7abd28c34c2a8a1ab5f58/crates/polars-core/src/chunked_array/logical/categorical/builder.rs#L114

coastalwhite commented 3 weeks ago

This does not reproduce anymore.

import polars as pl

print(pl.DataFrame(
    {
        "foo": pl.Series(["a", "b", "c"], dtype=pl.Categorical),
        "bar": pl.Series(["d", "e", "f"], dtype=pl.Categorical),
    }
).melt(value_vars=["foo", "bar"]))

Gives output:

shape: (6, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ ---      ┆ ---   │
│ str      ┆ cat   │
╞══════════╪═══════╡
│ foo      ┆ a     │
│ foo      ┆ b     │
│ foo      ┆ c     │
│ bar      ┆ a     │
│ bar      ┆ b     │
│ bar      ┆ c     │
└──────────┴───────┘

Closing.

Wouittone commented 2 weeks ago

@coastalwhite I can still reproduce this when using the string cache, both in 0.20.31 and in 1.0 beta.

with pl.StringCache():
    print(
        pl.DataFrame(
            {
                "foo": pl.Series(["a", "b", "c"], dtype=pl.Categorical),
                "bar": pl.Series(["d", "e", "f"], dtype=pl.Categorical),
            }
        )
        .melt(value_vars=["foo", "bar"])
    )

still panics on my setup:

.../lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self)
   1104     def __str__(self) -> str:
-> 1105         return self._df.as_str()

PanicException: called `Option::unwrap()` on a `None` value

I did start working on this issue, but the fix is not yet complete. Would a PR be accepted once I get some time to work it out?