Open Wouittone opened 10 months ago
I'd be interested in looking at this issue in more detail, but I'm afraid I don't really have time for now. I'll try to come back to it in a couple of weeks if this is still relevant.
I also ran into this in 0.19.8. The reproducible example provided here is good, but just in case it's helpful, here's another, even simpler one:
pl.DataFrame(
    {
        "foo": pl.Series(["a", "b", "c"], dtype=pl.Categorical),
        "bar": pl.Series(["d", "e", "f"], dtype=pl.Categorical),
    }
).melt(value_vars=["foo", "bar"])
gives:
shape: (6, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ cat │
╞══════════╪═══════╡
│ foo ┆ a │
│ foo ┆ b │
│ foo ┆ c │
│ bar ┆ a │
│ bar ┆ b │
│ bar ┆ c │
└──────────┴───────┘
Note that it appears to have interpreted the numeric representations of the categorical values of `bar` using the string representations of `foo`, so that `bar` now also uses `a`, `b`, and `c`.
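The symptom can be mimicked in plain Python, without Polars. This is only a toy model, assuming each categorical column stores integer codes plus its own lookup table; the names `foo_categories`, `bar_codes`, and so on are made up for illustration and are not Polars internals.

```python
# Toy model (NOT Polars internals): each categorical column is a list of
# integer codes plus its own table mapping code -> string.
foo_categories = ["a", "b", "c"]
foo_codes = [0, 1, 2]

bar_categories = ["d", "e", "f"]
bar_codes = [0, 1, 2]

# Assumed faulty step: stack the codes of both columns, but decode all of
# them with foo's table only.
stacked_codes = foo_codes + bar_codes
decoded = [foo_categories[code] for code in stacked_codes]

print(decoded)  # ['a', 'b', 'c', 'a', 'b', 'c']
```

Under that assumption, the `bar` rows come out as `a`, `b`, `c`, which matches the table shown above.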
With `pl.enable_string_cache()`, you get:
thread '<unnamed>' panicked at /home/runner/work/polars/polars/crates/polars-core/src/chunked_array/logical/categorical/builder.rs:114:42:
called `Option::unwrap()` on a `None` value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/bryan/mambaforge/envs/py11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 1506, in __repr__
return self.__str__()
^^^^^^^^^^^^^^
File "/home/bryan/mambaforge/envs/py11/lib/python3.11/site-packages/polars/dataframe/frame.py", line 1503, in __str__
return self._df.as_str()
^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Option::unwrap()` on a `None` value
That `unwrap` panics right on the global cache lookup: https://github.com/pola-rs/polars/blob/891b586b84c5bfbd6af7abd28c34c2a8a1ab5f58/crates/polars-core/src/chunked_array/logical/categorical/builder.rs#L114
This does not reproduce anymore.
import polars as pl
print(pl.DataFrame(
    {
        "foo": pl.Series(["a", "b", "c"], dtype=pl.Categorical),
        "bar": pl.Series(["d", "e", "f"], dtype=pl.Categorical),
    }
).melt(value_vars=["foo", "bar"]))
Gives output:
shape: (6, 2)
┌──────────┬───────┐
│ variable ┆ value │
│ --- ┆ --- │
│ str ┆ cat │
╞══════════╪═══════╡
│ foo ┆ a │
│ foo ┆ b │
│ foo ┆ c │
│ bar ┆ a │
│ bar ┆ b │
│ bar ┆ c │
└──────────┴───────┘
Closing.
@coastalwhite I do reproduce when using the string cache, both in 0.20.31 and in 1.0 beta.
with pl.StringCache():
    print(
        pl.DataFrame(
            {
                "foo": pl.Series(["a", "b", "c"], dtype=pl.Categorical),
                "bar": pl.Series(["d", "e", "f"], dtype=pl.Categorical),
            }
        )
        .melt(value_vars=["foo", "bar"])
    )
still panics on my setup:
.../lib/python3.12/site-packages/polars/dataframe/frame.py in ?(self)
1104 def __str__(self) -> str:
-> 1105 return self._df.as_str()
PanicException: called `Option::unwrap()` on a `None` value
I did start working on this issue, but it is not yet complete. Would a PR be accepted once I get some time to work it out?
Checks
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of Polars.
Reproducible example
`melt` works as expected with `pl.Utf8` data:
`melt` returns wrong values (note how 2009 values are repeated from 2008) with `pl.Categorical` data and the string cache disabled:
`melt` panics with `pl.Categorical` data and the string cache enabled:
Issue description
`polars.DataFrame.melt` panics when used with categorical data (as shown in the example).
My understanding of this issue is that it comes from the different category dictionaries somehow clashing when the method tries to combine their values, although I have not had time to investigate any further at this point.
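One way to picture how such a clash could be avoided is to build a single shared dictionary and remap each column's codes into it before stacking. The sketch below is plain Python under that assumption; `merge_categoricals` is a hypothetical helper written for this illustration, not a Polars function, and it says nothing about how Polars actually stores categoricals.

```python
def merge_categoricals(columns):
    """Stack (categories, codes) pairs into one column with a shared table.

    Hypothetical helper for illustration only; not part of Polars.
    """
    merged_categories = []  # merged table: code -> string
    position = {}           # reverse lookup: string -> merged code
    merged_codes = []
    for categories, codes in columns:
        # Translate this column's local codes into codes of the merged table.
        remap = []
        for value in categories:
            if value not in position:
                position[value] = len(merged_categories)
                merged_categories.append(value)
            remap.append(position[value])
        merged_codes.extend(remap[code] for code in codes)
    return merged_categories, merged_codes


foo = (["a", "b", "c"], [0, 1, 2])
bar = (["d", "e", "f"], [0, 1, 2])

categories, codes = merge_categoricals([foo, bar])
values = [categories[code] for code in codes]
print(values)  # ['a', 'b', 'c', 'd', 'e', 'f']
```

With the codes remapped this way, the rows from `bar` decode to `d`, `e`, `f` and no longer repeat the values of `foo`.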
As a side note, I created this issue separately from #10075 because I did not find that it completely duplicates #10075. Anyone with more insight, please feel free to move this one as necessary :-)
Expected behavior
I would expect `melt` to merge categories (provided all the columns to melt are categories) into a single, categorical series:
Installed versions