pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Unable to melt categorical-type columns with other types #12231

Open jr200 opened 1 year ago

jr200 commented 1 year ago

Reproducible example

import polars as pl
import os

os.environ["POLARS_VERBOSE"] = "1"

(pl
 .DataFrame({
    "k":[1,2],
    "a":[True, False],
    "b":["x","y"],
    })
  .with_columns(pl.col("b").cast(pl.Categorical))
  .melt(id_vars="k", value_vars=["a", "b"])
)

Log output

---------------------------------------------------------------------------
ComputeError                              Traceback (most recent call last)
/var/folders/ds/8dkzpd2s63n0__x_fqgw449h0000gn/T/ipykernel_6408/114942091.py in ?()
      9     "a":[True, False],
     10     "b":["x","y"],
     11     })
     12   .with_columns(pl.col("b").cast(pl.Categorical))
---> 13   .melt(id_vars="k", value_vars=["a", "b"])
     14 )

~/code/scratch/.venv/lib/python3.11/site-packages/polars/dataframe/frame.py in ?(self, id_vars, value_vars, variable_name, value_name)
   7145         value_vars = [] if value_vars is None else _expand_selectors(self, value_vars)
   7146         id_vars = [] if id_vars is None else _expand_selectors(self, id_vars)
   7147 
   7148         return self._from_pydf(
-> 7149             self._df.melt(id_vars, value_vars, value_name, variable_name)
   7150         )

ComputeError: failed to determine supertype of bool and cat

Issue description

This might be expected behaviour, but I wanted to check.

Melting fails when the value columns mix a categorical column with a column of another type (bool and cat in the example above).

Is it reasonable to say that the supertype of these combinations is str? Or even cat?
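
To make the question concrete, here is a minimal pure-Python sketch of a supertype lookup that falls back to str whenever one side is categorical-like. It is an illustration only: `toy_supertype`, the dtype name strings, and the widening order are hypothetical and do not mirror Polars' actual supertype rules.

```python
# Toy model of supertype resolution; dtype names are plain strings.
# This is NOT Polars' implementation, only an illustration of the proposal.

NUMERIC_ORDER = ["bool", "i64", "f64"]  # assumed widening order for the toy types
CATEGORICAL_LIKE = {"cat", "enum"}


def toy_supertype(left: str, right: str) -> str:
    """Return a common dtype name for two toy dtypes."""
    if left == right:
        return left
    # Proposal from the issue: anything mixed with cat/enum falls back to str.
    if left in CATEGORICAL_LIKE or right in CATEGORICAL_LIKE:
        return "str"
    if "str" in (left, right):
        return "str"
    # Both sides are numeric-like: pick the wider of the two.
    return max(left, right, key=NUMERIC_ORDER.index)


bool_and_cat = toy_supertype("bool", "cat")
int_and_enum = toy_supertype("i64", "enum")
```

Under this rule the melt in the reproducible example would produce a str value column, matching the output shown under "Expected behavior". Choosing cat instead would require building a new category set from the stringified values of the other column.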

Expected behavior

import polars as pl
import os

os.environ["POLARS_VERBOSE"] = "1"

(pl
 .DataFrame({
    "k":[1,2],
    "a":[True, False],
    "b":["x","y"],
    })
  .with_columns(pl.col("b").cast(pl.Categorical))
  .with_columns(pl.col("b").cast(pl.Utf8))
  .melt(id_vars="k", value_vars=["a", "b"])
)

Output:

shape: (4, 3)
┌─────┬──────────┬───────┐
│ k   ┆ variable ┆ value │
│ --- ┆ ---      ┆ ---   │
│ i64 ┆ str      ┆ str   │
╞═════╪══════════╪═══════╡
│ 1   ┆ a        ┆ true  │
│ 2   ┆ a        ┆ false │
│ 1   ┆ b        ┆ x     │
│ 2   ┆ b        ┆ y     │
└─────┴──────────┴───────┘

Installed versions

```
--------Version info---------
Polars: 0.19.12
Index type: UInt32
Platform: macOS-13.6-x86_64-i386-64bit
Python: 3.11.2 (main, Mar 18 2023, 23:16:11) [Clang 14.0.0 (clang-1400.0.29.202)]

----Optional dependencies----
adbc_driver_sqlite:
cloudpickle:
connectorx:
deltalake:
fsspec:
gevent:
matplotlib:
numpy:
openpyxl:
pandas:
pyarrow:
pydantic:
pyiceberg:
pyxlsb:
sqlalchemy:
xlsx2csv:
xlsxwriter:
```

Wainberg commented 9 months ago

This also applies to Enums, and to transpose():

>>> pl.DataFrame({'a': [1], 'b': ['foo']}).transpose().dtypes
[String]
>>> pl.DataFrame({'a': [1], 'b': pl.Series(['foo'], dtype=pl.Categorical)}).transpose().dtypes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/dataframe/frame.py", line 4033, in transpose
    return self._from_pydf(self._df.transpose(keep_names_as, column_names))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: failed to determine supertype of i64 and cat
>>> pl.DataFrame({'a': [1], 'b': pl.Series(['foo'], dtype=pl.Enum(['foo']))}).transpose().dtypes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "polars/dataframe/frame.py", line 4033, in transpose
    return self._from_pydf(self._df.transpose(keep_names_as, column_names))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
polars.exceptions.ComputeError: failed to determine supertype of i64 and enum
s-banach commented 8 months ago

At least in the case of enum and cat, we could say cat is the supertype?

deanm0000 commented 8 months ago

@s-banach Seemingly, but the new cat wouldn't necessarily have the same encodings/StringCache (not sure what term is right here) as the original, so that could be problematic. Here's an actual bug:

df=pl.DataFrame({'a':pl.Series(['apple', 'banana', 'carrot'], dtype=pl.Enum(['apple', 'banana', 'carrot'])),
                 'b':pl.Series(['planes','trains','automobiles'], dtype=pl.Categorical)})
shape: (3, 2)
┌────────┬─────────────┐
│ a      ┆ b           │
│ ---    ┆ ---         │
│ enum   ┆ cat         │
╞════════╪═════════════╡
│ apple  ┆ planes      │
│ banana ┆ trains      │
│ carrot ┆ automobiles │
└────────┴─────────────┘
df.with_columns(pl.col('a').cast(pl.Categorical)).melt()
shape: (6, 2)
┌──────────┬────────┐
│ variable ┆ value  │
│ ---      ┆ ---    │
│ str      ┆ cat    │
╞══════════╪════════╡
│ a        ┆ apple  │
│ a        ┆ banana │
│ a        ┆ carrot │
│ b        ┆ apple  │
│ b        ┆ banana │
│ b        ┆ carrot │
└──────────┴────────┘

So categoricals need to be remapped when melted (or melting them should at least be disallowed and raise).
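
The wrong output above suggests that the physical codes of `b` are being decoded against the category list of `a`. A minimal pure-Python sketch of what a remap could look like when stacking two categorical columns follows. The `(categories, codes)` representation and the `stack_categoricals` helper are assumptions made for illustration, not Polars internals.

```python
def stack_categoricals(columns):
    """Stack (categories, codes) pairs into one column with a merged dictionary.

    Each input is a pair: a list of category strings and a list of integer
    codes indexing into it. Codes are remapped so that they stay valid
    against the merged category list.
    """
    merged = []     # merged category list, in first-seen order
    index = {}      # category string -> code in the merged list
    out_codes = []
    for categories, codes in columns:
        # Build the old-code -> new-code translation for this column.
        remap = []
        for cat in categories:
            if cat not in index:
                index[cat] = len(merged)
                merged.append(cat)
            remap.append(index[cat])
        out_codes.extend(remap[c] for c in codes)
    return merged, out_codes


a = (["apple", "banana", "carrot"], [0, 1, 2])
b = (["planes", "trains", "automobiles"], [0, 1, 2])

merged, codes = stack_categoricals([a, b])
decoded = [merged[c] for c in codes]

# Without the remap, b's codes [0, 1, 2] are decoded against a's categories,
# which reproduces the repeated apple/banana/carrot rows shown above.
naive = [a[0][c] for c in a[1] + b[1]]
```

Here `decoded` keeps the values of `b` intact, while `naive` shows the symptom from the melt output.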