shakedzy / dython

A set of data tools in Python
http://shakedzy.xyz/dython/
MIT License

dython.nominal.associations handling fillna with dtype="category" #140

Closed (enrir closed this issue 1 year ago)

enrir commented 2 years ago

When the input dataset is a pandas DataFrame containing columns with dtype="category" and the default strategy 'replace' is selected, associations raises the following error: TypeError: Cannot setitem on a Categorical with a new category (0.0), set the categories first.

See code below.

from dython.nominal import associations
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A": ["a", "b", "c", "a", np.nan],
    "B": [0.0, 2.0, 1.0, 0.5, np.nan]
})

associations(df)  # no exception
df["A"] = df["A"].astype("category")
associations(df)  # raises TypeError
df["A"] = df["A"].cat.add_categories(0.0)
associations(df)  # no exception

The problem is related to how pandas fillna behaves on categorical columns: the fill value must already be one of the categories. See this Stack Overflow question.
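The pandas behaviour can be reproduced without dython. The snippet below is a minimal sketch using only pandas and NumPy; the variable names are illustrative, and the exact exception type and message may differ between pandas versions, so both TypeError and ValueError are caught.

```python
import numpy as np
import pandas as pd

# A categorical Series with one missing value; categories are "a" and "b".
s = pd.Series(["a", "b", np.nan], dtype="category")

error = None
try:
    # 0.0 is not one of the categories, so pandas refuses to fill with it.
    s.fillna(0.0)
except (TypeError, ValueError) as exc:
    error = exc
    print(type(exc).__name__, exc)

# Registering the fill value as a category first makes fillna succeed,
# and the Series keeps its category dtype.
filled = s.cat.add_categories([0.0]).fillna(0.0)
```

This mirrors the third step of the example above: once 0.0 is a known category, the fill no longer fails.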

Given that the default strategy is 'replace' with the value 0.0, I'm wondering whether this case can be handled internally by the associations method or whether it should be treated as a corner case. The category dtype is often used when memory efficiency matters, and switching to another dtype is expensive.
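As a rough sketch of what handling this internally might look like: the helper below is hypothetical (the name `fillna_category_safe` and its signature are assumptions for illustration, not part of dython and not necessarily how a fix would be implemented). It adds the replacement value as a category only for categorical columns that actually contain missing values, then fills, so no column has to change dtype.

```python
import numpy as np
import pandas as pd


def fillna_category_safe(df: pd.DataFrame, value=0.0) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs while keeping category columns categorical."""
    out = df.copy()  # leave the caller's DataFrame untouched
    for col in out.columns:
        is_categorical = isinstance(out[col].dtype, pd.CategoricalDtype)
        if is_categorical and out[col].isna().any():
            # Register the fill value as a category if it is not one already.
            if value not in out[col].cat.categories:
                out[col] = out[col].cat.add_categories([value])
        out[col] = out[col].fillna(value)
    return out


df = pd.DataFrame({
    "A": pd.Series(["a", "b", "c", "a", np.nan], dtype="category"),
    "B": [0.0, 2.0, 1.0, 0.5, np.nan],
})
result = fillna_category_safe(df)
```

The trade-off is that the categorical column gains an extra category (0.0) mixed in with its string categories, which may or may not be acceptable depending on how the values are used afterwards.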

shakedzy commented 2 years ago

Hey @enrir - thanks for bringing this up. This isn't the desired behavior. I'll need to dive deeper into this, as I'm not sure I understand what actually happens here or what the best way to handle it is.

enrir commented 2 years ago

Hi @shakedzy, I forked the repo and started writing a possible fix. If it’s ok, I will open a PR when I have something ready. 😊

shakedzy commented 2 years ago

That's perfect :) thanks!