parrt / dtreeviz

A python library for decision tree visualization and model interpretation.
MIT License
dtype category is not working with lightgbm (check the other libraries also) #267

Open tlapusan opened 1 year ago

tlapusan commented 1 year ago

With feature preprocessing such as 'dataset["Sex"] = dataset.Sex.astype("category")', the dataset keeps the string values, like 'male', but lightgbm converts them to their int representation, like '1'.

When dtreeviz uses the prediction path to trace a sample through the tree, there will be a mismatch of values, like 'is "male" in [1]?'. This causes node_samples to contain the wrong samples and makes view() fail.
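The mismatch can be illustrated with pandas alone. This is a minimal sketch, not dtreeviz or lightgbm code: the toy series and the `left_categories` split are made up for illustration, and it assumes lightgbm's integer representation corresponds to the pandas category codes.

```python
import pandas as pd

# Toy stand-in for the Titanic "Sex" column (values are illustrative).
sex = pd.Series(["male", "female", "female", "male"]).astype("category")

# The column still holds the string labels...
labels = sex.tolist()
# ...while the integer codes are assumed to be what a split on this feature refers to.
# pandas sorts categories, so "female" -> 0 and "male" -> 1.
codes = sex.cat.codes.tolist()

# Hypothetical split: the category codes routed to one child node.
left_categories = [1]

# Comparing raw labels against codes never matches ('is "male" in [1]?').
by_label = [value in left_categories for value in labels]
# Comparing codes against codes gives the intended routing.
by_code = [code in left_categories for code in codes]

print(by_label)
print(by_code)
```

With label-based comparison no sample ever follows the branch, which is consistent with node_samples ending up with the wrong samples.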

You can reproduce the issue by using this dataset for training.

import pandas as pd

dataset_url = "https://raw.githubusercontent.com/parrt/dtreeviz/master/data/titanic/titanic.csv"
dataset = pd.read_csv(dataset_url)

dataset.fillna({"Age":dataset.Age.mean()}, inplace=True)
dataset["Sex"] = dataset.Sex.astype("category")#.cat.codes
dataset["Cabin"] = dataset.Cabin.astype("category").cat.codes
dataset.fillna({"Embarked":"?"}, inplace=True)
dataset["Embarked"] = dataset.Embarked.astype("category")#.cat.codes
print(dataset.dtypes)
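As a possible stopgap, hinted at by the commented-out `.cat.codes` calls above, the data passed to dtreeviz could carry integer codes instead of labels. The sketch below is an assumption, not a confirmed fix: `encode_categoricals` is a hypothetical helper, and it only lines up with the model if the codes come from the same category ordering that was used in training.

```python
import pandas as pd


def encode_categoricals(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: copy df and replace each category column with its integer codes."""
    encoded = df.copy()
    for col in encoded.select_dtypes(include="category").columns:
        encoded[col] = encoded[col].cat.codes
    return encoded


# Toy frame standing in for the Titanic data (values are illustrative).
toy = pd.DataFrame({
    "Sex": pd.Series(["male", "female", "male"]).astype("category"),
    "Age": [22.0, 38.0, 26.0],
})

encoded = encode_categoricals(toy)
print(encoded.dtypes)
```

The original frame is left untouched, so the string labels remain available for display while the encoded copy is used for path lookups.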