parrt / dtreeviz

A python library for decision tree visualization and model interpretation.

Display labels of categorical features in split nodes #86

Open pplonski opened 4 years ago

pplonski commented 4 years ago

Sklearn's decision tree doesn't work with categorical values. Before using a decision tree, I convert categoricals, for example to integers (with LabelEncoder). In the tree visualization, the converted value is displayed. Is there an option to better handle categoricals?

parrt commented 4 years ago

Label-encoding categorical variables for decision trees or random forests is the easiest and most correct solution. That means converting every category level to a unique integer, so it sounds like you're doing the right thing. If you pass in the list of category names, it should display them in the trees.
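
For reference, a minimal sketch of that encoding step using sklearn's LabelEncoder (the column names here are made up for illustration):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["red", "green", "red", "blue"]})

# Encode each category level as a unique integer.
enc = LabelEncoder()
df["color_enc"] = enc.fit_transform(df["color"])

# enc.classes_ holds the original labels, indexed by their integer codes.
print(dict(enumerate(enc.classes_)))  # {0: 'blue', 1: 'green', 2: 'red'}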

pplonski commented 4 years ago

I was trying to replace columns Categorical <-> Integer but I can't make it work.

Here is the example:

import pandas as pd
from sklearn import tree
from dtreeviz.trees import *

# example data set
df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target": [0,0,0,0,0,1,1,1,1,1]})
# apply categorical conversion
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]
# train the tree with the converted feature
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(df[["feature_1_converted", "feature_2"]], df["target"])
# try to plot the tree with the original features
viz = dtreeviz(classifier,
               df[["feature_1", "feature_2"]],
               df["target"],
               target_name='target',
               feature_names=["feature_1", "feature_2"],
               class_names=["0", "1"])

I got this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-26-61889aa7e363> in <module>
      4                target_name='target',
      5               feature_names=["feature_1", "feature_2"],
----> 6                class_names=["0", "1"]
      7               )  
      8 

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/trees.py in dtreeviz(tree_model, X_train, y_train, feature_names, target_name, class_names, precision, orientation, instance_orientation, show_root_edge_labels, show_node_labels, show_just_path, fancy, histtype, highlight_path, X, max_X_features_LR, max_X_features_TD, label_fontsize, ticks_fontsize, fontname, colors, scale)
    778 
    779     shadow_tree = ShadowDecTree(tree_model, X_train, y_train,
--> 780                                 feature_names=feature_names, class_names=class_names)
    781 
    782     if X is not None:

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in __init__(self, tree_model, X_train, y_train, feature_names, class_names)
     58             y_train = y_train.values
     59         self.y_train = y_train
---> 60         self.node_to_samples = ShadowDecTree.node_samples(tree_model, X_train)
     61         if self.isclassifier():
     62             self.unique_target_values = np.unique(y_train)

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in node_samples(tree_model, data)
    198         # Doc say: "Return a node indicator matrix where non zero elements
    199         #           indicates that the samples goes through the nodes."
--> 200         dec_paths = tree_model.decision_path(data)
    201 
    202         # each sample has path taken down tree

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in decision_path(self, X, check_input)
    495             indicates that the samples goes through the nodes.
    496         """
--> 497         X = self._validate_X_predict(X, check_input)
    498         return self.tree_.decision_path(X)
    499 

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
    378         """Validate X whenever one tries to predict, apply, predict_proba"""
    379         if check_input:
--> 380             X = check_array(X, dtype=DTYPE, accept_sparse="csr")
    381             if issparse(X) and (X.indices.dtype != np.intc or
    382                                 X.indptr.dtype != np.intc):

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    529                     array = array.astype(dtype, casting="unsafe", copy=False)
    530                 else:
--> 531                     array = np.asarray(array, order=order, dtype=dtype)
    532             except ComplexWarning:
    533                 raise ValueError("Complex data not supported\n"

~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
    536 
    537     """
--> 538     return array(a, dtype, copy=False, order=order)
    539 
    540 

ValueError: could not convert string to float: 'a'
parrt commented 4 years ago

Can you use the LabelEncoder()? Or do it manually via https://github.com/parrt/stratx/blob/master/notebooks/support.py

import pandas as pd
from pandas.api.types import is_string_dtype, is_object_dtype, is_categorical_dtype

def df_string_to_cat(df:pd.DataFrame) -> dict:
    # Convert string/object columns to ordered categoricals and remember
    # each column's original labels.
    catencoders = {}
    for colname in df.columns:
        if is_string_dtype(df[colname]) or is_object_dtype(df[colname]):
            df[colname] = df[colname].astype('category').cat.as_ordered()
            catencoders[colname] = df[colname].cat.categories
    return catencoders

def df_cat_to_catcode(df):
    # Replace each categorical column with its integer codes, shifted by +1.
    for col in df.columns:
        if is_categorical_dtype(df[col]):
            df[col] = df[col].cat.codes + 1
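
A quick usage sketch of those helpers, on made-up data:

df = pd.DataFrame({"feature_1": ["a", "b", "a"], "feature_2": [0, 1, 1]})
catencoders = df_string_to_cat(df)  # remembers the original labels per column
df_cat_to_catcode(df)               # feature_1 becomes [1, 2, 1]
print(catencoders["feature_1"])     # Index(['a', 'b'], dtype='object')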
parrt commented 4 years ago

Ooh! You can't use

df[["feature_1", "feature_2"]]

there; you've got to use the encoded feature_1.

parrt commented 4 years ago

I do it all the time. have you looked at the examples?

parrt commented 4 years ago

@tlapusan did we break something? Can you take a look and help @pplonski ?

tlapusan commented 4 years ago

Hi @parrt, sure, I will take a look soon, I hope that today


tlapusan commented 4 years ago

@tlapusan did we break something? Can you take a look and help @pplonski ?

@parrt we didn't break anything ;)

@pplonski it was very helpful that you sent your code; it helped a lot with debugging. The issue is that you trained the model using df[["feature_1_converted", "feature_2"]] but then called the dtreeviz() method with df[["feature_1", "feature_2"]]. You need to use the same set of columns for both.

Please leave a comment if it's working for you now.
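
For clarity, the corrected call from the example above would look like this (same dtreeviz 1.x API as in the traceback; the encoded column goes to both fit() and dtreeviz()):

viz = dtreeviz(classifier,
               df[["feature_1_converted", "feature_2"]],  # same columns as in fit()
               df["target"],
               target_name='target',
               feature_names=["feature_1_converted", "feature_2"],
               class_names=["0", "1"])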

[Screenshot: the example tree rendering correctly after the fix]
pplonski commented 4 years ago

Do you think it's possible to pass the original categorical values, to be printed in the output tree? I would like to see 'a' and 'b' for feature_1 in the plot.

pplonski commented 4 years ago

The expected output tree:

[Mockup: the expected tree plot, with 'a' and 'b' shown as the feature_1 split labels]

parrt commented 4 years ago

Hang on, you're not talking about the target. OK, let me look.

parrt commented 4 years ago

@tlapusan ha! We don't have an example where the tree nodes split on cat vars! We should think about this. Nonetheless, you've got to pass encoded vars into the classifier. We just need an example of a categorical split node that shows how to get labels. Consider a catvar split node on something like ProductID: not sure we can label those when there are so many. Maybe we just label the cats on either side of the split point?

pplonski commented 4 years ago

Is there an option to pass the encoding as an argument?

tlapusan commented 4 years ago

@parrt we do have an example for categorical nodes, but not in the readme page. We have one here, on the Titanic dataset (the cabin_label feature): https://github.com/parrt/dtreeviz/blob/master/notebooks/tree_structure_example.ipynb.

It would be nice and helpful to create a GitHub wiki, to document the library even better. Putting everything in the readme is kind of hard to follow and browse, especially when the library contains even more visualizations :)

Right, if the categorical variable has high cardinality, it's going to be very hard to display the raw labels... and maybe even more confusing to do so. But yes, we need to see and discuss a more concrete example to see how it looks.

Only for ordinal categorical features would it make more sense to display raw values. But I don't know of an automatic way to detect encoded ordinal features. There are many ways to encode categorical variables, and implementing specific code for all of them... I don't know if it's worth it.

@pplonski what's the cardinality of your categorical features?

chenhajaj commented 4 years ago

Hi, I'm also stuck at the same place. We need to use some categorical features in the tree.

parrt commented 4 years ago

Hi @chenhajaj. Categorical features are allowed, but the plot shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?

chenhajaj commented 4 years ago

Hi @chenhajaj. Categorical features are allowed, but the plot shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?

Can you please direct me to the relevant code? I have one categorical feature with three possible values; I had to convert it to dummy columns, and as you can expect, it looks bad when I plot the decision tree.

parrt commented 3 years ago

Hi guys, we're thinking about how to solve this. Maybe we show up to some n labels, or a specific subset of labels requested by the user.

pplonski commented 3 years ago

@parrt good idea! There can be many ways to handle categoricals, so only the category labels requested by the user should be displayed. Maybe it can be done in a similar way to how class names are displayed? The user gives a dict as an input argument:

feature_category_labels = {
  "feature_1": {
    0: "category_1",
    1: "category_2", 
    ...
  },
  # next features
}
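
A hypothetical helper for building such a dict from the catencoders returned by df_string_to_cat() above (build_feature_category_labels is an illustrative name, not dtreeviz API; the +1 shift matches df_cat_to_catcode()):

def build_feature_category_labels(catencoders: dict) -> dict:
    # Map each column's integer code back to its original label.
    return {col: {code + 1: label for code, label in enumerate(cats)}
            for col, cats in catencoders.items()}
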
mihagazvoda commented 1 year ago

Hey! Any update on the issue? Is there a workaround? My cardinality is low (<10). Thank you!

tlapusan commented 1 year ago

@mihagazvoda after spending a few tens of minutes looking into the code, I remembered that we implemented this for TensorFlow random forests, because they can also support categorical (string) values as features. You can take a look at the Pclass node.

[Screenshot: a TensorFlow random forest tree rendered by dtreeviz, with raw string labels in the Pclass split node]

A workaround would be to use TF instead of what you are using now... Would that be OK for you?
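
For anyone trying that route, a rough sketch, assuming the tensorflow_decision_forests package and dtreeviz 2.x's model() adapter (the titanic.csv path and class names are illustrative; check the dtreeviz TF-DF notebook for exact usage):

import pandas as pd
import tensorflow_decision_forests as tfdf
import dtreeviz

df = pd.read_csv("titanic.csv")  # illustrative path; any dataset with string features works
label = "Survived"

# TF-DF consumes string categoricals directly, so no label encoding is needed.
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df, label=label)
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# dtreeviz 2.x adapter; tree_index picks a single tree out of the forest.
viz_model = dtreeviz.model(model,
                           tree_index=0,
                           X_train=df.drop(columns=[label]),
                           y_train=df[label],
                           feature_names=list(df.drop(columns=[label]).columns),
                           target_name=label,
                           class_names=["died", "survived"])
viz_model.view()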