pplonski opened this issue 4 years ago
Label-encoding categorical variables for decision trees or random forests (converting every category level to a unique integer) is the easiest and most common approach, so it sounds like you're doing the right thing. If you pass in the list of category names, dtreeviz should display them in the trees.
I was trying to swap the columns between categorical and integer, but I can't make it work.
Here is the example:
import pandas as pd
from sklearn import tree
from dtreeviz.trees import *

# example data set
df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target": [0,0,0,0,0,1,1,1,1,1]})

# apply categorical conversion
df["feature_1_converted"] = [0,0,0,0,0,1,1,1,1,1]

# train the tree with the converted feature
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(df[["feature_1_converted", "feature_2"]], df["target"])

# try to plot the tree with the original features
viz = dtreeviz(classifier,
               df[["feature_1", "feature_2"]],
               df["target"],
               target_name='target',
               feature_names=["feature_1", "feature_2"],
               class_names=["0", "1"])
I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-61889aa7e363> in <module>
4 target_name='target',
5 feature_names=["feature_1", "feature_2"],
----> 6 class_names=["0", "1"]
7 )
8
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/trees.py in dtreeviz(tree_model, X_train, y_train, feature_names, target_name, class_names, precision, orientation, instance_orientation, show_root_edge_labels, show_node_labels, show_just_path, fancy, histtype, highlight_path, X, max_X_features_LR, max_X_features_TD, label_fontsize, ticks_fontsize, fontname, colors, scale)
778
779 shadow_tree = ShadowDecTree(tree_model, X_train, y_train,
--> 780 feature_names=feature_names, class_names=class_names)
781
782 if X is not None:
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in __init__(self, tree_model, X_train, y_train, feature_names, class_names)
58 y_train = y_train.values
59 self.y_train = y_train
---> 60 self.node_to_samples = ShadowDecTree.node_samples(tree_model, X_train)
61 if self.isclassifier():
62 self.unique_target_values = np.unique(y_train)
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/dtreeviz/shadow.py in node_samples(tree_model, data)
198 # Doc say: "Return a node indicator matrix where non zero elements
199 # indicates that the samples goes through the nodes."
--> 200 dec_paths = tree_model.decision_path(data)
201
202 # each sample has path taken down tree
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in decision_path(self, X, check_input)
495 indicates that the samples goes through the nodes.
496 """
--> 497 X = self._validate_X_predict(X, check_input)
498 return self.tree_.decision_path(X)
499
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/tree/_classes.py in _validate_X_predict(self, X, check_input)
378 """Validate X whenever one tries to predict, apply, predict_proba"""
379 if check_input:
--> 380 X = check_array(X, dtype=DTYPE, accept_sparse="csr")
381 if issparse(X) and (X.indices.dtype != np.intc or
382 X.indptr.dtype != np.intc):
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
529 array = array.astype(dtype, casting="unsafe", copy=False)
530 else:
--> 531 array = np.asarray(array, order=order, dtype=dtype)
532 except ComplexWarning:
533 raise ValueError("Complex data not supported\n"
~/sandbox/mljar-supervised/venv/lib/python3.6/site-packages/numpy/core/numeric.py in asarray(a, dtype, order)
536
537 """
--> 538 return array(a, dtype, copy=False, order=order)
539
540
ValueError: could not convert string to float: 'a'
Can you use LabelEncoder()? Or do it manually, e.g. via https://github.com/parrt/stratx/blob/master/notebooks/support.py:
from pandas.api.types import is_string_dtype, is_object_dtype, is_categorical_dtype

def df_string_to_cat(df: pd.DataFrame) -> dict:
    """Convert string/object columns to ordered categories; return the category index per column."""
    catencoders = {}
    for colname in df.columns:
        if is_string_dtype(df[colname]) or is_object_dtype(df[colname]):
            df[colname] = df[colname].astype('category').cat.as_ordered()
            catencoders[colname] = df[colname].cat.categories
    return catencoders

def df_cat_to_catcode(df):
    """Replace category columns with their integer codes, in place."""
    for col in df.columns:
        if is_categorical_dtype(df[col]):
            df[col] = df[col].cat.codes + 1  # reserve 0 for missing values
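For the toy frame from the snippet above, sklearn's LabelEncoder does the same job in one step. A minimal sketch (not from the thread; variable names are just for illustration):

```python
import pandas as pd
from sklearn import tree
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target":    [0,0,0,0,0,1,1,1,1,1]})

# encode the string column in place: 'a' -> 0, 'b' -> 1
encoder = LabelEncoder()
df["feature_1"] = encoder.fit_transform(df["feature_1"])

classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(df[["feature_1", "feature_2"]], df["target"])

# the encoder remembers the original labels, so codes can be mapped back
print(list(encoder.classes_))  # ['a', 'b']
```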
Ooh! You can't use
df[["feature_1", "feature_2"]],
you gotta use the encoded feature_1.
I do it all the time. Have you looked at the examples?
@tlapusan did we break something? Can you take a look and help @pplonski ?
Hi @parrt, sure, I will take a look soon, hopefully today.
@parrt we didn't break anything ;)
@pplonski it was very helpful that you sent your code; it helped a lot with debugging. The issue is that you trained the model on df[["feature_1_converted", "feature_2"]] but called the dtreeviz() method with df[["feature_1", "feature_2"]]. You need to use the same set of columns for both.
Please leave a comment if it's working for you now.
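The fix, sketched on just the sklearn side (dtreeviz calls decision_path() internally, which is where the ValueError above came from), is to use the same all-numeric frame everywhere:

```python
import pandas as pd
from sklearn import tree

df = pd.DataFrame({"feature_1": ["a","a","a","a","a","b","b","b","b","b"],
                   "feature_2": [0,0,0,0,1,0,1,1,1,1],
                   "target":    [0,0,0,0,0,1,1,1,1,1]})
df["feature_1_converted"] = df["feature_1"].astype("category").cat.codes

# the same encoded frame is used for fitting and for walking the tree
X = df[["feature_1_converted", "feature_2"]]
classifier = tree.DecisionTreeClassifier(max_depth=1)
classifier.fit(X, df["target"])

# this is the call that raised "could not convert string to float: 'a'"
# when the raw string column was passed instead of the encoded one
paths = classifier.decision_path(X)
print(paths.shape[0])  # 10 samples, one path each
```

The dtreeviz() call should then also receive X, with feature_names=["feature_1_converted", "feature_2"].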
Do you think it's possible to pass the original categorical values so they are printed in the output tree? I would like to see 'a' and 'b' for feature_1 in the plot.
The expected output tree: [image]
Hang on. You're not talking about the target. OK, let me look.
@tlapusan ha! We don't have an example where the tree nodes split on cat vars! We should think about this. Nonetheless, you gotta pass encoded vars into the classifier. We just need a split-node example for a cat var that shows how to get labels. Here's an example with a catvar split node, such as ProductID. Not sure we can label those when there are so many; maybe we just label the cats on either side of the split point?
Is there an option to pass encoding as an argument?
@parrt we do have an example for categorical nodes, but not in the readme page. We have one here, on the titanic dataset (the cabin_label feature): https://github.com/parrt/dtreeviz/blob/master/notebooks/tree_structure_example.ipynb.
It would be nice and helpful to create a GitHub wiki to document the library even better. Putting everything in the readme is kind of hard to follow and browse, especially when the library will contain even more visualizations :)
Right, if the categorical variable has a high cardinality, it's gonna be very hard to display the raw labels... and maybe even more confusing to do so. But yes, we need to discuss a more concrete example to see how it looks.
Only in the case of ordinal categorical features would it make more sense to display raw values. But I don't know an automatic way to detect encoded ordinal features. There are many ways to encode categorical variables, and implementing specific code for all of them... I don't know if it is worth it.
@pplonski what's the cardinality of your categorical features ?
Hi, I'm also stuck at the same place. We need to use some categorical features in the tree.
Hi @chenhajaj. Cats are allowed, but it shows their unique cat code at the moment. So we just need a way to indicate the cat label, but what if there are 10,000 labels?
Can you please direct me to the relevant code? I have one categorical feature with three possible values; I had to convert it to dummy columns, and as you can expect it looks bad when I plot the decision tree.
Hi guys, we're thinking about how to solve this. Maybe we show up to some n labels, or a specific subset of labels requested by the user.
@parrt good idea! There can be many ways to handle categoricals, so displaying only the category labels requested by the user makes sense. Maybe it can be done in a similar way to how class names are displayed? The user gives a dict as an input argument:
feature_category_labels = {
    "feature_1": {
        0: "category_1",
        1: "category_2",
        ...
    },
    # next features
}
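Such an argument doesn't exist in dtreeviz yet; purely as a sketch of the proposed idea (all names below are invented for illustration):

```python
# hypothetical mapping from encoded values back to display labels
feature_category_labels = {
    "feature_1": {0: "category_1", 1: "category_2"},
}

def display_label(feature, value):
    """Return the category label for an encoded value, or the value itself."""
    return feature_category_labels.get(feature, {}).get(value, str(value))

print(display_label("feature_1", 0))  # 'category_1'
print(display_label("feature_2", 1))  # '1' (no mapping registered)
```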
Hey! Any update on the issue? Is there a workaround? My cardinality is low (<10). Thank you!
@mihagazvoda after spending a few tens of minutes looking into the code, I remembered that we implemented this for TensorFlow random forests, because they can also support categorical (string) values as features. You can take a look at the Pclass node.
A workaround would be to use TF instead of what you are using now... would that be OK for you?
Sklearn's decision tree doesn't work with categorical values. Before using a decision tree, I convert categoricals, for example to integers (with LabelEncoder). In the tree visualization, the converted value is shown. Is there an option to handle categoricals better?
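Until dtreeviz supports this directly, the code-to-label mapping can at least be recovered manually from the encoder when reading a plot. A sketch (the feature values below are made up):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["red", "green", "blue", "green"])
print(codes.tolist())  # [2, 1, 0, 1] -- labels are sorted alphabetically

# a split like "feature <= 1.5" in the plot covers codes {0, 1}:
left_side = encoder.inverse_transform(np.array([0, 1]))
print(list(left_side))  # ['blue', 'green']
```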