linkedin / TE2Rules

Python library to explain Tree Ensemble models (TE) like XGBoost, using a rule list.

The limitation of feature-names #1

Closed sina-programer closed 5 months ago

sina-programer commented 1 year ago

Hello,

I am a data scientist working on extracting rules from different models such as Random Forest or XGBoost. Fortunately, I found this repo, and now I can build my system on top of it.

But there are some issues. For example, some of our feature names look like 'F01-F02', and some contain characters outside the ASCII range. If possible, please remove the restriction to ASCII-only feature names, or let me work around it myself.

Another thing I've run into is speed! I am exporting rules from many models, thousands of times. Is there any way to make the process faster?

groshanlal commented 1 year ago

Hi, good to know that TE2Rules is helpful in building your system. Regarding the two points you have raised, here are some suggestions:

1) Feature names: If it is not too much work, can you try mapping your column names to ASCII names with hyphens replaced by underscores? A key assumption here is that the features used by the tree ensemble are human-understandable. We do not support hyphens (and some other special characters) because they interfere with pandas when running queries on the dataframe (Ref: Pandas query string where column name contains special characters).

2) As far as speed is concerned, can you give us some more information on how large your tree ensembles are? Also, how many stages are you using? Often, running 2 or 3 stages is enough to guarantee explanations (rules) with high fidelity. Also, try using only a slice of the training data to run model.explain(). Typically 10% to 20% of the training data is good enough for large datasets.

If you see a better way to handle either of these concerns, let us know. We can discuss the solution and potentially integrate it.
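One way to follow suggestion 1) is to build a one-way mapping from the original feature names to pandas-safe identifiers, keeping a reverse map so the original names (which here encode formulas like 'F01-F02') can be recovered from the extracted rules. This is a minimal sketch, not part of the TE2Rules API; the `sanitize` helper is hypothetical:

```python
import re

def sanitize(names):
    """Map arbitrary feature names to pandas-query-safe identifiers.

    Returns (safe_names, reverse_map) so the original names (e.g. formulas
    like 'F01-F02') can be recovered from the extracted rules.
    """
    safe_names, reverse_map = [], {}
    for i, name in enumerate(names):
        # Replace every character outside [A-Za-z0-9_] with an underscore
        safe = re.sub(r"\W", "_", name, flags=re.ASCII)
        if safe[0].isdigit():  # identifiers must not start with a digit
            safe = "f_" + safe
        safe = f"{safe}_{i}"   # suffix the index to guarantee uniqueness
        safe_names.append(safe)
        reverse_map[safe] = name
    return safe_names, reverse_map

safe, back = sanitize(["F01-F02", "température", "F03*F04"])
# rules produced with the names in `safe` can be translated back via `back`
```

Train the model (or pass `feature_names`) with the sanitized names, then substitute the originals back into the rule strings afterwards.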
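The slicing suggestion in 2) can be sketched as below. The arrays here are synthetic stand-ins for the real training data, and the commented-out `explain` call mirrors the API shown later in this thread:

```python
import numpy as np

# Hypothetical stand-ins for the real training data; in practice these
# would be the arrays the tree ensemble was trained on.
rng = np.random.default_rng(seed=0)
X_train = rng.normal(size=(10_000, 5))
y_train_pred = rng.integers(0, 2, size=10_000)

# Draw a random 10% slice without replacement; per the suggestion above,
# rules extracted from such a slice usually retain high fidelity.
idx = rng.choice(len(X_train), size=len(X_train) // 10, replace=False)
X_slice, y_slice = X_train[idx], y_train_pred[idx]

# rules = model_explainer.explain(X=X_slice, y=y_slice, num_stages=2)
```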

sina-programer commented 1 year ago

Feature Names:

In this case, we have to keep our feature names as they are; in fact, we gave them this shape ourselves. We need the exact character '-' in our feature names because the features are computed, and each feature name is literally its formula! (We need the formula because we want to reverse-engineer it.) In the future, feature names may contain more characters such as '/' or '*' to build more compound and powerful formulas.

Speed:

First of all, our goal is to generate as many distinct rules as possible without losing any, at peak quality (even at the risk of over-fitting). Given your description of some parameters, I found that they are not very useful in my project. For instance, the num_stages parameter you mentioned above really limits the domain of rules, and we don't want that at all! (Limiting the rules is the opposite of our goal.) That is why I used num_stages=None to crawl the entire model.

As for the data, I should test this idea, because our data is continuous and sequential; if it works as well with a fraction of the data as with the whole data, that would be great!

Ultimately, our main goal is quality; speed is the second priority. We never want a low-quality output for the sake of a fast process!

RouyiDing commented 11 months ago

Hi there,

I tried using this package for rule generation, and the code failed when I ran this part. Can you share insights on a fix?

model_explainer = ModelExplainer(
    model=model,
    feature_names=feature_lst
)

rules = model_explainer.explain(
    X=X_train[feature_lst], y=y_train_pred, num_stages=10, min_precision=0.90
)

Here is the error message -

2023-08-20 00:17:44,885 - root - INFO - 
2023-08-20 00:17:44,885 - root - INFO - Positives: 8758
2023-08-20 00:17:44,886 - root - INFO - 
2023-08-20 00:17:44,886 - root - INFO - Rules from trees
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-397-f16a4f91541f> in <module>
      1 rules = model_explainer.explain(
----> 2     X=X_train[feature_lst], y=y_train_pred,num_stages = 10,min_precision = 0.55
      3 )

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/explainer.py in explain(self, X, y, num_stages, min_precision)
    183             min_precision=min_precision,
    184         )
--> 185         rules = self.rule_builder.explain(X, y)
    186         rules_as_str = [str(r) for r in rules]
    187         return rules_as_str

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/explainer.py in explain(self, X, y)
    365         log.info("")
    366         log.info("Rules from trees")
--> 367         self.candidate_rules = self.random_forest.get_rules(data=self.data)
    368         self.solution_rules: List[Rule] = []
    369         log.info(str(len(self.candidate_rules)) + " candidate rules")

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/tree.py in get_rules(self, data)
    352         for tree_index, decision_tree in enumerate(self.decision_tree_ensemble):
    353             rules = decision_tree.get_rules(
--> 354                 data=data, feature_names=self.feature_names, tree_id=tree_index
    355             )
    356             rules_from_tree = rules_from_tree + rules

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/tree.py in get_rules(self, data, feature_names, tree_id)
    223         # self.aggregate_max_decision_value()
    224         self._propagate_decision_rule(decision_rule=[])
--> 225         self._propagate_decision_support(data, feature_names, support)
    226         rules = self._collect_rules(tree_id=tree_id, node_id=0, rules=[])
    227         return rules

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/tree.py in _propagate_decision_support(self, data, feature_names, decision_support)
    193                 right_decision_support = []
    194                 for index in self.decision_support:
--> 195                     if data[index][feature_index] <= self.node.threshold:
    196                         left_decision_support.append(index)
    197                     else:

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 0
groshanlal commented 11 months ago

Hi RouyiDing, can you confirm that the feature names in feature_lst do not have spaces in them? For example, we currently do not support feature names like "employment status". We recommend using feature names without spaces, like "employment_status".
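A quick way to apply this recommendation is to rename the columns up front, before training and explaining. A minimal sketch (the column names here are illustrative):

```python
import pandas as pd

# Hypothetical frame whose column names contain spaces
df = pd.DataFrame({"employment status": [1, 0], "credit score": [700, 650]})

# Rename columns so every feature name is free of spaces
df = df.rename(columns=lambda c: c.replace(" ", "_"))
feature_lst = list(df.columns)
```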

groshanlal commented 11 months ago

Another comment: if the model was trained on X_train, then we recommend passing the same X_train (without dropping or rearranging columns) to model_explainer.explain. This is important because of the way scikit-learn models are implemented: the model references each column of the data by index and assumes the same features will be present at the same index in any new data point sent to it in the future.

We recommend using X_train instead of X_train[feature_lst] as a starting point.

model_explainer = ModelExplainer(
    model=model,
    feature_names=feature_lst
)

rules = model_explainer.explain(
    X=X_train, y=y_train_pred, num_stages=10, min_precision=0.90
)
RouyiDing commented 11 months ago

Thank you @groshanlal for your suggestion. Very helpful. I was able to get the package to run successfully after using the array form of X_train instead of X_train[feature_lst].
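For anyone hitting the same `KeyError: 0`, the conversion that worked here can be sketched as follows (the frame below is a hypothetical stand-in for the real X_train):

```python
import pandas as pd

# Hypothetical two-feature training frame
X_train = pd.DataFrame({"f0": [1.0, 2.0], "f1": [3.0, 4.0]})

# Passing the underlying array keeps positional indexing intact, which is
# what the explainer's data[index][feature_index] access pattern expects;
# indexing a DataFrame with data[0] looks up a *column* named 0 instead.
X_array = X_train.to_numpy()
```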