Closed: sina-programer closed this issue 5 months ago
Hi, good to know that TE2Rules is helpful in building your system. Regarding the two points you have raised, here are some suggestions:
1) Feature names: If it is not too much work, can you try mapping your column names to ASCII, with hyphens replaced by underscores? A key assumption here is that the features used by the tree ensemble are human-understandable. We do not support hyphens (and some other special characters) because they interfere with pandas while running queries on the dataframe (Ref: Pandas query string where column name contains special characters).
2) As far as speed is concerned, can you give us some more information on how large your tree ensembles are? Also, how many stages are you using? Often, running 2 or 3 stages is enough to guarantee explanations (rules) with high fidelity. Also, try using only a slice of the training data to run model.explain(). Typically, 10% to 20% of the training data should be good enough for large datasets.
If you see a better way to handle any of these concerns, let us know. We can discuss the solution and potentially integrate it.
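Both suggestions can be sketched in a few lines of pandas. This is a minimal illustration, not part of TE2Rules itself: the column names, the slice fraction, and the `name_map`/`inverse_map` helpers are made up for the example.

```python
import pandas as pd

# Hypothetical training frame with hyphenated, formula-style column names.
X_train = pd.DataFrame({"F01-F02": [1.0, 2.0, 3.0, 4.0],
                        "F03-F04": [0.1, 0.2, 0.3, 0.4]})

# Map hyphenated names to safe identifiers, keeping the mapping so the
# original formula-style names can be restored in the rule strings later.
name_map = {col: col.replace("-", "_") for col in X_train.columns}
inverse_map = {safe: original for original, safe in name_map.items()}
X_safe = X_train.rename(columns=name_map)

# Use a 10-20% slice of the training data to speed up explain().
X_slice = X_safe.sample(frac=0.2, random_state=42)
```

`X_slice` (with the matching slice of predicted labels) would then be passed to `model_explainer.explain()`, and the safe names in the returned rules can be mapped back through `inverse_map`.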
In this case, we have to keep our feature names as they are; in fact, we gave them this shape ourselves. We need the exact character '-' in our feature names because they are calculated, and the feature name really is the formula! (We need the formula because we want to reverse-engineer it.) In the future, we may have more characters in feature names, such as '/' or '*', for making more compound and powerful formulas.
First of all, our goal is to create more and different rules without losing any rule, at the peak of quality (even over-fitting).
By your definition of some parameters, I found that they are not very useful in my project. For instance, the parameter num_stages that you mentioned above really limits the domain of rules, and we do not want that at all! (Limiting the rules is the opposite of our desire.) That is why I used num_stages=None, to crawl over the whole model.
As for the data, I should examine this idea, because the data is continuous and related (sequential); so if it works with less data as well as it does with the whole data, that's great!
Ultimately, our main goal is quality; speed is the second priority, and we never want low-quality output for the sake of a fast process!
Hi there,
I tried using this package for rule generation, and the code failed when I ran this part. Can you share insights on the fix?
model_explainer = ModelExplainer(
    model=model,
    feature_names=feature_lst
)
rules = model_explainer.explain(
    X=X_train[feature_lst], y=y_train_pred, num_stages=10, min_precision=0.90
)
Here is the error message -
2023-08-20 00:17:44,885 - root - INFO -
2023-08-20 00:17:44,885 - root - INFO - Positives: 8758
2023-08-20 00:17:44,886 - root - INFO -
2023-08-20 00:17:44,886 - root - INFO - Rules from trees
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3360 try:
-> 3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-397-f16a4f91541f> in <module>
1 rules = model_explainer.explain(
----> 2 X=X_train[feature_lst], y=y_train_pred,num_stages = 10,min_precision = 0.55
3 )
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/explainer.py in explain(self, X, y, num_stages, min_precision)
183 min_precision=min_precision,
184 )
--> 185 rules = self.rule_builder.explain(X, y)
186 rules_as_str = [str(r) for r in rules]
187 return rules_as_str
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/explainer.py in explain(self, X, y)
365 log.info("")
366 log.info("Rules from trees")
--> 367 self.candidate_rules = self.random_forest.get_rules(data=self.data)
368 self.solution_rules: List[Rule] = []
369 log.info(str(len(self.candidate_rules)) + " candidate rules")
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/tree.py in get_rules(self, data)
352 for tree_index, decision_tree in enumerate(self.decision_tree_ensemble):
353 rules = decision_tree.get_rules(
--> 354 data=data, feature_names=self.feature_names, tree_id=tree_index
355 )
356 rules_from_tree = rules_from_tree + rules
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/tree.py in get_rules(self, data, feature_names, tree_id)
223 # self.aggregate_max_decision_value()
224 self._propagate_decision_rule(decision_rule=[])
--> 225 self._propagate_decision_support(data, feature_names, support)
226 rules = self._collect_rules(tree_id=tree_id, node_id=0, rules=[])
227 return rules
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/te2rules/tree.py in _propagate_decision_support(self, data, feature_names, decision_support)
193 right_decision_support = []
194 for index in self.decision_support:
--> 195 if data[index][feature_index] <= self.node.threshold:
196 left_decision_support.append(index)
197 else:
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
3456 if self.columns.nlevels > 1:
3457 return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
3459 if is_integer(indexer):
3460 indexer = [indexer]
~/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
3361 return self._engine.get_loc(casted_key)
3362 except KeyError as err:
-> 3363 raise KeyError(key) from err
3364
3365 if is_scalar(key) and isna(key) and not self.hasnans:
KeyError: 0
Hi RouyiDing, can you confirm that the feature names in feature_lst do not have spaces in them? For example, we currently do not support feature names like "employment status". We recommend using feature names without spaces, like "employment_status".
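This recommendation can be applied in a couple of lines, assuming the data lives in a pandas DataFrame (the column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical feature names containing spaces.
X_train = pd.DataFrame({"employment status": [0, 1],
                        "annual income": [50.0, 60.0]})
feature_lst = list(X_train.columns)

# Replace spaces with underscores in both the feature list and the
# dataframe, so the names passed to ModelExplainer match the columns.
feature_lst = [name.replace(" ", "_") for name in feature_lst]
X_train.columns = feature_lst
```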
Another comment: if the model was trained using X_train, then we recommend using X_train (without dropping or rearranging the columns) in model_explainer.explain(). This is important because of the way scikit-learn models are implemented: the model references each column of the data by index and assumes the same features will be present at the same index in any new data point sent to the model in the future. We recommend using X_train instead of X_train[feature_lst] as a starting point.
model_explainer = ModelExplainer(
    model=model,
    feature_names=feature_lst
)
rules = model_explainer.explain(
    X=X_train, y=y_train_pred, num_stages=10, min_precision=0.90
)
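The index-based column lookup described above can be seen with a toy scikit-learn tree (a minimal sketch; the data is made up): the fitted model records splits by column position, not by name, so reordering columns changes predictions without raising any error.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: the label depends only on column 0; column 1 is constant.
X = np.array([[0.0, 5.0], [1.0, 5.0], [0.0, 5.0], [1.0, 5.0]])
y = np.array([0, 1, 0, 1])
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# The tree stores its split as "feature index 0", not a column name,
# so swapping the two columns silently changes the prediction.
same_order = model.predict(np.array([[0.0, 5.0]]))  # column order as trained
swapped = model.predict(np.array([[5.0, 0.0]]))     # columns swapped
```

This is why dropping or rearranging columns between training and explanation (as with X_train[feature_lst]) can make the explainer look up the wrong feature and fail.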
Thank you @groshanlal for your suggestion. Very helpful. I was able to get the package to run successfully after using the array-typed X_train instead of X_train[feature_lst].
Hello,
I am a data scientist working on extracting rules from different models, such as Random Forest or XGBoost. Fortunately, I found this repo, and now I can develop my system.
But there are some issues. For example, some of our feature names look like 'F01-F02', and some contain characters outside the ASCII set. If possible, remove the limitation to ASCII-only characters in feature names, or let me do that myself.
Another thing I have faced is speed! I am exporting rules from many models, thousands of times. Is there any way to make the process faster?