Closed qingyuanxingsi closed 6 years ago
Hey @qingyuanxingsi, I know this repository is very dependency-heavy. It is a collection of many different ensemble and decision tree induction techniques, each requiring its own dependencies.
Did you manage to get it up and running? Feel free to copy-paste your errors so I can help out.
The orange package is only needed if you want the C4.5 decision tree induction algorithm, so yes, you can skip that installation for your specific use case.
That is indeed a bug. It only occurs when your column is of datatype datetime, which never happened to be the case for my datasets. I think replacing robj with ro should fix it.
Indeed, that should be commented out. It is a leftover from experiments I conducted with https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_079.pdf
Let me check how point number 1 can easily be done. Hold on.
@GillesVandewiele Given how rarely C4.5 is used, I strongly suggest you remove C4.5 and its related dependencies (maybe Orange?) from this package. It would be much cleaner; as it stands, it is really hard to even try the code. If it becomes easier to get started, many more people may be willing to explore it.
That is true, I could add an extra option that either includes or excludes this package.
That being said, the fact that something is rarely used does not mean it is not good. I empirically tested all these induction algorithms, and the C4.5 algorithm outperforms the sklearn CART algorithm on almost every single dataset.
@GillesVandewiele Looking forward to your guide on inTrees with xgboost! Many thanks.
So here's the outline of what you will need to do. I can write an example script for you, but only when I have some time to spare, which will probably be this weekend.
Create your XGBoost ensemble (with Python)
Iterate over the models in your XGBoost ensemble and convert them to GENESIM DecisionTree objects. For this, please take a look at https://github.com/IBCNServices/GENESIM/blob/master/constructors/genesim.py#L628 where this is being done.
for idx, tree_string in enumerate(xgb_model.clf._Booster.get_dump()):
    tree = self.parse_xgb_tree_string(tree_string, train, feature_cols, label_col,
                                      np.unique(train[label_col].values)[idx % n_classes])
    tree_list.append(tree)
Now you have a list of GENESIM DecisionTree objects, which can be passed to inTrees, after converting them to compliant R DataFrames, using the _tree_to_R_object function in the inTrees.py file (https://github.com/IBCNServices/GENESIM/blob/0b11dc24fdd3f01aa8598c5aa5f65359173c7696/constructors/inTrees.py#L183 and https://github.com/IBCNServices/GENESIM/blob/0b11dc24fdd3f01aa8598c5aa5f65359173c7696/constructors/inTrees.py#L240)
Execute inTrees as follows (the only variable in the snippet below is treeList, which is the list of GENESIM DecisionTrees):
ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')
rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules=list(rules)
conditions=rules[int(0.6*len(rules)):int(0.8*len(rules))]
predictions=rules[int(0.8*len(rules)):]
# Create a OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]
        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])
        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None
        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))
return OrderedRuleList(rulesets)
Entirely taken from https://github.com/IBCNServices/GENESIM/blob/0b11dc24fdd3f01aa8598c5aa5f65359173c7696/constructors/inTrees.py#L244 . Make sure to import the OrderedRuleList objects etc.
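To make the parsing step easier to follow, here is a minimal standalone sketch of what the regular expressions in the snippet above are doing. It is not the GENESIM implementation: the condition strings and the feature_names mapping are made-up examples, and it assumes every test has the form X[,index] followed by an operator and a number.

```python
import re

# Hypothetical condition strings in the style inTrees emits: each test
# refers to a column of X by its 1-based index.
conditions = ["X[,1]<=0.5 & X[,3]>2.25", "X[,2]>-1.5"]

# Hypothetical mapping from 1-based column index to feature name.
feature_names = {1: "aaa", 2: "bbb", 3: "ccc"}

# One test looks like X[,<index>]<operator><number>.
TEST = re.compile(
    r"X\[,(\d+)\]\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_condition(condition):
    """Split a conjunction on '&' and return (feature, operator, value) tuples."""
    tests = []
    for part in condition.split("&"):
        match = TEST.fullmatch(part.strip())
        if match is None:
            # Anything that is not a plain numeric test is rejected here.
            raise ValueError(f"unrecognised test: {part!r}")
        index, operator, value = match.groups()
        tests.append((feature_names[int(index)], operator, float(value)))
    return tests


parsed = [parse_condition(c) for c in conditions]
for condition, tests in zip(conditions, parsed):
    print(condition, "->", tests)
```

Unlike the original snippet, this sketch raises an error on tests it does not recognise (for example categorical tests) instead of falling back to a 'True' feature, so treat it as an illustration rather than a drop-in replacement.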
@GillesVandewiele Following your guide, I've written the following code snippet! Would you mind checking it in case I made any mistakes?
Note: I made two methods of inTreesClassifier public to shorten the code.
# -*- coding:utf-8 -*-
import xgboost as xgb
import pickle
from constructors.inTrees import inTreesClassifier, Rule, Condition
from constructors.ensemble import XGBClassification
from constructors.genesim import GENESIM
import re
import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects import pandas2ri
pandas2ri.activate()
import rpy2.robjects as ro
local_model = r'xxx.model'
train_df_file = r'xxx.pkl'
# python data frame
train_df = pickle.load(open(train_df_file, 'rb'))
bst = xgb.Booster({'nthread': 4})
bst.load_model(local_model)
genesim = GENESIM()
feat_names = ["aaa", "bbb", "ccc"]
# generate feature mapping
feature_mapping = {}
feature_mapping_reverse = {}
for idx, feat in enumerate(feat_names):
    feature_mapping[feat] = idx + 1
    feature_mapping_reverse[idx + 1] = feat
inTrees_clf = inTreesClassifier()
algo = XGBClassification()
treeList = []
for idx, tree_string in enumerate(bst.get_dump()):
    # binary classification
    tree = genesim.parse_xgb_tree_string(tree_string,
                                         train_df,
                                         feature_cols=feat_names,
                                         label_col='label',
                                         the_class=0)
    treeList.append(inTrees_clf.tree_to_R_object(tree, feature_mapping))
ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')
rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules = list(rules)
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]
# Create a OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]
        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])
        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None
        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))
# print the parsed rules one by one
for rule in rulesets:
    print(rule)
Looks good at first sight, @qingyuanxingsi. Strong work! I'll check it out right now. If you want, you can always make a pull request (call the file xgb_intrees.py or something similar).
I made some small adaptations to your code. I got it up and running now :)
# -*- coding:utf-8 -*-
import xgboost as xgb
import pickle
from constructors.inTrees import inTreesClassifier, Rule, Condition, OrderedRuleList
from constructors.ensemble import XGBClassification
from constructors.genesim import GENESIM
import re
import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr
from sklearn.datasets import make_classification
pandas2ri.activate()
import rpy2.robjects as ro
# Create a dataframe with feature and target columns
X, y = make_classification(n_samples=500, n_features=3, n_redundant=0)
train_df = pd.DataFrame(X)
feat_names = ["aaa", "bbb", "ccc"]
train_df.columns = feat_names
train_df['label'] = pd.Series(y)
# Fit an XGBClassifier
bst = xgb.XGBClassifier()
bst.fit(train_df[feat_names], train_df['label'])
# generate feature mapping
feature_mapping = {}
feature_mapping_reverse = {}
for idx, feat in enumerate(feat_names):
    feature_mapping[feat] = idx + 1
    feature_mapping_reverse[idx + 1] = feat
# Now the real work. Iterate over the dumps (string format) of the
# different models/trees in our XGBoost model. Convert them to
# a `decisiontree`
inTrees_clf = inTreesClassifier()
algo = XGBClassification()
treeList = []
genesim = GENESIM()
for idx, tree_string in enumerate(bst._Booster.get_dump()):
    # binary classification
    tree = genesim.parse_xgb_tree_string(tree_string,
                                         train_df,
                                         feature_cols=feat_names,
                                         label_col='label',
                                         the_class=0)
    treeList.append(inTrees_clf._tree_to_R_object(tree, feature_mapping))
# Do some python magic: call the R module inTrees with our newly composed
# treelist, consisting of GENESIM `decisiontree`s
importr('inTrees')
ro.globalenv["X"] = pandas2ri.py2ri(train_df[feat_names])
ro.globalenv["target"] = ro.FactorVector(train_df['label'])
ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')
rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules = list(rules)
print('Standard output from the inTrees algorithm:')
print(rules)
# Now parse the std output into python object so that they can be used
# for classification etc.
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]
print(conditions)
# Create a OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]
        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])
        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None
        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))
orl = OrderedRuleList(rulesets)
# print rules
print('Parsed rules:')
orl.print_rules()
Btw, I don't know if you knew this already, but I had to hack my way around a bit to get a probability for each class in the leaves of the XGBoost decision trees (gradient boosting models work quite differently from the other classical ensemble techniques). Make sure to check out https://github.com/dmlc/xgboost/issues/1746 to get some more information on that :)
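As a rough illustration of why the leaves need special treatment: the leaves of a boosted tree hold raw scores (margins), not class probabilities. The sketch below is not the approach from the linked issue or from GENESIM. It assumes a binary logistic objective and a base margin of zero, and the two dump strings are made-up examples in the text layout that get_dump() produces.

```python
import math
import re

# Hypothetical text dumps of two boosted trees: split nodes carry a test,
# leaves carry a raw score (margin).
dumps = [
    "0:[aaa<0.5] yes=1,no=2,missing=1\n\t1:leaf=0.4\n\t2:leaf=-0.2\n",
    "0:[bbb<1.5] yes=1,no=2,missing=1\n\t1:leaf=0.1\n\t2:leaf=-0.3\n",
]

# Matches "<node id>:leaf=<value>".
LEAF = re.compile(r"(\d+):leaf=(-?[\d.]+(?:[eE][-+]?\d+)?)")


def leaf_values(dump):
    """Return {node id: raw leaf score} for one dumped tree."""
    return {int(node): float(value) for node, value in LEAF.findall(dump)}


def sigmoid(margin):
    return 1.0 / (1.0 + math.exp(-margin))


leaves = [leaf_values(d) for d in dumps]

# Suppose a sample lands in leaf 1 of the first tree and leaf 2 of the second.
# Assumption: binary logistic objective with a base margin of zero, so the
# probability is the sigmoid of the summed leaf scores.
margin = leaves[0][1] + leaves[1][2]
probability = sigmoid(margin)
print(f"margin={margin:.3f} probability={probability:.3f}")
```

A single leaf score on its own is therefore not a probability; only the sum over all trees, pushed through the link function, is.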
Finally, just out of interest: in what kind of application and how are you going to use GENESIM?
@GillesVandewiele
Thanks for your help. However, rpy2 doesn't support Windows yet (or not well). So what I'm trying to do now is export the tree data frames to files, load them in R later, and generate the rules there. I'm not familiar with R; can you help me modify the code below to make it work?
library(inTrees)
library(xgboost)
library(randomForest)
# binary model file
bst <- xgb.load("E:\\data\\jump\\xxx.model")
tree_dir <- "E:\\data\\jump\\gen_rule_v1"
train_data_file <- "E:\\data\\jump\\rp_jump_train_pd.csv"
filenames <- list.files(tree_dir)
treeNum <- length(filenames)
train_ds <- read.csv(train_data_file)
treeList <- NULL
treeList$ntree <- treeNum
treeList$list <- vector("list", treeNum)
for (j in 1:treeNum) {
  cur_filename <- paste(tree_dir, "\\", filenames[j], sep = "")
  cur_df <- read.csv(cur_filename)
  row.names(cur_df) <- cur_df$id
  cur_df$id <- NULL
  treeList$list[[j]] <- cur_df
}
X <- train_ds[, 1:(ncol(train_ds) - 1)]
target <- train_ds[, "label"]
exec <- extractRules(treeList, X)
exec[1:2,]
Here is content of one of the exported trees:
id,left daughter,right daughter,split var,split point,status,prediction
1,2,3,combo_avg_,3.13423,1,0
2,4,5,time_min_,1.813,1,0
4,8,9,hit_cnt_,5.5,1,0
8,16,17,time_avg_,2.5265,1,0
16,0,0,,0.0,-1,1
17,0,0,,0.0,-1,0
9,18,19,time_wait_,129.0,1,0
18,0,0,,0.0,-1,1
19,0,0,,0.0,-1,0
5,10,11,score_,322.0,1,0
10,20,21,time_avg_,4.1905,1,0
20,0,0,,0.0,-1,1
21,0,0,,0.0,-1,0
11,22,23,time_avg_,2.788,1,0
22,0,0,,0.0,-1,1
23,0,0,,0.0,-1,0
3,6,7,time_avg_,2.2305,1,0
6,12,13,time_min_,0.616,1,0
12,0,0,,0.0,-1,0
13,24,25,combo_avg_,4.53862,1,0
24,0,0,,0.0,-1,1
25,0,0,,0.0,-1,1
7,14,15,fast_action_,3.5,1,0
14,26,27,score_,3329.5,1,0
26,0,0,,0.0,-1,0
27,0,0,,0.0,-1,1
15,28,29,per_step_val_,12.2173,1,0
28,0,0,,0.0,-1,1
29,0,0,,0.0,-1,0
Much thanks!
Usage: I'm exploring generating rules from an xgboost model to turn it into a rule-based classifier. If the result is understandable by humans, it will be much more helpful.
I wish I could help, but my knowledge of R is very limited... You just need to create data frames that are the same as the output of my _tree_to_R_object function.
Other options are using a good OS for development ;) or just using a Docker image (this repo has a Dockerfile already).
I would be interested to hear about results you are achieving with this approach, especially how they compare to rule learners that operate directly on the data (RIPPER, CN2, ...)
Also, maybe you can get the rpy2 library working on Windows anyway by using a method other than a plain pip install:
https://stackoverflow.com/questions/14882477/rpy2-install-on-windows-7
Regarding "rule learners that operate directly on the data (RIPPER, CN2, ...)": can you point me to some papers (links) on these methods? Learning rules directly from data could be an alternative direction, as sometimes you cannot use (or trust) ML algorithms for prediction!
Sure: https://link.springer.com/content/pdf/10.1007/s10994-005-5011-x.pdf
The first author, Fürnkranz, has done a lot of work on rule learning. One paragraph in that paper (the first paragraph of section 3) lists all prominent algorithms, with corresponding references.
Btw, this is where the Orange package comes into play again. It has implementations of, e.g., CN2.
Another note is that decision trees can easily be converted to rule lists as well, by just listing all paths from the root to leaf nodes, so every decision tree induction technique and techniques such as GENESIM or ISM could be handy as well :). Moreover, I think the representation format of decision trees is much more interpretable than rule lists (Fig. 1 of https://biblio.ugent.be/publication/8537061/file/8537064.pdf)
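The root-to-leaf enumeration mentioned above can be sketched in a few lines. This is only an illustration, not GENESIM code: the tree is a small made-up example in the same tabular layout as the exported tree earlier in the thread, and it assumes the left daughter is taken when the feature is less than or equal to the split point.

```python
# A small hypothetical tree, keyed by node id:
# id -> (left daughter, right daughter, split var, split point, status, prediction)
# status 1 marks an internal node, -1 marks a leaf.
tree = {
    1: (2, 3, "combo_avg_", 3.0, 1, 0),
    2: (4, 5, "time_min_", 1.5, 1, 0),
    3: (0, 0, "", 0.0, -1, 1),
    4: (0, 0, "", 0.0, -1, 1),
    5: (0, 0, "", 0.0, -1, 0),
}


def tree_to_rules(tree, node_id=1, path=()):
    """Return one (conditions, prediction) pair per root-to-leaf path."""
    left, right, var, point, status, prediction = tree[node_id]
    if status == -1:
        return [(list(path), prediction)]
    # Assumption: left branch means var <= point, right branch means var > point.
    rules = tree_to_rules(tree, left, path + ((var, "<=", point),))
    rules += tree_to_rules(tree, right, path + ((var, ">", point),))
    return rules


rules = tree_to_rules(tree)
for conditions, prediction in rules:
    text = " & ".join(f"{var} {op} {point}" for var, op, point in conditions)
    print(f"IF {text} THEN {prediction}")
```

Each leaf yields exactly one rule, so the rules of a single tree are mutually exclusive and their order does not matter, which is not the case for an ordered rule list.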
@GillesVandewiele I finally got it working on Windows, many thanks.
Moreover, can you include the metrics of the learnt rules in the output so I can analyse the generated rules? Just like the R output below!
Yes you can. OrderedRuleList has a prediction function, which allows you to calculate things such as accuracy (the complement of the error, i.e. 1 - error). Moreover, you can also calculate the coverage of each rule by counting how many times that rule gets triggered on your dataset.
@GillesVandewiele Well, I mean, can't you just parse the freq and err from the inTrees standard output? Here:
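For illustration, here is a standalone sketch of how coverage and error could be counted for an ordered rule list in which the first matching rule fires. It deliberately does not use the OrderedRuleList API (whose method names I am not quoting here); the rules and samples are made-up data.

```python
import operator

OPS = {"<=": operator.le, ">": operator.gt, "==": operator.eq}

# Hypothetical ordered rule list: (conditions, prediction).
# An empty condition list acts as the default rule.
rules = [
    ([("aaa", "<=", 0.5)], 1),
    ([("bbb", ">", 2.0)], 0),
    ([], 1),
]

# Hypothetical labelled samples: (feature values, true label).
samples = [
    ({"aaa": 0.2, "bbb": 3.0}, 1),
    ({"aaa": 0.9, "bbb": 3.0}, 0),
    ({"aaa": 0.9, "bbb": 1.0}, 0),
    ({"aaa": 0.1, "bbb": 0.0}, 0),
]


def first_match(rules, row):
    """Index of the first rule whose conditions all hold, or None."""
    for index, (conditions, _) in enumerate(rules):
        if all(OPS[op](row[feature], value) for feature, op, value in conditions):
            return index
    return None


fired = [0] * len(rules)   # how often each rule was the one that fired
wrong = [0] * len(rules)   # how often that rule predicted the wrong label
unmatched = 0              # samples no rule applied to (counted as errors)
for row, label in samples:
    index = first_match(rules, row)
    if index is None:
        unmatched += 1
        continue
    fired[index] += 1
    if rules[index][1] != label:
        wrong[index] += 1

coverage = [count / len(samples) for count in fired]
error = [w / f if f else 0.0 for w, f in zip(wrong, fired)]
accuracy = 1 - (sum(wrong) + unmatched) / len(samples)
print(coverage, error, accuracy)
```

Because only the first matching rule is counted, these numbers need not coincide with the freq and err values that inTrees reports.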
print('Standard output from the inTrees algorithm:')
print(rules)
Or can you tell me the format of the output of the inTrees package, so I can parse it myself?
@qingyuanxingsi good point! Of course you can :)
The lengths are in rules[:int(len(rules)*0.2)] (the first 20% of the entries), the frequencies are in the next 20%, rules[int(len(rules)*0.2):int(len(rules)*0.4)], and finally the errors are in the 20% after that, rules[int(len(rules)*0.4):int(len(rules)*0.6)].
lengths = rules[:int(0.2 * len(rules))]
frequencies = rules[int(0.2 * len(rules)):int(0.4 * len(rules))]
errors = rules[int(0.4 * len(rules)):int(0.6 * len(rules))]
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]
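The same slicing can be wrapped in a small helper that regroups the flat list into one record per rule. This is only a sketch: the flat list below is made-up example data for two rules, and the column names are my labels for the five blocks in the order described above.

```python
# Hypothetical flattened output for two rules: five blocks of equal size
# (lengths, frequencies, errors, conditions, predictions).
rules = [
    "2", "1",
    "0.6", "0.4",
    "0.1", "0.25",
    "X[,1]<=0.5 & X[,3]>2.25", "X[,2]>-1.5",
    "1", "0",
]

COLUMNS = ("len", "freq", "err", "condition", "pred")


def split_columns(flat):
    """Cut the flat list into five equally sized blocks, one per column."""
    if len(flat) % len(COLUMNS):
        raise ValueError("length is not a multiple of the number of columns")
    n = len(flat) // len(COLUMNS)
    return {name: flat[i * n:(i + 1) * n] for i, name in enumerate(COLUMNS)}


columns = split_columns(rules)

# Regroup the blocks into one record per rule, converting the numeric fields.
records = [
    {"len": int(l), "freq": float(f), "err": float(e), "condition": c, "pred": p}
    for l, f, e, c, p in zip(*(columns[name] for name in COLUMNS))
]
for record in records:
    print(record)
```

Using integer division for the block size sidesteps any floating-point rounding in the slice bounds.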
@qingyuanxingsi Did you manage to get everything up and running? Did you obtain any nice results with it? Otherwise I'm going to close the issue :)
Minimize dependency?? It is not so easy to run install.sh successfully!