Minimize dependency - Githubissues

qingyuanxingsi commented 6 years ago

Could you minimize the dependencies? It is not easy to run install.sh successfully!

GillesVandewiele commented 6 years ago

Hey @qingyuanxingsi, I know this repository is very dependency-heavy. That is because it is a collection of many different ensemble and decision tree induction techniques, each requiring its own dependencies.

Did you manage to get it up and running? Feel free to copy-paste your errors so I can help out.

qingyuanxingsi commented 6 years ago

1. Can you give some instructions on how to run the R inTrees algorithm with xgboost? This is what I want to do most right now. Some demo code would be quite useful. It seems that example.py contains some of this information, but it also contains too many unnecessary details. A shorter and cleaner guide would be very helpful.
2. Is the orange package really needed? It is really hard to install, especially on Windows.
3. See the following picture: is this a bug?
4. This is also red in my IDE.

GillesVandewiele commented 6 years ago

The orange package is only needed if you want the C4.5 decision tree induction algorithm, so yes, you can skip that installation for your specific use case.
That is indeed a bug. It only occurs when a column is of datatype datetime, which never happened to be the case for my datasets. I think replacing robj with ro should fix it.
Indeed, that should be commented out. It is left over from experiments I conducted with https://www.cv-foundation.org/openaccess/content_cvpr_2015/app/1A_079.pdf

Let me check how point number 1 can easily be done. Hold on.

qingyuanxingsi commented 6 years ago

@GillesVandewiele Given how rarely C4.5 is used, I strongly suggest you remove C4.5 and its related dependencies (maybe Orange?) from this package. It would be much cleaner; as it stands, it is really hard to even try the code. If it becomes easier to get started, many more people may be willing to explore it.

GillesVandewiele commented 6 years ago

That is true, I could add an extra option that either includes or excludes this package.
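One way such an option could look is to detect at runtime whether the optional dependency is installed and only expose the algorithms that can actually run. The sketch below is a minimal illustration, not code from this repository: the function name and the mapping from algorithm name to module name are hypothetical.

```python
import importlib.util


def available_inducers(optional=None):
    """Return the induction algorithms whose dependencies are installed."""
    if optional is None:
        # Hypothetical mapping: algorithm name -> module it needs.
        optional = {"C4.5": "Orange"}
    inducers = ["CART"]  # assumed to need only the core dependencies
    for name, module in optional.items():
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(module) is not None:
            inducers.append(name)
    return inducers
```

With this pattern, install.sh could skip Orange entirely, and the C4.5 constructor would simply not be offered on machines where it is missing.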

That being said, just because something is rarely used does not mean it is not good. I empirically tested all these induction algorithms, and the C4.5 algorithm outperforms the sklearn CART algorithm on almost every single dataset.

qingyuanxingsi commented 6 years ago

@GillesVandewiele Looking forward to your guide on inTrees with xgboost! Many thanks.

GillesVandewiele commented 6 years ago

So here's the outline of what you will need to do. I can write an example script for you, but only when I have some time to spare, which will probably be this weekend.

Create your XGBoost ensemble (with Python)

Iterate over the models in your XGBoost ensemble and convert them to GENESIM DecisionTree objects. For this, please take a look at https://github.com/IBCNServices/GENESIM/blob/master/constructors/genesim.py#L628 where this is being done:

for idx, tree_string in enumerate(xgb_model.clf._Booster.get_dump()):
    tree = self.parse_xgb_tree_string(tree_string, train, feature_cols, label_col,
                                      np.unique(train[label_col].values)[idx % n_classes])
    tree_list.append(tree)
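The idx % n_classes indexing in the loop above can be illustrated in isolation. The sketch below assumes a multiclass objective, where XGBoost grows one tree per class in every boosting round and stores them interleaved, so the tree at position idx belongs to class idx % n_classes. The helper name is hypothetical and not part of GENESIM.

```python
def group_trees_by_class(tree_dumps, n_classes):
    """Group interleaved tree dumps by the class they contribute to."""
    groups = {class_idx: [] for class_idx in range(n_classes)}
    for idx, dump in enumerate(tree_dumps):
        # Round r stores its trees at positions r * n_classes + class_idx.
        groups[idx % n_classes].append(dump)
    return groups
```

For six dumps and three classes, positions 0 and 3 go to class 0, positions 1 and 4 to class 1, and positions 2 and 5 to class 2.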

Now you have a list of GENESIM DecisionTree objects, which can be passed to inTrees after converting them to compliant R DataFrames using the _tree_to_R_object function in the inTrees.py file (https://github.com/IBCNServices/GENESIM/blob/0b11dc24fdd3f01aa8598c5aa5f65359173c7696/constructors/inTrees.py#L183 and https://github.com/IBCNServices/GENESIM/blob/0b11dc24fdd3f01aa8598c5aa5f65359173c7696/constructors/inTrees.py#L240).

Execute inTrees as follows (the only variable in the snippet below is treeList, which is the list of GENESIM DecisionTrees after conversion to R DataFrames):


    ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
    ro.r('names(treeList) <- c("ntree", "list")')

    rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
    rules=list(rules)
    conditions=rules[int(0.6*len(rules)):int(0.8*len(rules))]
    predictions=rules[int(0.8*len(rules)):]

    # Create an OrderedRuleList
    rulesets = []
    for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
        # Split each condition in Rules to form a RuleSet
        rulelist = []
        condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
        for rule in condition_split:
            feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]

            lte = re.findall(r'<=', rule)
            gt = re.findall(r'>', rule)
            eq = re.findall(r'==', rule)
            cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])

            extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
            if len(extract_value):
                value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
            else:
                feature = 'True'
                value = None

            rulelist.append(Condition(feature, cond, value))
        rulesets.append(Rule(idx, rulelist, prediction))

    return OrderedRuleList(rulesets)



Entirely taken from https://github.com/IBCNServices/GENESIM/blob/0b11dc24fdd3f01aa8598c5aa5f65359173c7696/constructors/inTrees.py#L244. Make sure to import OrderedRuleList, Rule and Condition as well.
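To make the regex part of the snippet above easier to follow, here is a standalone sketch of the same parsing step. It assumes inTrees conditions look like "X[,1]<=0.5 & X[,3]>2.1" with 1-based column indices, and that the default rule looks like "X[,1]==X[,1]"; both shapes are assumptions about the R output. It returns plain tuples rather than GENESIM Condition objects, and parse_condition is a hypothetical helper name.

```python
import re

# Assumed clause shapes: "X[,<col>]<op><number>" and the default "X[,1]==X[,1]".
_CLAUSE = re.compile(
    r'X\[,(\d+)\]\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)')
_DEFAULT = re.compile(r'X\[,(\d+)\]\s*==\s*X\[,\1\]')


def parse_condition(condition, feature_names):
    """Turn one condition string into (feature, operator, value) tuples."""
    parsed = []
    for clause in condition.split('&'):
        clause = clause.strip()
        if _DEFAULT.fullmatch(clause):
            # Always-true default rule, mirrored as feature 'True' like above.
            parsed.append(('True', '==', None))
            continue
        match = _CLAUSE.fullmatch(clause)
        if match is None:
            raise ValueError('Unrecognised clause: %r' % clause)
        column, operator, value = match.groups()
        # Column indices in the R output are 1-based.
        parsed.append((feature_names[int(column) - 1], operator, float(value)))
    return parsed
```

Unlike the original pattern, this one also accepts thresholds in scientific notation and raises an error on clauses it does not understand instead of failing with an IndexError.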

qingyuanxingsi commented 6 years ago

@GillesVandewiele Following your guide, I've written the following code snippet. Would you mind checking it for me in case I made any mistakes?

Note: I made two methods of inTreesClassifier public to shorten the code.

# -*- coding:utf-8 -*-

import xgboost as xgb
import pickle
from constructors.inTrees import inTreesClassifier, Rule, Condition
from constructors.ensemble import XGBClassification
from constructors.genesim import GENESIM
import re

import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects import pandas2ri

pandas2ri.activate()
import rpy2.robjects as ro

local_model = r'xxx.model'
train_df_file = r'xxx.pkl'
# python data frame
train_df = pickle.load(open(train_df_file, 'rb'))

bst = xgb.Booster({'nthread': 4})
bst.load_model(local_model)

genesim = GENESIM()
feat_names = ["aaa", "bbb", "ccc"]

# generate feature mapping
feature_mapping = {}
feature_mapping_reverse = {}
for idx, feat in enumerate(feat_names):
    feature_mapping[feat] = idx + 1
    feature_mapping_reverse[idx + 1] = feat

inTrees_clf = inTreesClassifier()
algo = XGBClassification()
treeList = []
for idx, tree_string in enumerate(bst.get_dump()):
    # binary classification
    tree = genesim.parse_xgb_tree_string(tree_string,
                                         train_df,
                                         feature_cols=feat_names,
                                         label_col='label',
                                         the_class=0)
    treeList.append(inTrees_clf.tree_to_R_object(tree, feature_mapping))

ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')

rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules = list(rules)
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]

# Create an OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]

        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])

        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None

        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))

# print the parsed rules one by one
for rule in rulesets:
    print(rule)

GillesVandewiele commented 6 years ago

Looks good at first sight, @qingyuanxingsi. Strong work! I'll check it out right now. If you want, you can always make a pull request (call the file xgb_intrees.py or something similar).

GillesVandewiele commented 6 years ago

I made some small adaptations to your code. I got it up and running now :)

# -*- coding:utf-8 -*-

import xgboost as xgb
import pickle
from constructors.inTrees import inTreesClassifier, Rule, Condition, OrderedRuleList
from constructors.ensemble import XGBClassification
from constructors.genesim import GENESIM
import re

import numpy as np
import pandas as pd
import rpy2
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

from sklearn.datasets import make_classification

pandas2ri.activate()
import rpy2.robjects as ro

# Create a dataframe with feature and target columns 
X, y = make_classification(n_samples=500, n_features=3, n_redundant=0)
train_df = pd.DataFrame(X)
feat_names = ["aaa", "bbb", "ccc"]
train_df.columns = feat_names
train_df['label'] = pd.Series(y)

# Fit an XGBClassifier
bst = xgb.XGBClassifier()
bst.fit(train_df[feat_names], train_df['label'])

# generate feature mapping
feature_mapping = {}
feature_mapping_reverse = {}
for idx, feat in enumerate(feat_names):
    feature_mapping[feat] = idx + 1
    feature_mapping_reverse[idx + 1] = feat

# Now the real work. Iterate over the dumps (string format) of the 
# different models/trees in our XGBoost model. Convert them to
# a `decisiontree`
inTrees_clf = inTreesClassifier()
algo = XGBClassification()
treeList = []
genesim = GENESIM()
for idx, tree_string in enumerate(bst._Booster.get_dump()):
    # binary classification
    tree = genesim.parse_xgb_tree_string(tree_string,
                                         train_df,
                                         feature_cols=feat_names,
                                         label_col='label',
                                         the_class=0)
    treeList.append(inTrees_clf._tree_to_R_object(tree, feature_mapping))

# Do some python magic: call the R module inTrees with our newly composed
# treelist, consisting of GENESIM `decisiontree`s
importr('inTrees')
ro.globalenv["X"] = pandas2ri.py2ri(train_df[feat_names])
ro.globalenv["target"] = ro.FactorVector(train_df['label'])
ro.globalenv["treeList"] = ro.Vector([len(treeList), ro.Vector(treeList)])
ro.r('names(treeList) <- c("ntree", "list")')

rules = ro.r('buildLearner(getRuleMetric(extractRules(treeList, X), X, target), X, target)')
rules = list(rules)

print('Standard output from the inTrees algorithm:')
print(rules)

# Now parse the standard output into Python objects so that they can be used
# for classification etc.
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]

print(conditions)

# Create an OrderedRuleList
rulesets = []
for idx, (condition, prediction) in enumerate(zip(conditions, predictions)):
    # Split each condition in Rules to form a RuleSet
    rulelist = []
    condition_split = [x.lstrip().rstrip() for x in condition.split('&')]
    for rule in condition_split:
        feature = feature_mapping_reverse[int(re.findall(r',[0-9]+]', rule)[0][1:-1])]

        lte = re.findall(r'<=', rule)
        gt = re.findall(r'>', rule)
        eq = re.findall(r'==', rule)
        cond = lte[0] if len(lte) else (gt[0] if len(gt) else eq[0])

        extract_value = re.findall(r'[=>]-?[0-9\.]+', rule)
        if len(extract_value):
            value = float(re.findall(r'[=>]-?[0-9\.]+', rule)[0][1:])
        else:
            feature = 'True'
            value = None

        rulelist.append(Condition(feature, cond, value))
    rulesets.append(Rule(idx, rulelist, prediction))
orl = OrderedRuleList(rulesets)

# print rules
print('Parsed rules:')
orl.print_rules()

Btw, I don't know if you knew this already, but I had to hack my way around a bit to get a probability for each class in the leaves of the XGBoost decision trees (gradient boosting models work quite differently from the classical ensemble techniques). Make sure to check out https://github.com/dmlc/xgboost/issues/1746 for more information on that :)
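To illustrate why the leaves do not hold probabilities directly: in gradient boosting, each leaf holds an additive contribution to a raw score (the margin), and only the summed margin is mapped to a probability. The sketch below assumes a binary logistic objective and a base margin of zero; it is a simplified illustration, not necessarily the conversion GENESIM applies.

```python
import math


def margin_to_probability(leaf_values, base_margin=0.0):
    """Map the summed leaf values of all trees to a positive-class probability."""
    # Assumption: binary logistic objective, so the link is the sigmoid.
    margin = base_margin + sum(leaf_values)
    return 1.0 / (1.0 + math.exp(-margin))
```

A single leaf value of 0.4, for example, only says that the tree pushes the score upwards; its effect on the final probability depends on what all the other trees contribute.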

Finally, just out of interest: in what kind of application and how are you going to use GENESIM?

qingyuanxingsi commented 6 years ago

@GillesVandewiele
Thanks for your help. However, rpy2 does not support Windows yet (or at least not well). So what I'm trying to do now is export the tree data frames to files, load them in R later and generate the rules there. I'm not familiar with R; can you help me modify the code below to make it work?

library(inTrees)
library(xgboost)
library(randomForest)

# binary model file
bst <- xgb.load("E:\\data\\jump\\xxx.model")

tree_dir <- "E:\\data\\jump\\gen_rule_v1"
train_data_file <- "E:\\data\\jump\\rp_jump_train_pd.csv"

filenames <- list.files(tree_dir)

treeNum <- length(filenames)

train_ds <- read.csv(train_data_file)

treeList <- NULL
treeList$ntree <- treeNum
treeList$list <- vector("list", treeNum)
for (j in 1:treeNum) {
  cur_filename = paste(tree_dir, "\\", filenames[j], sep = "")
  cur_df <- read.csv(cur_filename)
  row.names(cur_df) <- cur_df$id
  cur_df$id <- NULL
  treeList$list[[j]] <- cur_df
}

X <- train_ds[, 1:(ncol(train_ds) - 1)]
target <- train_ds[, "label"]

exec <- extractRules(treeList, X)
exec[1:2,]

Here is the content of one of the exported trees:

id,left daughter,right daughter,split var,split point,status,prediction
1,2,3,combo_avg_,3.13423,1,0
2,4,5,time_min_,1.813,1,0
4,8,9,hit_cnt_,5.5,1,0
8,16,17,time_avg_,2.5265,1,0
16,0,0,,0.0,-1,1
17,0,0,,0.0,-1,0
9,18,19,time_wait_,129.0,1,0
18,0,0,,0.0,-1,1
19,0,0,,0.0,-1,0
5,10,11,score_,322.0,1,0
10,20,21,time_avg_,4.1905,1,0
20,0,0,,0.0,-1,1
21,0,0,,0.0,-1,0
11,22,23,time_avg_,2.788,1,0
22,0,0,,0.0,-1,1
23,0,0,,0.0,-1,0
3,6,7,time_avg_,2.2305,1,0
6,12,13,time_min_,0.616,1,0
12,0,0,,0.0,-1,0
13,24,25,combo_avg_,4.53862,1,0
24,0,0,,0.0,-1,1
25,0,0,,0.0,-1,1
7,14,15,fast_action_,3.5,1,0
14,26,27,score_,3329.5,1,0
26,0,0,,0.0,-1,0
27,0,0,,0.0,-1,1
15,28,29,per_step_val_,12.2173,1,0
28,0,0,,0.0,-1,1
29,0,0,,0.0,-1,0

Much thanks!

Usage: I'm exploring how to generate rules from an xgboost model and turn them into a rule-based classifier. If the result is understandable by humans, it will be very helpful.

GillesVandewiele commented 6 years ago

I wish I could help, but my knowledge of R is very very limited... You just need to create dataframes that are the same as the output of my _tree_to_R_object function.

Other options are using a good OS for development ;) or just use a docker image (this repo has a Dockerfile already).

I would be interested to hear about results you are achieving with this approach, especially how they compare to rule learners that operate directly on the data (RIPPER, CN2, ...)

GillesVandewiele commented 6 years ago

Also, maybe you can get the rpy2 library working on Windows anyway by using a method other than a plain pip install:

https://stackoverflow.com/questions/14882477/rpy2-install-on-windows-7

qingyuanxingsi commented 6 years ago

"Rule learners that operate directly on the data (RIPPER, CN2, ...)": can you give me some papers (links) on these methods? Learning rules directly from data can be an alternative direction, as sometimes you cannot use (or trust) ML algorithms for prediction!

GillesVandewiele commented 6 years ago

Sure: https://link.springer.com/content/pdf/10.1007/s10994-005-5011-x.pdf

The first author, Fürnkranz, has done a lot of work on rule learning. One paragraph in that paper (the first paragraph of section 3) lists all prominent algorithms, with corresponding references.

Btw, this is where the Orange package comes into play again. It has implementations of e.g. CN2.

Another note is that decision trees can easily be converted to rule lists as well, by just listing all paths from the root to leaf nodes, so every decision tree induction technique and techniques such as GENESIM or ISM could be handy as well :). Moreover, I think the representation format of decision trees is much more interpretable than rule lists (Fig. 1 of https://biblio.ugent.be/publication/8537061/file/8537064.pdf)
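Listing all root-to-leaf paths can be sketched in a few lines. The example below uses a hypothetical nested-dict tree representation (not the GENESIM DecisionTree class) and assumes that samples with feature <= threshold go to the left child.

```python
def tree_to_rules(node, path=()):
    """Return one (conditions, label) rule per root-to-leaf path."""
    if "label" in node:  # leaf: the accumulated path becomes a rule
        return [(list(path), node["label"])]
    feature, threshold = node["feature"], node["threshold"]
    left = tree_to_rules(node["left"], path + ((feature, "<=", threshold),))
    right = tree_to_rules(node["right"], path + ((feature, ">", threshold),))
    return left + right


# Tiny example tree with one split at the root and one in the right branch.
example_tree = {
    "feature": "time_avg_", "threshold": 2.5,
    "left": {"label": 1},
    "right": {
        "feature": "score_", "threshold": 322.0,
        "left": {"label": 0},
        "right": {"label": 1},
    },
}
example_rules = tree_to_rules(example_tree)
```

Every leaf yields exactly one rule, so the rules of a single tree are mutually exclusive and their order does not matter, unlike in an ordered rule list.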

qingyuanxingsi commented 6 years ago

@GillesVandewiele I finally got it working on Windows, many thanks.

Moreover, can you add the metrics of the learnt rules to the output, so that I can analyse the generated rules? Just like the R output below!

GillesVandewiele commented 6 years ago

Yes, you can. OrderedRuleList has a prediction function, which allows you to calculate things such as accuracy (one minus the error). Moreover, you can also calculate the coverage of each rule by counting how many times that rule gets triggered on your dataset.
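As a rough sketch of that idea, independent of the actual OrderedRuleList API: the code below represents each rule as a hypothetical (predicate, prediction) pair, lets the first matching rule fire, and counts both correct predictions and how often each rule is triggered.

```python
def evaluate_rule_list(rules, samples, labels, default=None):
    """Return (accuracy, coverage) for an ordered list of rules."""
    coverage = [0] * len(rules)
    correct = 0
    for sample, label in zip(samples, labels):
        prediction = default
        for i, (predicate, rule_prediction) in enumerate(rules):
            if predicate(sample):  # the first matching rule decides
                coverage[i] += 1
                prediction = rule_prediction
                break
        correct += prediction == label
    accuracy = correct / len(labels) if labels else 0.0
    return accuracy, coverage
```

Dividing each coverage count by the number of samples gives a frequency comparable to the freq column that inTrees reports, and the error is one minus the accuracy.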

qingyuanxingsi commented 6 years ago

@GillesVandewiele Well, what I mean is: can't you just parse the freq and err values from the inTrees standard output? Here:

print('Standard output from the inTrees algorithm:')
print(rules)

Or can you tell me the format of the output of the inTrees package, so that I can parse it myself?

GillesVandewiele commented 6 years ago

@qingyuanxingsi good point! Of course you can :)

The lengths are in rules[:int(len(rules)*0.2)] (the first 20% of the entries), the frequencies are in the next 20%, rules[int(len(rules)*0.2):int(len(rules)*0.4)], and finally the errors are in the next 20%, rules[int(len(rules)*0.4):int(len(rules)*0.6)].

GillesVandewiele commented 6 years ago

lengths = rules[:int(0.2 * len(rules))]
frequencies = rules[int(0.2 * len(rules)):int(0.4 * len(rules))]
errors = rules[int(0.4 * len(rules)):int(0.6 * len(rules))]
conditions = rules[int(0.6 * len(rules)):int(0.8 * len(rules))]
predictions = rules[int(0.8 * len(rules)):]
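The five slices above can also be produced by one small helper that regroups the values per rule. This sketch assumes the flattened output holds five equally long columns in the order len, freq, err, condition, pred, as described in this thread; split_intrees_output is a hypothetical name, and it uses integer division so that no floating-point rounding is involved in the slice boundaries.

```python
def split_intrees_output(flat, fields=("len", "freq", "err", "condition", "pred")):
    """Regroup a column-wise flattened rule matrix into one dict per rule."""
    n_fields = len(fields)
    if len(flat) % n_fields != 0:
        raise ValueError("Length %d is not a multiple of %d" % (len(flat), n_fields))
    n_rules = len(flat) // n_fields
    # Column i occupies positions [i * n_rules, (i + 1) * n_rules).
    columns = {field: list(flat[i * n_rules:(i + 1) * n_rules])
               for i, field in enumerate(fields)}
    return [{field: columns[field][row] for field in fields}
            for row in range(n_rules)]
```

Each returned dict then carries the length, frequency, error, condition and prediction of one rule, which makes it straightforward to print them next to the parsed rules.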

GillesVandewiele commented 6 years ago

@qingyuanxingsi Did you manage to get everything up and running? Did you obtain any nice results with it? If not, I'm going to close the issue :)

predict-idlab / GENESIM

Minimize dependency #9