nok / sklearn-porter

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
BSD 3-Clause "New" or "Revised" License

RFC predictions are inconsistent when using `max_depth` #52

Open skjerns opened 5 years ago

skjerns commented 5 years ago

I have created a RandomForestClassifier in Python using sklearn and converted it to C using sklearn-porter. In around 10-20% of the cases the prediction of the transpiled code is wrong.

I figured that the problem occurs when specifying max_depth.

Here's some code to reproduce the issue:

import numpy as np
import sklearn_porter
from sklearn.ensemble import RandomForestClassifier

train_x = np.random.rand(1000, 8)
train_y = np.random.randint(0, 4, 1000)

# with the default max_depth (None), the problem does not occur
rfc = RandomForestClassifier(n_estimators=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 1.0

# with max_depth=10, the integrity score drops
rfc = RandomForestClassifier(n_estimators=10, max_depth=10)
rfc.fit(train_x, train_y)
porter = sklearn_porter.Porter(rfc, language='c')
print(porter.integrity_score(train_x)) # 0.829

I also saw that Python performs its calculations with double while the C code seems to use float; might that be an issue? (Changing float -> double did not change anything, unfortunately.)
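To make that hypothesis concrete, here is a minimal sketch (with made-up numbers, not values from an actual tree) of how a threshold comparison can flip between double and float32:

import numpy as np

# Illustration of the float-vs-double hypothesis: a feature value that sits
# extremely close to a split threshold can land on the other side of the split
# once both values are rounded to 32-bit floats.
threshold = 0.7          # split threshold, double precision in Python
x = 0.7 + 1e-12          # feature value just above the threshold

print(x <= threshold)                              # False in double precision
print(np.float32(x) <= np.float32(threshold))      # True: the 1e-12 gap is lost in float32

Since switching the templates to double did not change the integrity score, though, this is probably not the root cause here.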

skjerns commented 5 years ago

Looking further into this issue, I believe it might be something with the final leaf probabilities. They are slightly different when the tree is not grown to its maximum depth, so the final probability can deviate for samples that are very close to each other.
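A quick way to check that is to look at the class counts stored in the leaves; a small standalone sketch (same kind of random data as in the report above, not sklearn-porter code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# With unlimited depth the leaves end up pure, with a depth cap many do not,
# so the per-tree class probabilities are no longer plain 0/1 votes.
X = np.random.rand(1000, 8)
y = np.random.randint(0, 4, 1000)

for depth in (None, 10):
    tree = DecisionTreeClassifier(max_depth=depth).fit(X, y)
    leaves = tree.tree_.children_left == -1
    counts = tree.tree_.value[leaves]                       # class distribution per leaf
    purity = counts.max(axis=-1) / counts.sum(axis=-1)      # fraction of the majority class
    print('max_depth=%s, minimum leaf purity: %.2f' % (depth, purity.min()))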

nok commented 5 years ago

Thanks for your work and the hints. I will check the outputs with more tests. Did you maybe check other languages, or is it only a C issue?

skjerns commented 5 years ago

I did not check other languages yet, but I assume that they have the same problem. I can check tomorrow.

skjerns commented 5 years ago

Checked it in Java: Same results. I assume it will be the same in other languages.

nok commented 5 years ago

Okay, thank you for double-checking. Then I will dig deeper into the original implementation, in particular into the differences between the max_depth conditions.

skjerns commented 5 years ago

I think a good way to approach this is to implement a predict_proba method. I originally assumed that sklearn just lets each tree predict a class and takes the majority vote (as is done in the sklearn-porter implementation). However, this is not the case, and it is likely the reason for the discrepancy.

Some more details I found in the comments on this Stack Overflow question: https://stackoverflow.com/questions/30814231/using-the-predict-proba-function-of-randomforestclassifier-in-the-safe-and-rig (see comments)

1) About prediction precision: "I insist, but this is not a question of the number of trees. Even with a single decision tree you should be able to get probability predictions with more than one digit. A decision tree aims at clustering the inputs based on some rules (the decisions), and these clusters are the leafs of the tree. If you have a leaf with 2 non-spam emails and one spam email from your training data, then the probability prediction for any email that belongs to this leaf/cluster (with regard to the rules established by fitting the model) is: 1/3 for spam and 2/3 for non-spam." – Sebastien, Jun 20 '15 at 14:49

2) About the dependencies in predictions: "Again, the sklearn definition gives the answer: the probability is computed with regard to the characteristics of the leaf corresponding to the email to test, i.e. the number of instances of each class in this leaf. This is set when your model is fitted, so it only depends on the training data. In conclusion: the result is the probability of instance 1 being spam with 60%, whatever the other 9 instances' probabilities are." – Sebastien, Jun 20 '15 at 15:00

similarly here: https://scikit-learn.org/stable/modules/tree.html#tree

So I think if a predict_proba method is implemented correctly (instead of the majority vote), the problems with max_depth will disappear. And another cool feature would be added: class probabilities :)
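To make the difference between the two aggregation rules visible, here is a small standalone sketch (same random data and parameters as in the original report; this is not the sklearn-porter code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(1000, 8)
y = np.random.randint(0, 4, 1000)
rfc = RandomForestClassifier(n_estimators=10, max_depth=10).fit(X, y)

# (a) sklearn's rule: average the per-tree probability vectors, then take the argmax
proba_avg = np.mean([tree.predict_proba(X) for tree in rfc.estimators_], axis=0)
soft_vote = proba_avg.argmax(axis=1)

# (b) the rule currently transpiled: each tree votes for one class, majority wins
votes = np.stack([tree.predict(X) for tree in rfc.estimators_]).astype(int)
hard_vote = np.apply_along_axis(np.bincount, 0, votes, minlength=4).argmax(axis=0)

print(np.mean(soft_vote == rfc.predict(X)))   # ~1.0, same rule sklearn uses internally
print(np.mean(hard_vote == rfc.predict(X)))   # < 1.0, roughly the integrity_score gap

With max_depth set, (a) and (b) genuinely disagree for a fraction of the samples, which is in line with the ~0.83 integrity score reported above.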

skjerns commented 5 years ago

This seems to be the case, indeed:

So depending on implementation: predicted probability is either (a) the mean terminal leaf probability across all trees or (b) the fraction of trees voting either class. If out-of-bag (OOB) prediction, then only in trees where the sample is OOB. For a single fully grown tree, I would guess the predicted probability only could be 0 or 1 for any class, because all terminal nodes are pure (same label). If the single tree is not fully grown and/or more trees are grown, then the predicted probability can be a positive rational number from 0 to 1.

https://stats.stackexchange.com/questions/193424/is-decision-tree-output-a-prediction-or-class-probabilities

So we'd need to change the internal structure such that each tree does not return the class index but a probability vector.

nok commented 5 years ago

Hello @skjerns, JFYI, I started to implement the predict_proba method for all listed estimators.

For that I began with the DecisionTreeClassifier estimator and the high-level languages. After that I will focus on the RandomForestClassifier estimator with the DecisionTreeClassifier as base estimator.
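In case it helps, the per-node probability vectors the generated code would need to return can be read directly from a fitted tree; a rough sketch of the idea (an assumption about the approach, not the actual template code):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(1000, 8)
y = np.random.randint(0, 4, 1000)
clf = DecisionTreeClassifier(max_depth=10).fit(X, y)

# Class distribution per node, normalized to probabilities; a transpiled
# predict_proba would embed these vectors instead of only the winning class.
values = clf.tree_.value.squeeze(axis=1)                # shape (n_nodes, n_classes)
probas = values / values.sum(axis=1, keepdims=True)

sample = X[:1]
leaf = clf.apply(sample)[0]                             # leaf node the sample ends up in
print(probas[leaf])                                     # equals clf.predict_proba(sample)[0]
print(clf.predict_proba(sample)[0])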

crea-psfc commented 5 years ago

Hi @nok and @skjerns, I have actually looked into this because I wanted to integrate into the porter C library the functionality for analyzing feature contributions: https://github.com/andosa/treeinterpreter. This technique allows you to extract the importance of the features when testing unseen samples and uncovers the drivers of the final Random Forest decision. It basically keeps track of the sample population before/after a split by associating gains/losses with the splitting feature. I'm bringing this up because it is a pretty short step from implementing this to the predict_proba method for the forest. I am currently working on that.
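For reference, the Python-side usage of that library is quite compact (a usage sketch, assuming the treeinterpreter package is installed):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from treeinterpreter import treeinterpreter as ti

X = np.random.rand(1000, 8)
y = np.random.randint(0, 4, 1000)
rfc = RandomForestClassifier(n_estimators=10, max_depth=10).fit(X, y)

# The prediction decomposes into a bias term plus one contribution per feature,
# which is the bookkeeping described above (sample populations before/after each split).
prediction, bias, contributions = ti.predict(rfc, X[:1])
print(prediction)                              # class probabilities for the sample
print(bias + contributions.sum(axis=1))        # reconstructs the same probabilities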

@nok, let me know how you want to proceed and I can commit my changes to the C templates and the __init__.py file on a dev branch.