Prediction for ExtraTree model differs from sklearn (tested for C model)

LambertAn commented 6 years ago

I was trying to implement the predict_proba function for an Extra Tree model when I realized that the result returned by the transpiled version of the model differed from the one returned by sklearn.

My model contains 30 trees and 3 classes. Below are the probabilities and the predicted class for each estimator, as returned by sklearn, followed by the result for the full model:

	Proba Class 0	Proba Class 1	Proba Class 2	Predicted class
Estimator 0	0.1765	0.0000	0.8235	2
Estimator 1	0.0000	0.0000	1.0000	2
Estimator 2	0.1667	0.0000	0.8333	2
Estimator 3	0.6923	0.0000	0.3077	0
Estimator 4	0.8125	0.0417	0.1458	0
Estimator 5	0.8374	0.0064	0.1562	0
Estimator 6	0.9727	0.0000	0.0273	0
Estimator 7	0.3429	0.0000	0.6571	2
Estimator 8	0.8391	0.0095	0.1514	0
Estimator 9	0.0000	0.0000	1.0000	2
Estimator 10	0.7266	0.0078	0.2656	0
Estimator 11	0.6220	0.0000	0.3780	0
Estimator 12	0.5000	0.0000	0.5000	0
Estimator 13	0.6117	0.0000	0.3883	0
Estimator 14	0.0000	0.0000	1.0000	2
Estimator 15	0.8687	0.0000	0.1313	0
Estimator 16	1.0000	0.0000	0.0000	0
Estimator 17	0.8468	0.0170	0.1362	0
Estimator 18	0.5595	0.0000	0.4405	0
Estimator 19	0.0714	0.0000	0.9286	2
Estimator 20	0.4600	0.0000	0.5400	2
Estimator 21	0.0000	0.0000	1.0000	2
Estimator 22	0.5217	0.0000	0.4783	0
Estimator 23	0.8322	0.0049	0.1629	0
Estimator 24	0.5000	0.0000	0.5000	0
Estimator 25	0.3333	0.0000	0.6667	2
Estimator 26	1.0000	0.0000	0.0000	0
Estimator 27	0.4545	0.0000	0.5455	2
Estimator 28	0.0000	0.0000	1.0000	2
Estimator 29	0.0000	0.0000	1.0000	2
MODEL	0.4916	0.0029	0.5055	2

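17 estimators predict class 0 and 13 predict class 2, but the model predicts class 2 because sklearn averages the per-estimator probabilities and class 2 has the highest mean (0.5055 vs. 0.4916).

It therefore seems to me that the transpiled model should also base its decision on the averaged probabilities rather than on a count of per-estimator votes.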
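To make the difference concrete, here is a minimal pure-Python sketch that applies both decision rules to the probabilities from the table above. The helper names (`argmax`, `hard_vote`, `soft_vote`) are mine and purely illustrative; that the transpiled C model counts votes this way is my assumption, not something I have confirmed in the generated code.

```python
# Per-estimator probabilities copied from the table above
# (columns: class 0, class 1, class 2).
PROBAS = [
    (0.1765, 0.0000, 0.8235), (0.0000, 0.0000, 1.0000),
    (0.1667, 0.0000, 0.8333), (0.6923, 0.0000, 0.3077),
    (0.8125, 0.0417, 0.1458), (0.8374, 0.0064, 0.1562),
    (0.9727, 0.0000, 0.0273), (0.3429, 0.0000, 0.6571),
    (0.8391, 0.0095, 0.1514), (0.0000, 0.0000, 1.0000),
    (0.7266, 0.0078, 0.2656), (0.6220, 0.0000, 0.3780),
    (0.5000, 0.0000, 0.5000), (0.6117, 0.0000, 0.3883),
    (0.0000, 0.0000, 1.0000), (0.8687, 0.0000, 0.1313),
    (1.0000, 0.0000, 0.0000), (0.8468, 0.0170, 0.1362),
    (0.5595, 0.0000, 0.4405), (0.0714, 0.0000, 0.9286),
    (0.4600, 0.0000, 0.5400), (0.0000, 0.0000, 1.0000),
    (0.5217, 0.0000, 0.4783), (0.8322, 0.0049, 0.1629),
    (0.5000, 0.0000, 0.5000), (0.3333, 0.0000, 0.6667),
    (1.0000, 0.0000, 0.0000), (0.4545, 0.0000, 0.5455),
    (0.0000, 0.0000, 1.0000), (0.0000, 0.0000, 1.0000),
]


def argmax(values):
    """Index of the largest value; ties go to the lowest index."""
    return max(range(len(values)), key=lambda i: values[i])


def hard_vote(probas):
    """Each estimator votes for its own most probable class."""
    votes = [0] * len(probas[0])
    for row in probas:
        votes[argmax(row)] += 1
    return votes, argmax(votes)


def soft_vote(probas):
    """Average the probabilities over estimators, then take the argmax."""
    mean = [sum(col) / len(probas) for col in zip(*probas)]
    return mean, argmax(mean)


votes, hard_class = hard_vote(PROBAS)
mean, soft_class = soft_vote(PROBAS)
print("hard voting:", votes, "->", hard_class)
print("soft voting:", [round(p, 4) for p in mean], "->", soft_class)
```

On this data the vote count picks class 0 (17 votes against 13), while the averaged probabilities pick class 2, which is what sklearn reports.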

What do you think?

nok commented 6 years ago

Hello @LambertAn, thanks for your detailed report. Can you provide some data to reproduce the behaviour? And did you run the integrity check with integrity_score? What score did you get?

LambertAn commented 6 years ago

Thanks for getting back to me.

Below is code to build a 3-class extra tree classifier on random data.

from sklearn_porter import Porter
from sklearn.ensemble import ExtraTreesClassifier
import numpy as np

# Build random dataset
prng = np.random.RandomState(123)
X = prng.rand(50, 10)
y = prng.randint(0, 3, 50)

# Fit model
model = ExtraTreesClassifier(n_estimators=3, max_depth=3, random_state=prng)
model.fit(X, y)

# export:
porter = Porter(model, language='c')
output = porter.export(embed_data=True)
with open('extratree_randomdataset_original.c', 'w') as f_out:
    f_out.write(output)

# integrity score (agreement between the transpiled model and sklearn):
integrity = porter.integrity_score(X)
print(integrity)

# Show per-estimator and ensemble predictions for a single point
test_point = X[2:3]
for i, estimator in enumerate(model.estimators_):
    print("{}: {} -> {}".format(i, estimator.predict_proba(test_point), estimator.predict(test_point)))
print(model.predict_proba(test_point))
print(model.predict(test_point))

print(test_point)

The integrity score on the training data is 0.86. Let's look at the result for one of the data points, where each estimator predicts a different class:

Estimator 0 predicts class 0 with probabilities [0.45 0.20 0.35]
Estimator 1 predicts class 2 with probabilities [0.17 0.08 0.75]
Estimator 2 predicts class 1 with probabilities [0.24 0.52 0.24]

The model predicts class 2 with probabilities [0.29 0.27 0.45].

I attached the above Python code and two C files (the original model as generated by sklearn-porter, and a modified version that calculates the probabilities for each estimator as well as their average for the model prediction):

sklearn_porter_issue35.zip

For the point above, the original 'predict' method returns class 0, while the new 'predict_proba' method returns [0.29 0.27 0.45], i.e. class 2.
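The class 0 result would be consistent with a three-way tie in a vote count. The sketch below uses the three probability vectors quoted above; it assumes (I have not verified this in the generated C code) that the exported 'predict' gives each estimator one vote and resolves ties in favour of the lowest class index.

```python
# Probabilities for the test point, as quoted above (one row per estimator).
probas = [
    [0.45, 0.20, 0.35],  # estimator 0 -> class 0
    [0.17, 0.08, 0.75],  # estimator 1 -> class 2
    [0.24, 0.52, 0.24],  # estimator 2 -> class 1
]

# Vote counting: one vote per estimator for its most probable class.
votes = [0, 0, 0]
for row in probas:
    votes[row.index(max(row))] += 1

# list.index returns the first match, so a tie goes to the lowest class index.
hard_class = votes.index(max(votes))

# Probability averaging: mean over estimators, then the most probable class.
mean = [sum(col) / len(probas) for col in zip(*probas)]
soft_class = mean.index(max(mean))

print("votes:", votes, "->", hard_class)
print("mean: ", [round(p, 2) for p in mean], "->", soft_class)
```

Every class gets exactly one vote, so the vote count falls back to class 0, whereas the averaged probabilities clearly favour class 2.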

I hope it is enough to reproduce the problem.

nok commented 5 years ago

Hello @LambertAn, we found a small bug and fixed it (release/0.7.0: Merge branch 'master' into release/0.7.0). Can you please reinstall the package and test it again?

pip uninstall -y sklearn-porter
pip install --no-cache-dir https://github.com/nok/sklearn-porter/zipball/master

LambertAn commented 5 years ago

Hi, I finally had some time to test, but unfortunately the problem is not fixed. I used the Python script above and got exactly the same results as before, with an integrity score of 0.86.
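One way to check whether the voting scheme alone could explain a score below 1.0 is to emulate vote counting in Python and compare it with sklearn's own prediction on the same data. This is only a sketch: it does not call sklearn-porter at all, and the vote-counting rule (one vote per tree, ties to the lowest class index) is my assumption about the exported code.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Same random dataset and model as in the script above.
prng = np.random.RandomState(123)
X = prng.rand(50, 10)
y = prng.randint(0, 3, 50)
model = ExtraTreesClassifier(n_estimators=3, max_depth=3, random_state=prng)
model.fit(X, y)

n_classes = len(model.classes_)

# Winning class index of every tree for every sample,
# shape (n_estimators, n_samples).
tree_votes = np.stack(
    [np.argmax(tree.predict_proba(X), axis=1) for tree in model.estimators_]
)

# Vote counting per sample; argmax resolves ties to the lowest index.
hard_idx = np.array(
    [np.bincount(votes, minlength=n_classes).argmax() for votes in tree_votes.T]
)
hard_pred = model.classes_[hard_idx]

# sklearn's forest prediction (argmax of the averaged probabilities).
soft_pred = model.predict(X)

agreement = np.mean(hard_pred == soft_pred)
print("vote counting vs. sklearn agreement: {:.2f}".format(agreement))
```

If the printed agreement is close to the integrity score, that would point to the voting scheme as the cause; the exact value may depend on the sklearn version, so I would not expect it to match on every setup.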

jonnor commented 1 year ago

I believe this is the same issue as https://github.com/nok/sklearn-porter/issues/52.
