nok / sklearn-porter

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
BSD 3-Clause "New" or "Revised" License
1.28k stars 170 forks

Export Probability of the predictions for decision trees involved model #13

Open Xiadalei opened 7 years ago

Xiadalei commented 7 years ago

Firstly, thank you for this great project.

For models built on decision trees, such as a decision tree or a random forest, the probability of a prediction is often as crucial as the prediction itself, because it carries more information than the bare result.

As far as implementation goes, since the porter needs to build every leaf node anyway, I think it's possible to export the probability of each leaf node and then aggregate.

So is there any way to do that?

nok commented 7 years ago

Hello @Xiadalei,

you can use sklearn.tree.DecisionTreeClassifier.predict_proba to get the predicted class probabilities in Python.

The method predict_proba will be integrated in a future release. But already today the internal software design could support different methods/templates (predict, predict_proba and predict_log_proba) of an algorithm.

Best, Darius

Xiadalei commented 7 years ago

Hi @nok, thanks for the response. What I mean is that sometimes we need the probability to be exported as well, so it can be used on other platforms independently. So what Python offers is somewhat beside the point.

Still feel good to see predict_proba to be integrated in future. Guess I can manually add the probability to the trees as a workaround.

nok commented 6 years ago

Hello @Xiadalei,

"Guess I can manually add the probability to the trees as a workaround."

Can you describe or show how you return the probability of each class? It would save me time.

Thanks, Darius

itsinder commented 6 years ago

@nok I also think this feature will add value (I needed it for the ExtraTreesClassifier).

If the leaf node consists of 10 samples of category 1, 50 samples of category 2, and 5 samples of category 3: The prediction probabilities are 10/(10 + 50 + 5) , 50/(10 + 50 + 5), and 5/(10 + 50 + 5) respectively

Hope that helps

itsinder commented 6 years ago

I should have mentioned this earlier: the prediction is simply the class with the most samples in the leaf.
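
The arithmetic described above can be sketched in Java as follows. This is only an illustration, not code generated by sklearn-porter: the class name LeafProbabilities and the helpers toProbabilities and argMax are made up for this example, and the counts are the ones from the 10/50/5 example.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of sklearn-porter.
class LeafProbabilities {

    // Divide each per-class sample count of a leaf by the leaf's total count.
    static double[] toProbabilities(int[] counts) {
        double total = 0;
        for (int c : counts) {
            total += c;
        }
        double[] probs = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            // Assumption: an empty leaf falls back to a uniform distribution.
            probs[i] = (total == 0) ? 1.0 / counts.length : counts[i] / total;
        }
        return probs;
    }

    // Index of the largest value; ties go to the lowest index.
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] leafCounts = {10, 50, 5}; // category 1, 2 and 3
        double[] probs = toProbabilities(leafCounts);
        System.out.println(Arrays.toString(probs));
        System.out.println("predicted class index: " + argMax(probs)); // index 1, i.e. category 2
    }
}
```

The predicted class is the argmax of the probabilities, which is the same as the class with the most samples in the leaf.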

nok commented 6 years ago

Thanks, that's really simple. I have that extension on my todo list.

itsinder commented 6 years ago

Awesome, I look forward to being able to use it

nicholasc commented 6 years ago

Hey guys! Any updates on this?

crea-psfc commented 6 years ago

Hi @nok! Any update on this issue? Incorporating probabilities would have a cascade effect on all the ensemble classifiers, e.g. RandomForest and AdaBoost. It would be really cool to have such a feature!

nok commented 6 years ago

Hello @nicholasc, hello @crea-psfc,

I noticed all your comments and questions. Please bear with me: the year started with a lot of duties and tasks for me, and my free evenings are rare. Please wait or create a pull request with a working implementation. Then I just have to create the test cases to verify the results.

Darius

nicholasc commented 6 years ago

I manually implemented probabilities in the exported model for now. I may, at some point, try to submit a pull request for automating this.

Do we want this to work alongside the default predict function, or should it replace the default inference with a probability output?
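
One way the two could coexist, as a rough sketch only: keep predict and derive it from the probabilities by taking the argmax. The interface ProbabilisticClassifier and the method names predictProba and predict below are hypothetical and not part of sklearn-porter, and the sketch does not check whether this matches the currently exported predict in every case.

```java
// Hypothetical interface for illustration; not part of sklearn-porter.
interface ProbabilisticClassifier {

    // Class probabilities for one sample; an exported model would implement this.
    double[] predictProba(double[] features);

    // Default prediction: index of the most probable class (ties go to the lowest index).
    default int predict(double[] features) {
        double[] probs = predictProba(features);
        int best = 0;
        for (int i = 1; i < probs.length; i++) {
            if (probs[i] > probs[best]) {
                best = i;
            }
        }
        return best;
    }
}

class PredictDemo {
    public static void main(String[] args) {
        // Stub model that always returns the same made-up probabilities.
        ProbabilisticClassifier stub = features -> new double[]{0.2, 0.7, 0.1};
        System.out.println(stub.predict(new double[]{1.0, 2.0})); // prints 1
    }
}
```

With this layout, existing callers of predict keep working, and callers that need the probabilities call predictProba directly.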

raosudhir commented 6 years ago

Hello Darius and others!

Thanks for this awesome sklearn-porter project.
I was considering using it to port my scenario, but I need support for the 'predict_proba()' method of the RandomForestClassifier. We do some post-processing on the predicted probabilities for each class, so without them it'd be a no-go for us.

Is there a timeline for implementation of this feature?

JanTkacik commented 5 years ago

This is my code for predict_proba in RandomForestClassifier. It is not very nice and not tested yet, but it seems to work the same way as in Python. When I have more time, I will try to implement it properly for sklearn-porter.

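```java
import com.google.gson.Gson;
import com.google.gson.reflect.TypeToken;

import java.io.File;
import java.io.FileNotFoundException;
import java.lang.reflect.Type;
import java.util.List;
import java.util.Scanner;

class RandomForestClassifier {
    // One decision tree, deserialized from JSON. Declared static so that Gson
    // can instantiate it without an enclosing RandomForestClassifier instance.
    private static class Tree {
        private int[] childrenLeft;
        private int[] childrenRight;
        private double[] thresholds;
        private int[] indices;      // feature index tested at each node
        private double[][] classes; // per-node class counts

        // Walk from the given node down to a leaf and return the leaf's class probabilities.
        private double[] predict_proba(double[] features, int node) {
            // -2 is the value scikit-learn stores as the threshold of a leaf node.
            if (this.thresholds[node] != -2) {
                if (features[this.indices[node]] <= this.thresholds[node]) {
                    return this.predict_proba(features, this.childrenLeft[node]);
                } else {
                    return this.predict_proba(features, this.childrenRight[node]);
                }
            }

            return normalizeNodeClasses(node);
        }

        // Normalize the class counts of a node so that they sum to 1.
        private double[] normalizeNodeClasses(int node) {
            int classCount = this.classes[node].length;
            double[] result = new double[classCount];
            double sum = 0;
            for (int i = 0; i < classCount; i++) {
                sum += this.classes[node][i];
            }
            if (sum == 0) {
                // No samples in this node: fall back to a uniform distribution.
                for (int i = 0; i < classCount; i++) {
                    result[i] = 1.0 / classCount;
                }
            } else {
                for (int i = 0; i < classCount; i++) {
                    result[i] = this.classes[node][i] / sum;
                }
            }
            return result;
        }

        // Convenience overload: start the traversal at the root (node 0).
        private double[] predict_proba(double[] features) {
            return this.predict_proba(features, 0);
        }
    }

    private List<Tree> forest;
    private int nClasses;
    private int nEstimators;

    // Load the forest from a JSON file containing a list of trees.
    public RandomForestClassifier(String file) throws FileNotFoundException {
        String jsonStr;
        // try-with-resources closes the file once it has been read completely.
        try (Scanner scanner = new Scanner(new File(file))) {
            jsonStr = scanner.useDelimiter("\\Z").next();
        }
        Gson gson = new Gson();
        Type listType = new TypeToken<List<Tree>>(){}.getType();
        this.forest = gson.fromJson(jsonStr, listType);
        this.nEstimators = this.forest.size();
        this.nClasses = this.forest.get(0).classes[0].length;
    }

    // Average the per-tree class probabilities over all trees in the forest.
    public double[] predict_proba(double[] features) {
        double[] classes = new double[this.nClasses];
        for (int i = 0; i < this.nEstimators; i++) {
            double[] treeResult = this.forest.get(i).predict_proba(features, 0);
            for (int j = 0; j < this.nClasses; j++) {
                classes[j] += treeResult[j];
            }
        }
        for (int j = 0; j < this.nClasses; j++) {
            classes[j] /= this.nEstimators;
        }
        return classes;
    }
}
```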
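
The leaf lookup used above can be sanity-checked without Gson or a JSON file on a tiny hand-built tree. The sketch below is only an illustration: the class StumpProbaCheck and all array values are made up (one split on feature 0 at 0.5, two leaves), and the traversal is written as a loop instead of recursion. The arrays follow the same layout as the Tree fields above, with -2 as the leaf threshold.

```java
import java.util.Arrays;

// Hypothetical one-split tree for illustration; all values are invented.
class StumpProbaCheck {
    // Node 0 is the root, nodes 1 and 2 are leaves.
    static final int[] CHILDREN_LEFT = {1, -1, -1};
    static final int[] CHILDREN_RIGHT = {2, -1, -1};
    static final double[] THRESHOLDS = {0.5, -2, -2};
    static final int[] INDICES = {0, -2, -2};
    static final double[][] CLASSES = {{5, 5}, {4, 1}, {1, 4}};

    static double[] predictProba(double[] features) {
        int node = 0;
        // Descend until a leaf (threshold -2) is reached.
        while (THRESHOLDS[node] != -2) {
            node = features[INDICES[node]] <= THRESHOLDS[node]
                    ? CHILDREN_LEFT[node]
                    : CHILDREN_RIGHT[node];
        }
        // Normalize the leaf's class counts.
        double sum = 0;
        for (double count : CLASSES[node]) {
            sum += count;
        }
        double[] result = new double[CLASSES[node].length];
        for (int i = 0; i < result.length; i++) {
            result[i] = CLASSES[node][i] / sum;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(predictProba(new double[]{0.3}))); // [0.8, 0.2]
        System.out.println(Arrays.toString(predictProba(new double[]{0.9}))); // [0.2, 0.8]
    }
}
```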

nok commented 5 years ago

Thanks @JanTkacik, I will test, refactor and write unit tests for it. It's a good foundation.

JanTkacik commented 5 years ago

I have also ported this code to Scala. The Scala code is tested and currently used in production, so far with no issues. I can post the code here or somewhere else (or make a PR for Scala integration) if anybody is interested.

gauravsawant commented 5 years ago

It is simple to get the probabilities of each class. Check the code generated by porter.export(). The function findMax assigns the class label based on the maximum number of samples in the leaf node; by tweaking it, the probabilities for the individual classes can be obtained as well:

```c
int findMax(int nums[N_CLASSES]) {
    int index = 0;
    for (int i = 0; i < N_CLASSES; i++) {
        index = nums[i] > nums[index] ? i : index;
    }
    float sum = nums[0] + nums[1];
    printf("Probability for Class 1 = %lf\n", nums[0] / sum);
    printf("Probability for Class 2 = %lf\n", nums[1] / sum);

    return index;
}
```

The above code is for C and there were 2 classes.

nok commented 5 years ago

Thanks to all. JFYI, after a huge refactoring with many improvements, I have started to adapt the predict_proba method to provide class probabilities. It's still in progress.

ttux commented 5 years ago

Thank you for this project. Very useful. I also need the probabilities. I transpiled a model (ExtraTreesClassifier) to Java, but all I ever get is 1, when what I am after is what I get in Python from:

prediction = clf.predict_proba(features)
prob = prediction[0][1]

I have tried to come up with something by modifying the generated Java class, based on the various comments here, but so far without success. So now I am looking at doing it in sklearn_porter itself. Looking at the code @JanTkacik shared for RandomForestClassifier, I deduced that I should modify sklearn_porter/estimator/classifier/ExtraTreesClassifier/templates/java/exported_class.txt, but my changes don't seem to be reflected after I install from source (pip2 install -e my_path) and transpile the model again. Am I on the wrong path?

Edit: In the end I managed to modify the generated class based on the comments of @itsinder, but I am still interested in getting the porter to do it.

jilljack32 commented 3 years ago

@nok - Thank you for this wonderful project. Any updates on predict_proba support? It would be very helpful.