nok / sklearn-porter

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
BSD 3-Clause "New" or "Revised" License

Decision tree classifier porter C code predicting index of classes not actual class #37

Open vijaykilledar opened 6 years ago

vijaykilledar commented 6 years ago

Attaching a training-data CSV file whose first column is the target class to predict. I generated a pickle file, converted it to C code with the sklearn-porter command line, and ran it. The C code returns the index of the class, while Python's predict() returns the actual class label.

Attaching the training CSV file and the pickle file: csv_and_pickle_file.zip

nok commented 5 years ago

Yes, you are right. In general, the estimators in all target languages return the index of the resulting label (y), because reimplementing the label mapping in each programming language would add overhead. Nevertheless, I have noted this requirement for a future release.
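Until such a release, the mapping back to the original label can be done on the Python side via the classifier's classes_ attribute, since the ported code's index refers into that (sorted) array. A minimal sketch, assuming scikit-learn is installed:

```python
# The transpiled code returns an index into clf.classes_, so the caller
# can map it back to the original label.
from sklearn.tree import DecisionTreeClassifier

X = [[0], [1], [2]]
y = [10, 20, 30]           # non-consecutive class labels
clf = DecisionTreeClassifier().fit(X, y)

idx = 2                    # index as returned by the generated C/Java/JS code
label = clf.classes_[idx]  # 30, the actual class label
```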

vijaykilledar commented 5 years ago

For the JSON export option, could we add an array of class names to the JSON data?

HTCode commented 5 years ago

A hack for this would be to include in the exported JSON data a few labelled training samples ((x_i, y_i), ...) from each class for which we know the Python classifier predicts the class correctly (i.e. the most confident training samples from each class). Then, in the target language, one can match the indices returned by clf.predict(x_i) to their actual labels y_i.
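A hypothetical sketch of this hack (the names `anchor_samples`, `build_index_to_label`, and the stub predict function are illustrative, not part of sklearn-porter): ship one confidently-classified sample per class alongside the exported model, then recover the index-to-label mapping in the target environment by predicting those anchors once at load time.

```python
# label -> one feature vector the Python classifier predicts correctly
anchor_samples = {10: [0.1], 20: [1.0], 30: [2.2]}

def build_index_to_label(predict, anchors):
    """predict(x) stands in for the ported model and returns a class index."""
    return {predict(x): label for label, x in anchors.items()}

# stand-in for the transpiled predict(): maps each anchor to its class index
stub_predict = lambda x: {0.1: 0, 1.0: 1, 2.2: 2}[x[0]]
index_to_label = build_index_to_label(stub_predict, anchor_samples)
# index_to_label == {0: 10, 1: 20, 2: 30}
```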

skjerns commented 5 years ago

I just came across this problem as well:

In my case, my classes range over [1,2,3,4,5], but there is no example of class 2 in the training set. As a result, the C version of my random forest outputs classes [1,2,3,4], where 2, 3, 4 are actually 3, 4, 5. Is there any way to prevent this, or are there ideas for fixing it without tampering with the C code?

I have a semi-production pipeline where some classes are occasionally missing from the training set, and I would be glad to have a way to correct this automatically without manually inserting class labels into the C code.

(see also https://github.com/BayesWitnesses/m2cgen/issues/77 where I outline this problem in more detail)
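The shift described above can be reproduced in a few lines (assuming scikit-learn is installed): when a class never appears in the training data, clf.classes_ contains only the observed labels, and the indices emitted by the ported code run over that shorter array.

```python
# class 2 is missing from the training data, so classes_ has only 4 entries
from sklearn.ensemble import RandomForestClassifier

X = [[1], [3], [4], [5]]
y = [1, 3, 4, 5]
clf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)
print(list(clf.classes_))  # [1, 3, 4, 5] -> index 1 means label 3, not 2
```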

skjerns commented 5 years ago

I solved it temporarily by writing a small wrapper:

It embeds a conversion function like this into the generated C code:

int idx2label(int class_idx) { 
    int labels[5] = {0,2,3,4,5}; // your original ints
    return labels[class_idx];
}
The Python wrapper that performs the patching:

import sklearn_porter
def save_model_sklearn_porter(clf, file):
    """
    Saves an sklearn model which keeps the original class IDs, even if they are not consecutive.     
    """
    porter = sklearn_porter.Porter(clf, language='C')
    output = porter.export(embed_data=True)

    # collect the class labels from the classifier; only ints are supported so far
    labels = [str(int(i)) for i in clf.classes_]

    # create new label code and conversion function
    labels_code = 'int labels[{}] = {{{}}}'.format(len(labels), ','.join(labels))
    convert_func = '\n\nint idx2label(int class_idx) { \n' +\
                   '    {};\n    return labels[class_idx];\n}}\n\n'.format(labels_code)

    # insert the function right after the last preprocessor line (#include etc.)
    lines = output.splitlines()
    position = 0
    for idx, line in enumerate(lines): 
        if line.strip().startswith('#'): position=idx
    lines.insert(position+1, convert_func)
    output = '\n'.join(lines)

    # replace the last occurrence of `return class_idx` with the label conversion call;
    # reversing the string with [::-1] lets us treat the last occurrence as if it were the first
    output = output[::-1].replace('return class_idx'[::-1], 'return idx2label(class_idx)'[::-1], 1)[::-1]

    with open(file, 'w') as f:
        f.write(output)
    return output
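As an aside, the reversed-string trick for replacing only the last occurrence can also be written with str.rfind, which some may find easier to read. A small equivalent sketch:

```python
def replace_last(s, old, new):
    """Replace only the last occurrence of `old` in `s`."""
    pos = s.rfind(old)
    if pos == -1:
        return s
    return s[:pos] + new + s[pos + len(old):]

code = "return class_idx; /* ... */ return class_idx;"
patched = replace_last(code, "return class_idx", "return idx2label(class_idx)")
# only the second occurrence is replaced
```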