nok / sklearn-porter

Transpile trained scikit-learn estimators to C, Java, JavaScript and others.
BSD 3-Clause "New" or "Revised" License

Decision tree: C code exported by porter uses the wrong data type for the features array; it should be float #43

Open vijaykilledar opened 5 years ago

vijaykilledar commented 5 years ago

The C code exported by porter uses the wrong data type for the feature values: it declares them as double, which reduces prediction accuracy.

scikit-learn code

def predict(self, X, check_input=True):
    """Predict class or regression value for X.

    For a classification model, the predicted class for each sample in X is
    returned. For a regression model, the predicted value based on X is
    returned.

    Parameters
    ----------
    X : array-like or sparse matrix of shape = [n_samples, n_features]
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.
    check_input : boolean, (default=True)
        Allow to bypass several input checking.
        Don't use this parameter unless you know what you do.

    Returns
    -------
    y : array of shape = [n_samples] or [n_samples, n_outputs]
        The predicted classes, or the predict values.
    """

porter C Code:

int main(int argc, const char * argv[]) {
    /* Features: */
    double features[argc-1];
    int i;
    for (i = 1; i < argc; i++) {
        features[i-1] = atof(argv[i]);
    }

    /* Prediction: */
    printf("%d", {method_name}(features, 0));
    return 0;
}
nok commented 5 years ago

Can you please provide some data and code for comparison?

(My guess is that the difference comes from the gap between the internal and the textual representation of values in Python.)

vijaykilledar commented 5 years ago

OK, I will provide a detailed example and data tomorrow.

vijaykilledar commented 5 years ago

Attaching a zip file that contains:

  1. A C program trained on 10,000 records that accepts features as float
  2. A C program trained on 10,000 records that accepts features as double
  3. The shell script used to count the matched predictions of the above binaries
  4. The test data set file
  5. The expected prediction data file porter_attachments.zip
  6. The CSV file used for training (first column is the target class, the remaining columns are the features) train_10000.zip

Test script output at my end:

./test_prediction.sh ./train_10000 ./train_10000_target ./porter_train_10000_double
test data file - test_data/train_10000
expected prediction data file - test_data/train_10000_target
testing output binary by feeding training data .......
Total records - 10000
Matched prediction records - 9878

./test_prediction.sh ./train_10000 ./train_10000_target ./porter_train_10000_float
test data file - test_data/train_10000
expected prediction data file - test_data/train_10000_target
testing output binary by feeding training data .......
Total records - 10000
Matched prediction records - 9992
nok commented 5 years ago

Okay, thanks. Can you please validate the data type of your training data?

print(type(X[0]))  # <type 'numpy.float32'> or <type 'numpy.float64'>

For load_digits it's numpy.float64, which is double in C. The integrity check finished without mismatches. Then I changed the data to floats with X.astype(np.float32) and ran the integrity check again, also without errors.

Nevertheless, it depends on the data. In general I see the problem of floating-point precision differing between data types and programming languages. It could make sense to add a way to change the features' data type in the transpiled output via a new argument, e.g. temp_dtype='float'.

Furthermore, atof() converts a string to double in C. If you want to use floats, you should use strtof() instead to convert strings to float.

Can you test it?