ml2cpp step 2 : C++ code design

antoinecarme commented 4 years ago

Need to have a complete specification for the following :

Test datasets (CSV file => C++ std::map)
Classification/Regression/Transformation models : C++ functions used to compute the scores
Classification/Regression/Transformation models : input/output datasets layouts

This spec should evolve when more and more models/features are added.

antoinecarme commented 4 years ago

See #1

antoinecarme commented 4 years ago

The automatically generated code is plain STL C++17. It is designed to keep a strong semantic mapping with the model and to allow auditing, debugging and reporting.

antoinecarme commented 4 years ago

The C++ code contains everything needed to compute the predicted values of the model. No external library is needed, and the code can be compiled for any target hardware platform using any standard C++17 compiler on the market.

antoinecarme commented 4 years ago

Typical generated code for a classification model :

source : https://github.com/antoinecarme/ml2cpp/blob/master/doc/LinearModels/ml2cpp_ridge_classifier_iris.ipynb

namespace  {

    std::vector<std::string> get_input_names(){
        std::vector<std::string> lFeatures = { "Feature_0", "Feature_1", "Feature_2", "Feature_3" };

        return lFeatures;
    }

    std::vector<std::any> get_classes(){
        std::vector<std::any> lClasses = { 0, 1, 2 };

        return lClasses;
    }

    std::vector<std::string> get_output_names(){
        std::vector<std::string> lOutputs = { 
            "Score_0", "Score_1", "Score_2",
            "Proba_0", "Proba_1", "Proba_2",
            "LogProba_0", "LogProba_1", "LogProba_2",
            "Decision", "DecisionProba" };

        return lOutputs;
    }

    tTable compute_classification_scores(std::any Feature_0, std::any Feature_1, std::any Feature_2, std::any Feature_3) {
        auto lClasses = get_classes();

        std::any score_0 = 0.12726862685332171 * Feature_0 + 0.47083648636124975 * Feature_1 + -0.445366165446255 * Feature_2 + -0.1212031740524525 * Feature_3 + -0.6974613707207697;

        std::any score_1 = -0.027607436590965116 * Feature_0 + -0.8779879015502388 * Feature_1 + 0.3719963526001637 * Feature_2 + -0.8328757056773156 * Feature_3 + 2.1132211020903813;

        std::any score_2 = -0.09966119026235667 * Feature_0 + 0.40715141518899145 * Feature_1 + 0.0733698128460983 * Feature_2 + 0.9540788797297515 * Feature_3 + -2.4157597313696253;

        tTable lTable;

        lTable["Score"] = { 
            score_0,
            score_1,
            score_2 
        } ;
        lTable["Proba"] = { 
            std::any(),
            std::any(),
            std::any() 
        } ;
        int lBestClass = get_arg_max( lTable["Score"] );
        auto lDecision = lClasses[lBestClass];
        lTable["Decision"] = { lDecision } ;
        lTable["DecisionProba"] = { lTable["Proba"][lBestClass] };

        recompute_log_probas( lTable );

        return lTable;
    }

    tTable compute_model_outputs_from_table( tTable const & iTable) {
        tTable lTable = compute_classification_scores(iTable.at("Feature_0")[0], iTable.at("Feature_1")[0], iTable.at("Feature_2")[0], iTable.at("Feature_3")[0]);

        return lTable;
    }

} // eof namespace
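The generated code above applies `*` and `+` directly to std::any values and calls get_arg_max, neither of which is defined in the snippet; presumably they come from a shared support header that is not shown in this thread. A minimal sketch of what such support code could look like, assuming every value holds a double or an int and that an empty std::any means "missing". The helper `to_double` and the operator overloads are hypothetical, not the actual ml2cpp implementation:

```cpp
#include <any>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <typeinfo>
#include <vector>

typedef std::vector<std::any> tAnyVector;

// Hypothetical helper: read a std::any as a double.
// An empty std::any is treated as a missing value (NaN).
inline double to_double(std::any const & iValue) {
    if (!iValue.has_value()) return std::nan("");
    if (iValue.type() == typeid(int)) return static_cast<double>(std::any_cast<int>(iValue));
    return std::any_cast<double>(iValue);  // throws std::bad_any_cast for other types
}

// coefficient * feature
inline std::any operator*(double iCoefficient, std::any const & iValue) {
    return std::any(iCoefficient * to_double(iValue));
}

// term + term
inline std::any operator+(std::any const & iLeft, std::any const & iRight) {
    return std::any(to_double(iLeft) + to_double(iRight));
}

// term + intercept
inline std::any operator+(std::any const & iLeft, double iRight) {
    return std::any(to_double(iLeft) + iRight);
}

// Index of the largest value, as used to pick the winning class.
inline int get_arg_max(tAnyVector const & iValues) {
    assert(!iValues.empty());
    std::size_t lBest = 0;
    for (std::size_t i = 1; i < iValues.size(); ++i) {
        if (to_double(iValues[i]) > to_double(iValues[lBest])) lBest = i;
    }
    return static_cast<int>(lBest);
}
```

With overloads like these in the global namespace, an expression such as `2.0 * Feature_0 + 0.5 * Feature_1 + -1.0` compiles as written in the generated code. Subtraction, division and comparison (used by the transformation and outlier examples) would need the same treatment.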

antoinecarme commented 4 years ago

std::any is used for all types of data (scores, probabilities, etc.). It is more generic and concise than std::variant, and it requires C++17.
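To make the trade-off concrete, here is a small side-by-side sketch. The names `describe_any`, `describe_variant` and `tValue` are made up for this illustration and are not part of ml2cpp. With std::any the set of stored types stays open and is checked at run time; with std::variant the list of alternatives has to be fixed at compile time.

```cpp
#include <any>
#include <string>
#include <variant>

// std::any: any copyable type can be stored; the type is checked at run time.
inline std::string describe_any(std::any const & iValue) {
    if (!iValue.has_value()) return "missing";
    if (auto lInt = std::any_cast<int>(&iValue)) return "int " + std::to_string(*lInt);
    if (auto lDouble = std::any_cast<double>(&iValue)) return "double " + std::to_string(*lDouble);
    if (auto lString = std::any_cast<std::string>(&iValue)) return "string " + *lString;
    return "other";
}

// std::variant: the allowed types are enumerated once, at compile time.
typedef std::variant<std::monostate, int, double, std::string> tValue;

inline std::string describe_variant(tValue const & iValue) {
    struct tVisitor {
        std::string operator()(std::monostate) const { return "missing"; }
        std::string operator()(int iInt) const { return "int " + std::to_string(iInt); }
        std::string operator()(double iDouble) const { return "double " + std::to_string(iDouble); }
        std::string operator()(std::string const & iString) const { return "string " + iString; }
    };
    return std::visit(tVisitor(), iValue);
}
```

The std::any version accepts a new type without touching any typedef, at the cost of a run-time failure (or the "other" branch here) when a type is not handled. The std::variant version rejects unknown types at compile time.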

A test dataset is a standard C++ map (tTable) that assigns to each column name a vector of std::any: all class scores are stored in one vector, all class probabilities in another, each feature in its own vector, etc.

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

An input dataset is a particular feature dataset (tTable).

A model output is also a particular dataset (tTable). Models can be chained by taking the output of the previous model as input.
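A minimal sketch of such chaining, assuming two generated models that each expose compute_model_outputs_from_table. The namespaces `scaler` and `regressor`, the function `compute_pipeline` and all constants are invented for this illustration; values are read with std::any_cast<double> to keep the sketch free of any support header.

```cpp
#include <any>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

namespace scaler {  // hypothetical first model: centers and scales one feature
    tTable compute_model_outputs_from_table(tTable const & iTable) {
        double lX = std::any_cast<double>(iTable.at("Feature_0")[0]);
        tTable lTable;
        lTable["Feature_0"] = { std::any((lX - 2.0) / 4.0) };  // illustrative constants
        return lTable;
    }
}

namespace regressor {  // hypothetical second model: linear model on the scaled feature
    tTable compute_model_outputs_from_table(tTable const & iTable) {
        double lX = std::any_cast<double>(iTable.at("Feature_0")[0]);
        tTable lTable;
        lTable["Estimator"] = { std::any(3.0 * lX + 1.0) };  // illustrative constants
        return lTable;
    }
}

// The output tTable of the scaler is the input tTable of the regressor.
inline tTable compute_pipeline(tTable const & iTable) {
    return regressor::compute_model_outputs_from_table(
        scaler::compute_model_outputs_from_table(iTable));
}
```

Chaining works here only because the output column names of the first model match the input names of the second, which is what get_output_names and get_input_names expose in the generated code.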

There is some kind of algebra on tTables. 'softmax' is a special operation that takes a tTable with scores and produces a tTable of probabilities. An average of tTables is a tTable (random forest tTable = mean(tTable output of trees)), etc. This algebra is to be extended as more and more complex models are added.
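Two of these operations can be sketched as follows, assuming every cell holds a double. The function names `softmax_table` and `mean_tables` are hypothetical. The softmax reads the "Score" column and fills "Proba" and "LogProba"; the mean averages several tTables cell by cell, as one would for the trees of a random forest.

```cpp
#include <algorithm>
#include <any>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

// Softmax over the "Score" column; returns a copy with "Proba" and "LogProba" set.
inline tTable softmax_table(tTable const & iTable) {
    tAnyVector const & lScores = iTable.at("Score");
    assert(!lScores.empty());
    // Subtract the maximum score before exponentiating, for numerical stability.
    double lMax = std::any_cast<double>(lScores[0]);
    for (auto const & lScore : lScores) lMax = std::max(lMax, std::any_cast<double>(lScore));
    double lSum = 0.0;
    for (auto const & lScore : lScores) lSum += std::exp(std::any_cast<double>(lScore) - lMax);
    tAnyVector lProbas, lLogProbas;
    for (auto const & lScore : lScores) {
        double lLogProba = std::any_cast<double>(lScore) - lMax - std::log(lSum);
        lLogProbas.push_back(lLogProba);
        lProbas.push_back(std::exp(lLogProba));
    }
    tTable lTable = iTable;
    lTable["Proba"] = lProbas;
    lTable["LogProba"] = lLogProbas;
    return lTable;
}

// Cell-by-cell mean of tTables that share the same columns and column lengths.
inline tTable mean_tables(std::vector<tTable> const & iTables) {
    assert(!iTables.empty());
    tTable lMean;
    for (auto const & [lName, lColumn] : iTables.front()) {
        tAnyVector lAverages;
        for (std::size_t i = 0; i < lColumn.size(); ++i) {
            double lSum = 0.0;
            for (auto const & lTable : iTables) lSum += std::any_cast<double>(lTable.at(lName)[i]);
            lAverages.push_back(lSum / iTables.size());
        }
        lMean[lName] = lAverages;
    }
    return lMean;
}
```

Both functions take tTables and return a tTable, so they compose with model outputs in the same way models compose with each other.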

antoinecarme commented 4 years ago

tTables can be read from and written to CSV files or database tables.

antoinecarme commented 4 years ago

For readability : Each model is a specific C++ namespace. Sub-models (in meta-models and ensembles) and layers in NNs are also namespaces. This also allows tens of separately generated models to be used in the same C++ program.

For readability : Use main algorithm steps with meaningful / human-friendly names (map code vocabulary and semantics to the model). The user should be able to validate/inspect/debug the model by looking at the C++ code.

TODO: check if there is a limit on the number of namespaces in the various compilers. A common random forest with 500 trees will generate C++ code with at least 500 namespaces. SQL allows this, so why not C++?

https://github.com/antoinecarme/sklearn2sql_heroku/blob/master/docs/WebService-RandomForest_512_Deploy.ipynb

TODO : check using classes instead of namespaces. A class IS a namespace.
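One argument for the class option, sketched under assumptions: a class with static member functions gives the same qualified-call syntax as a namespace, but unlike a namespace it can be passed as a template argument, so an ensemble can be written once over its sub-models. The types `tree_0` and `tree_1`, the function `average_models` and the constants below are invented for this illustration.

```cpp
#include <any>
#include <map>
#include <string>
#include <vector>

typedef std::vector<std::any> tAnyVector;
typedef std::map<std::string, tAnyVector> tTable;

// Two hypothetical sub-models written as classes with static member functions.
struct tree_0 {
    static tTable compute_model_outputs_from_table(tTable const & iTable) {
        double lX = std::any_cast<double>(iTable.at("Feature_0")[0]);
        tTable lTable;
        lTable["Estimator"] = { std::any(lX < 1.0 ? 10.0 : 20.0) };  // illustrative split
        return lTable;
    }
};

struct tree_1 {
    static tTable compute_model_outputs_from_table(tTable const & iTable) {
        double lX = std::any_cast<double>(iTable.at("Feature_0")[0]);
        tTable lTable;
        lTable["Estimator"] = { std::any(lX < 2.0 ? 14.0 : 30.0) };  // illustrative split
        return lTable;
    }
};

// Average the "Estimator" output of every sub-model (C++17 fold expression).
template <typename... tModels>
tTable average_models(tTable const & iTable) {
    static_assert(sizeof...(tModels) > 0, "at least one sub-model is required");
    double lSum = (std::any_cast<double>(
        tModels::compute_model_outputs_from_table(iTable).at("Estimator")[0]) + ...);
    tTable lTable;
    lTable["Estimator"] = { std::any(lSum / sizeof...(tModels)) };
    return lTable;
}
```

A call would look like `average_models<tree_0, tree_1>(lInput)`. Whether compilers handle a 500-element parameter pack better or worse than 500 namespaces is a separate question that this sketch does not answer.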

antoinecarme commented 4 years ago

The compiled code should not rely on any external library. C++ is enough to compute any machine learning model "by hand".

antoine@z600:/tmp$ ldd sklearn2sql_cpp_iris_RidgeClassifier_140045544887056.exe
        linux-vdso.so.1 (0x00007ffe7413a000)
        libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f8135758000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f8135614000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f81355fa000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f8135435000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f8135982000)

antoinecarme commented 4 years ago

Typical generated code for a regression model (even simpler) :

source : https://github.com/antoinecarme/ml2cpp/blob/master/doc/LinearModels/ml2cpp_ridge_regressor-boston.ipynb


namespace  {

    std::vector<std::string> get_input_names(){
        std::vector<std::string> lFeatures = { "Feature_0", "Feature_1", "Feature_2", "Feature_3", "Feature_4", "Feature_5", "Feature_6", "Feature_7", "Feature_8", "Feature_9", "Feature_10", "Feature_11", "Feature_12" };

        return lFeatures;
    }

    std::vector<std::string> get_output_names(){
        std::vector<std::string> lOutputs = { "Estimator" };

        return lOutputs;
    }

    tTable compute_regression(std::any Feature_0, std::any Feature_1, std::any Feature_2, std::any Feature_3, std::any Feature_4, std::any Feature_5, std::any Feature_6, std::any Feature_7, std::any Feature_8, std::any Feature_9, std::any Feature_10, std::any Feature_11, std::any Feature_12) {

        tTable lTable;

        std::any  lEstimator = -0.10222110133730666 * Feature_0 + 0.04773129624686468 * Feature_1 + -6.436208742578908e-05 * Feature_2 + 2.627820041255508 * Feature_3 + -11.121694375850694 * Feature_4 + 3.8789420030475736 * Feature_5 + -0.005439894300973365 * Feature_6 + -1.3800822175215268 * Feature_7 + 0.29004395043741604 * Feature_8 + -0.013003140540395218 * Feature_9 + -0.8831486448890916 * Feature_10 + 0.009736544133046856 * Feature_11 + -0.5359293002502585 * Feature_12 + 31.308451803397112;
        lTable[ "Estimator" ] = { lEstimator };

        return lTable;
    }

    tTable compute_model_outputs_from_table( tTable const & iTable) {
        tTable lTable = compute_regression(iTable.at("Feature_0")[0], iTable.at("Feature_1")[0], iTable.at("Feature_2")[0], iTable.at("Feature_3")[0], iTable.at("Feature_4")[0], iTable.at("Feature_5")[0], iTable.at("Feature_6")[0], iTable.at("Feature_7")[0], iTable.at("Feature_8")[0], iTable.at("Feature_9")[0], iTable.at("Feature_10")[0], iTable.at("Feature_11")[0], iTable.at("Feature_12")[0]);

        return lTable;
    }

} // eof namespace

antoinecarme commented 4 years ago

Typical generated code for a feature transformation :

source : https://github.com/antoinecarme/ml2cpp/blob/master/doc/Transformations/ml2cpp_transform_std_scaler_iris.ipynb

namespace  {

    std::vector<std::string> get_input_names(){
        std::vector<std::string> lFeatures = { "Feature_0", "Feature_1", "Feature_2", "Feature_3" };

        return lFeatures;
    }

    std::vector<std::string> get_output_names(){
        std::vector<std::string> lOutputs = { "Feature_0", "Feature_1", "Feature_2", "Feature_3" };

        return lOutputs;
    }

    tTable compute_features(std::any Feature_0, std::any Feature_1, std::any Feature_2, std::any Feature_3) {

        tTable lTable;

        lTable["Feature_0"] = { ( ( Feature_0 - 5.843333333333334 ) / 0.8253012917851409 ) };
        lTable["Feature_1"] = { ( ( Feature_1 - 3.0573333333333337 ) / 0.4344109677354946 ) };
        lTable["Feature_2"] = { ( ( Feature_2 - 3.7580000000000005 ) / 1.759404065775303 ) };
        lTable["Feature_3"] = { ( ( Feature_3 - 1.1993333333333336 ) / 0.7596926279021594 ) };

        return lTable;
    }

    tTable compute_model_outputs_from_table( tTable const & iTable) {
        tTable lTable = compute_features(iTable.at("Feature_0")[0], iTable.at("Feature_1")[0], iTable.at("Feature_2")[0], iTable.at("Feature_3")[0]);

        return lTable;
    }

} // eof namespace

antoinecarme commented 3 years ago

Closing

antoinecarme commented 3 years ago

Typical generated code for an outlier detection model (sklearn.covariance._elliptic_envelope.EllipticEnvelope) :

namespace  {

        std::vector<std::string> get_input_names(){
                std::vector<std::string> lFeatures = { "A", "B" };

                return lFeatures;
        }

        std::vector<std::string> get_output_names(){
                std::vector<std::string> lOutputs = { 
                        "AnomalyScore","OutlierIndicator" };

                return lOutputs;
        }
        tTable compute_outlier_scores(std::any A, std::any B) {
                std::any A_c = A - 0.0;

                std::any B_c = B - 0.0;

                std::any lMahalanobis = 4.000000000000003 * A_c * A_c + -6.000000000000005 * A_c * B_c + -6.000000000000004 * B_c * A_c + 10.000000000000009 * B_c * B_c;

                std::any lScore = -lMahalanobis -(-2.0000000000000018);

                tTable lTable;

                lTable["AnomalyScore"] = { lScore } ;
                lTable["OutlierIndicator"] = { ( lScore >= 0.0 ) ? 1 : -1 } ;

                return lTable;
        }

        tTable compute_model_outputs_from_table( tTable const & iTable) {
                tTable lTable = compute_outlier_scores(iTable.at("A")[0], iTable.at("B")[0]);

                return lTable;
        }

} // eof namespace
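For reference, the arithmetic in compute_outlier_scores can be written as a squared Mahalanobis distance followed by a threshold. Reading the four coefficients as the entries of an inverse covariance (precision) matrix P, the subtracted 0.0 values as the location, and -2.0 as the fitted offset is an interpretation of the generated constants (rounded here), not something stated in this thread:

```latex
\[
d^2(A, B) = (x - \mu)^{\top} P \,(x - \mu), \qquad
x = \begin{pmatrix} A \\ B \end{pmatrix}, \quad
\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \quad
P \approx \begin{pmatrix} 4 & -6 \\ -6 & 10 \end{pmatrix}
\]
\[
d^2(A, B) = 4A^2 - 6AB - 6BA + 10B^2 = 4A^2 - 12AB + 10B^2
\]
\[
\mathrm{AnomalyScore} = -d^2(A, B) - (-2) = 2 - d^2(A, B), \qquad
\mathrm{OutlierIndicator} =
\begin{cases}
 1 & \text{if } \mathrm{AnomalyScore} \ge 0 \\
-1 & \text{otherwise}
\end{cases}
\]
```

For example, the point (A, B) = (0, 0) gives d^2 = 0 and a score of 2, so it is flagged as an inlier (1), while (A, B) = (1, 0) gives d^2 = 4 and a score of -2, so it is flagged as an outlier (-1).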
