rcppmlpack / rcppmlpack2

Rcpp Interface to mlpack (version 2.1.0 and up)
GNU General Public License v2.0

Rcpp Gallery write-up draft #10

Closed eddelbuettel closed 7 years ago

eddelbuettel commented 7 years ago

Poking @thirdwing @coatless @MHenderson : if you have a moment, can you read across this before I post it to the Rcpp Gallery? I have a two-day streak going on new posts, and the day off, so I figured I may as well :) In all seriousness, I think it is time to beat the drum a bit more for RcppMLPACK{1,2} and to sort out where we go with RcppMLPACK2.


title: "RcppMLPACK2 and the MLPACK Machine Learning Library"
author: "Dirk Eddelbuettel"
license: GPL (>= 2)
tags: machine_learning armadillo mlpack
summary: "RcppMLPACK2 brings access to MLPACK to R"

mlpack

mlpack is, to quote, a scalable machine learning library, written in C++, that aims to provide fast, extensible implementations of cutting-edge machine learning algorithms. It has been written by Ryan Curtin and others, and is described in two papers in BigLearning (2011) and JMLR (2013). mlpack uses Armadillo as the underlying linear algebra library, which, thanks to RcppArmadillo, is already a rather well-known library in the R ecosystem.

RcppMLPACK1

Qiang Kou has created the RcppMLPACK package on CRAN for easy-to-use integration of mlpack with R. It integrates the mlpack sources, and is, as a CRAN package, widely available on all platforms.

However, this RcppMLPACK package is also based on a by-now dated version of mlpack. Quoting again: mlpack provides these algorithms as simple command-line programs and C++ classes which can then be integrated into larger-scale machine learning solutions. Version 2 of the mlpack sources switched to a slightly more encompassing build also requiring the Boost libraries 'program_options', 'unit_test_framework' and 'serialization'. Within the context of an R package, we could condition out the first two as R provides both the direct interface (hence no need to parse command-line options) and also the testing framework. However, it would be both difficult and potentially undesirable to condition out the serialization which allows mlpack to store and resume machine learning tasks.

We refer to this version now as RcppMLPACK1.

RcppMLPACK2

As of February 2017, the current version of mlpack is 2.1.1. As it requires external linking with (some) Boost libraries as well as with Armadillo, we have created a new package RcppMLPACK2 inside a new GitHub organization RcppMLPACK.

Linux

This package works fine on Linux provided mlpack, Armadillo and Boost are installed.

OS X / macOS

For macOS / OS X, James Balamuta has tried to set up a homebrew recipe, but there are some tricky interactions between the compiler suites used by brew and by R on macOS.

Windows

For Windows, one could do what Jeroen Ooms has done and build (external) libraries. Volunteers are encouraged to get in touch via the issue tickets at GitHub.

Example: Logistic Regression

To illustrate mlpack, we show a first simple example that is also included in the package. As with the rest of the Rcpp Gallery, these are "live" code examples.


#include <RcppMLPACK.h>             // MLPACK, Rcpp and RcppArmadillo

#include <mlpack/methods/logistic_regression/logistic_regression.hpp>   // particular algorithm used here

// [[Rcpp::depends(RcppMLPACK)]]

// [[Rcpp::export]]
Rcpp::List logisticRegression(const arma::mat& train,
                              const arma::irowvec& labels,
                              const Rcpp::Nullable<Rcpp::NumericMatrix>& test = R_NilValue) {

    // MLPACK wants Row<size_t> which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments
    mlpack::regression::LogisticRegression<> lrc(train, labelsur);

    arma::vec parameters = lrc.Parameters();

    Rcpp::List return_val;

    if (test.isNotNull()) {
        arma::mat test2 = Rcpp::as<arma::mat>(test);
        lrc.Classify(test2, resultsur);
        arma::vec results = arma::conv_to<arma::vec>::from(resultsur);
        return_val = Rcpp::List::create(Rcpp::Named("parameters") = parameters,
                                        Rcpp::Named("results") = results);
    } else {
        return_val = Rcpp::List::create(Rcpp::Named("parameters") = parameters);
    }

    return return_val;

}

We can then call this function with the same (trivial) data set as used in the first unit test for it:

logisticRegression(matrix(c(1, 2, 3, 1, 2, 3), nrow=2, byrow=TRUE), c(1L, 1L, 0L))
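The TODO in logisticRegression about checking for non-negative labels matters because a negative value has no meaningful unsigned representation. As a minimal sketch in plain standard C++ (no Armadillo or mlpack needed), the check could look roughly like the following. The helper name toUnsignedLabels is hypothetical and not part of RcppMLPACK or mlpack.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of RcppMLPACK or mlpack): convert signed
// labels, as R supplies them, to the unsigned type mlpack expects,
// rejecting negative values before the conversion takes place.
std::vector<std::size_t> toUnsignedLabels(const std::vector<int>& labels) {
    std::vector<std::size_t> out;
    out.reserve(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (labels[i] < 0) {
            // Report the zero-based position of the offending label.
            throw std::invalid_argument("negative label at position " +
                                        std::to_string(i));
        }
        out.push_back(static_cast<std::size_t>(labels[i]));
    }
    return out;
}
```

Inside the Rcpp function itself, the same idea would presumably be a guard on the arma::irowvec just before the conv_to call, raising an R-level error via Rcpp::stop() rather than throwing a standard exception; that variant is an assumption and is not shown in the draft.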

Example: Naive Bayes Classifier

A second example shows the NaiveBayesClassifier class.

#include <RcppMLPACK.h>             // MLPACK, Rcpp and RcppArmadillo

#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>    // particular algorithm used here

// [[Rcpp::depends(RcppMLPACK)]]

// [[Rcpp::export]]
arma::irowvec naiveBayesClassifier(const arma::mat& train,
                                   const arma::mat& test,
                                   const arma::irowvec& labels,
                                   const int& classes) {

    // MLPACK wants Row<size_t> which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments
    mlpack::naive_bayes::NaiveBayesClassifier<> nbc(train, labelsur, classes);

    nbc.Classify(test, resultsur);

    arma::irowvec results = arma::conv_to<arma::irowvec>::from(resultsur);

    return results;
}

I also placed the (locally rendered) version here for now

eddelbuettel commented 7 years ago

Expanded the second example.

Example: Naive Bayes Classifier

A second example shows the NaiveBayesClassifier class.

#include <RcppMLPACK.h>             // MLPACK, Rcpp and RcppArmadillo

#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>    // particular algorithm used here

// [[Rcpp::depends(RcppMLPACK)]]

// [[Rcpp::export]]
arma::irowvec naiveBayesClassifier(const arma::mat& train,
                                   const arma::mat& test,
                                   const arma::irowvec& labels,
                                   const int& classes) {

    // MLPACK wants Row<size_t> which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments
    mlpack::naive_bayes::NaiveBayesClassifier<> nbc(train, labelsur, classes);

    nbc.Classify(test, resultsur);

    arma::irowvec results = arma::conv_to<arma::irowvec>::from(resultsur);

    return results;
}

We need a quick helper function to get test data, again mimicking the unit tests:

#include <RcppMLPACK.h>             // MLPACK, Rcpp and RcppArmadillo

#include <mlpack/methods/naive_bayes/naive_bayes_classifier.hpp>    // particular algorithm used here

// [[Rcpp::depends(RcppMLPACK)]]

// [[Rcpp::export]]
Rcpp::List getData(const char* trainFilename, const char* testFilename) {
    arma::mat trainData, testData;
    mlpack::data::Load(trainFilename, trainData, true); // note implicit transpose
    mlpack::data::Load(testFilename, testData, true);

    // Get the labels, then remove them from data
    arma::rowvec trainlabels = trainData.row(trainData.n_rows - 1);
    arma::rowvec testlabels = testData.row(testData.n_rows - 1);
    trainData.shed_row(trainData.n_rows - 1);
    // use testData's own row count: trainData has already lost its label row
    testData.shed_row(testData.n_rows - 1);
    return(Rcpp::List::create(Rcpp::Named("trainData")   = Rcpp::wrap(trainData),
                              Rcpp::Named("testData")    = Rcpp::wrap(testData),
                              Rcpp::Named("trainlabels") = trainlabels,
                              Rcpp::Named("testlabels")  = testlabels));
}
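The "implicit transpose" noted in getData reflects mlpack's convention that each column of a matrix is one observation, whereas a CSV file or an R data frame usually stores one observation per row. The 2 x 3 matrix passed to logisticRegression earlier follows the same convention: two features, three observations. As a rough illustration in plain standard C++, the re-layout amounts to the following; the Matrix alias and the function name observationsToColumns are made up for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative alias: a matrix stored as a vector of rows.
using Matrix = std::vector<std::vector<double>>;

// Turn a "one observation per row" layout (the usual CSV / R convention)
// into a "one observation per column" layout, which is what mlpack expects.
Matrix observationsToColumns(const Matrix& rows) {
    if (rows.empty()) return {};
    const std::size_t nObs = rows.size();
    const std::size_t nFeatures = rows[0].size();
    Matrix cols(nFeatures, std::vector<double>(nObs));
    for (std::size_t i = 0; i < nObs; ++i) {
        assert(rows[i].size() == nFeatures);   // input must be rectangular
        for (std::size_t j = 0; j < nFeatures; ++j) {
            cols[j][i] = rows[i][j];           // entry (i, j) moves to (j, i)
        }
    }
    return cols;
}
```

With Armadillo matrices the same step is simply a transpose, which is why data already held in R with observations in rows would need t() before being handed to these wrappers.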

Now we can fetch the data from R and use it to call the classifier:

rl <- getData("/home/edd/git/mlpack/src/mlpack/tests/data/trainSet.csv", # should add to RcppMLPACK2
              "/home/edd/git/mlpack/src/mlpack/tests/data/testSet.csv")
trainData <- rl[["trainData"]]
testData <- rl[["testData"]]
trainlabels <- rl[["trainlabels"]]
testlabels <- rl[["testlabels"]]
res <- naiveBayesClassifier(trainData, testData, trainlabels, 2)
## res was a rowvector but comes back as 1-row matrix
all.equal(res[1,],  testlabels)

As we can see, the computed classification on the test set corresponds to the expected classification in testlabels.

coatless commented 7 years ago

  1. macOS comment is accurate. Hopefully, that can be updated at a later time.
  2. May wish to give a short preview of the data for each example
    • matrix(c(1, 2, 3, 1, 2, 3), nrow=2, byrow=TRUE)
    • head(trainData) ...

eddelbuettel commented 7 years ago

Was fighting with the data and found that whole aspect ... cumbersome. mlpack is a little weird as it transposes.

I think next step is to bring that example data set into the package, with a help page etc pp. But not today.

MHenderson commented 7 years ago

Looks good. I'm happy to contribute more examples, if that would be helpful.

MHenderson commented 7 years ago

Personally, I'd like to see an example with the testing step done with a model that was previously trained and saved to disk. Like they do in the command-line tutorial: http://www.mlpack.org/docs/mlpack-2.1.1/doxygen.php?doc=lrtutorial.html#linreg_ex3_lrtut

eddelbuettel commented 7 years ago

Closing the ticket as the file was posted. Will file new issue on need for more docs etc pp