rcppmlpack / rcppmlpack2

Rcpp Interface to mlpack (version 2.1.0 and up)
GNU General Public License v2.0
24 stars 9 forks source link

API commonalities ? #12

Open eddelbuettel opened 7 years ago

eddelbuettel commented 7 years ago

How do we want the top-level functions?

Even in the two uses case in the gallery we have one with test data (the naive Bayes classifier) and one without. Should be always have a Rcpp::Nullable<> test set?

MHenderson commented 7 years ago

Personally, I would like to be able to train models without having to also provide a test set. The Rcpp::Nullable<> test set approach seems to be pretty reasonable to me as long as there is another function that can be used to score test data with previously trained models (like predict).

eddelbuettel commented 7 years ago

Yes, we will have to work out about how to do that. In general, in something like

// [[Rcpp::export]]
arma::irowvec naiveBayesClassifier(const arma::mat& train,
                                   const arma::mat& test,
                                   const arma::irowvec& labels,
                                   const int& classes) {

    // MLPACK wants Row<size_t> which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments>
    mlpack::naive_bayes::NaiveBayesClassifier<> nbc(train, labelsur, classes);

    nbc.Classify(test, resultsur);

    arma::irowvec results = arma::conv_to<arma::irowvec>::from(resultsur);

    return results;
}

it is not yet obvious to me how to return the whole nbc object back to R so the user can return it with fresh data for a test. We can try via Rcpp::XPtr, we'll need to see how much work it is. For the time being adding Rcpp::Nullable<> and skipping the test when no data is supplied is easier.

MHenderson commented 7 years ago

Yes, I agree. I think the Rcpp::Nullable<> approach is a good start.

MHenderson commented 7 years ago

Is the plan to have (roughly) a top-level function per MLPACK executable? http://mlpack.org/docs/mlpack-2.1.1/man.html

eddelbuettel commented 7 years ago

"Probably". We have enough existence proofs that this works. We don't "have to" wrap everything. Maybe the value proposition lies in providing basic code and infrastructure so that people can extend locally etc pp. More examples and tests is better than fewer. To be seen. Whether we go by executable, or main class, or ... is open. RQuantLib is over ten years old and still hasn't wrapped all that much. But people can extend locally as needed. But that is where the value added is. Open for suggestions.

eddelbuettel commented 7 years ago

This has been dormant for a bit but I plan to announce RcppMLPACK 'real soon now' -- maybe Monday. It all works well with MLPACK 2.2.0 for which I have Ubuntu binaries for 14.04 (used by Travis), 16.04 and 16.10 in my PPA. I'll update the README accordingly.

I may not get to writing new examples this weekend but if any of you has something pending...

coatless commented 7 years ago

Apologies for not seeing the ticket earlier.

I'm not sure we should construct a function interface with Rcpp::Nullable for the test data. I have a couple of reasons for this:

  1. separation of concerns from training to predicting e.g. lm() trains and predict() predicts.
  2. shifting amount of returned properties (e.g. 1 vs. 2)

c.f. /src/logisticRegression.cpp

mat <- t(trainSet[, -5])     ## train data, transpose and removing class labels
lab <- trainSet[, 5]         ## class labels for train set
logisticRegression(mat, lab)

testMat <- t(testSet[, -5])  ## test data
logisticRegression(mat, lab, testMat)

I feel like it may be slightly better to surface these classes with Modules. However, due to how many classes exist, this might cause a performance issue. So, perhaps RcppR6 should be used to surface the classes?


The other aspect is the naming of functions. Would it be better to mimic the mlpack's format of snake case or camelcase?

Given this, the top-level functions that should ship are:

eddelbuettel commented 7 years ago