Open eddelbuettel opened 7 years ago
Personally, I would like to be able to train models without having to also provide a test set. The Rcpp::Nullable<>
test set approach seems to be pretty reasonable to me as long as there is another function that can be used to score test data with previously trained models (like predict
).
Yes, we will have to work out about how to do that. In general, in something like
// [[Rcpp::export]]
arma::irowvec naiveBayesClassifier(const arma::mat& train,
const arma::mat& test,
const arma::irowvec& labels,
const int& classes) {
// MLPACK wants Row<size_t> which is an unsigned representation
// that R does not have
arma::Row<size_t> labelsur, resultsur;
// TODO: check that all values are non-negative
labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);
// Initialize with the default arguments.
// TODO: support more arguments>
mlpack::naive_bayes::NaiveBayesClassifier<> nbc(train, labelsur, classes);
nbc.Classify(test, resultsur);
arma::irowvec results = arma::conv_to<arma::irowvec>::from(resultsur);
return results;
}
it is not yet obvious to me how to return the whole nbc
object back to R so the user can return it with fresh data for a test. We can try via Rcpp::XPtr
, we'll need to see how much work it is. For the time being adding Rcpp::Nullable<>
and skipping the test when no data is supplied is easier.
Yes, I agree. I think the Rcpp::Nullable<>
approach is a good start.
Is the plan to have (roughly) a top-level function per MLPACK executable? http://mlpack.org/docs/mlpack-2.1.1/man.html
"Probably". We have enough existence proofs that this works. We don't "have to" wrap everything. Maybe the value proposition lies in providing basic code and infrastructure so that people can extend locally etc pp. More examples and tests is better than fewer. To be seen. Whether we go by executable, or main class, or ... is open. RQuantLib is over ten years old and still hasn't wrapped all that much. But people can extend locally as needed. But that is where the value added is. Open for suggestions.
This has been dormant for a bit but I plan to announce RcppMLPACK 'real soon now' -- maybe Monday. It all works well with MLPACK 2.2.0 for which I have Ubuntu binaries for 14.04 (used by Travis), 16.04 and 16.10 in my PPA. I'll update the README accordingly.
I may not get to writing new examples this weekend but if any of you has something pending...
Apologies for not seeing the ticket earlier.
I'm not sure we should construct a function interface with Rcpp::Nullable
for the test data. I have a couple of reasons for this:
lm()
trains and predict()
predicts.c.f. /src/logisticRegression.cpp
mat <- t(trainSet[, -5]) ## train data, transpose and removing class labels
lab <- trainSet[, 5] ## class labels for train set
logisticRegression(mat, lab)
testMat <- t(testSet[, -5]) ## test data
logisticRegression(mat, lab, testMat)
I feel like it may be slightly better to surface these classes with Modules. However, due to how many classes exist, this might cause a performance issue. So, perhaps RcppR6
should be used to surface the classes?
The other aspect is the naming of functions. Would it be better to mimic the mlpack's format of snake case or camelcase?
Given this, the top-level functions that should ship are:
kMeans
.
How do we want the top-level functions?
Even in the two uses case in the gallery we have one with test data (the naive Bayes classifier) and one without. Should be always have a
Rcpp::Nullable<>
test set?