API commonalities ? - Githubissues

eddelbuettel commented 7 years ago

How do we want the top-level functions?

Even in the two uses case in the gallery we have one with test data (the naive Bayes classifier) and one without. Should be always have a Rcpp::Nullable<> test set?

MHenderson commented 7 years ago

Personally, I would like to be able to train models without having to also provide a test set. The Rcpp::Nullable<> test set approach seems to be pretty reasonable to me as long as there is another function that can be used to score test data with previously trained models (like predict).

eddelbuettel commented 7 years ago

Yes, we will have to work out about how to do that. In general, in something like

// [[Rcpp::export]]
arma::irowvec naiveBayesClassifier(const arma::mat& train,
                                   const arma::mat& test,
                                   const arma::irowvec& labels,
                                   const int& classes) {

    // MLPACK wants Row<size_t> which is an unsigned representation
    // that R does not have
    arma::Row<size_t> labelsur, resultsur;

    // TODO: check that all values are non-negative
    labelsur = arma::conv_to<arma::Row<size_t>>::from(labels);

    // Initialize with the default arguments.
    // TODO: support more arguments>
    mlpack::naive_bayes::NaiveBayesClassifier<> nbc(train, labelsur, classes);

    nbc.Classify(test, resultsur);

    arma::irowvec results = arma::conv_to<arma::irowvec>::from(resultsur);

    return results;
}

it is not yet obvious to me how to return the whole nbc object back to R so the user can return it with fresh data for a test. We can try via Rcpp::XPtr, we'll need to see how much work it is. For the time being adding Rcpp::Nullable<> and skipping the test when no data is supplied is easier.

MHenderson commented 7 years ago

Yes, I agree. I think the Rcpp::Nullable<> approach is a good start.

MHenderson commented 7 years ago

Is the plan to have (roughly) a top-level function per MLPACK executable? http://mlpack.org/docs/mlpack-2.1.1/man.html

eddelbuettel commented 7 years ago

"Probably". We have enough existence proofs that this works. We don't "have to" wrap everything. Maybe the value proposition lies in providing basic code and infrastructure so that people can extend locally etc pp. More examples and tests is better than fewer. To be seen. Whether we go by executable, or main class, or ... is open. RQuantLib is over ten years old and still hasn't wrapped all that much. But people can extend locally as needed. But that is where the value added is. Open for suggestions.

eddelbuettel commented 7 years ago

This has been dormant for a bit but I plan to announce RcppMLPACK 'real soon now' -- maybe Monday. It all works well with MLPACK 2.2.0 for which I have Ubuntu binaries for 14.04 (used by Travis), 16.04 and 16.10 in my PPA. I'll update the README accordingly.

I may not get to writing new examples this weekend but if any of you has something pending...

coatless commented 7 years ago

Apologies for not seeing the ticket earlier.

I'm not sure we should construct a function interface with Rcpp::Nullable for the test data. I have a couple of reasons for this:

separation of concerns from training to predicting e.g. lm() trains and predict() predicts.
shifting amount of returned properties (e.g. 1 vs. 2)

c.f. /src/logisticRegression.cpp

mat <- t(trainSet[, -5])     ## train data, transpose and removing class labels
lab <- trainSet[, 5]         ## class labels for train set
logisticRegression(mat, lab)

testMat <- t(testSet[, -5])  ## test data
logisticRegression(mat, lab, testMat)

I feel like it may be slightly better to surface these classes with Modules. However, due to how many classes exist, this might cause a performance issue. So, perhaps RcppR6 should be used to surface the classes?

The other aspect is the naming of functions. Would it be better to mimic the mlpack's format of snake case or camelcase?

Given this, the top-level functions that should ship are:

[ ] mlpack_adaboost: train and classify with AdaBoost, an ensembling classifier
[ ] mlpack_allknn: all k-nearest neighbor search with trees
[ ] mlpack_allkfn: all k-furthest neighbor search with trees
[ ] mlpack_allkrann: rank-approximate k-nearest neighbor search with trees
[ ] mlpack_approx_kfn: approximate k-furthest neighbor search
[ ] mlpack_cf: generate recommendations via collaborative filtering
[ ] mlpack_dbscan: DBSCAN clustering
[ ] mlpack_decision_stump: classify with a decision stump
[ ] mlpack_decision_tree: classify with a decision tree
[ ] mlpack_det: density estimation trees
[ ] mlpack_emst: calculate a Euclidean minimum spanning tree
[ ] mlpack_fastmks: perform fast max-kernel search with trees
[ ] mlpack_gmm_train: train a Gaussian mixture model
[ ] mlpack_gmm_generate: generate a random sequence from a GMM
[ ] mlpack_gmm_probability: calculate the probability of a set of points coming from a given GMM
[ ] mlpack_hmm_generate: generate observations from a hidden Markov model (HMM)
[ ] mlpack_hmm_loglik: calculate the log-likelihood of some observations from an HMM
[ ] mlpack_hmm_train: train a hidden Markov model (HMM)
[ ] mlpack_hmm_viterbi: find the most probable hidden states in an HMM for some observations
[ ] mlpack_hoeffding_tree: train and classify with Hoeffding trees, a streaming decision tree for very large datasets
[ ] mlpack_kernel_pca: perform kernel principal components analysis
[ ] mlpack_kfn: all k-furthest neighbor search with trees -[x] mlpack_kmeans: perform k-means clustering
[ ] mlpack_knn: all k-nearest neighbor search with trees
[ ] mlpack_krann: rank-approximate k-nearest neighbor search with trees
[ ] mlpack_lars: least-angle regression
[ ] mlpack_linear_regression: simple least-squares linear regression
[ ] mlpack_local_coordinate_coding: local coordinate coding
[x] mlpack_logistic_regression: train or classify with logistic regression
[ ] mlpack_lsh: approximate k-nearest neighbor search with locality-sensitive hashing
[ ] mlpack_mean_shift: mean shift clustering
[x] mlpack_nbc: train or classify with the naive Bayes classifier
[ ] mlpack_nca: neighborhood components analysis
[ ] mlpack_nmf: non-negative matrix factorization
[ ] mlpack_pca: principal components analysis
[ ] mlpack_perceptron: train or classify with a perceptron
[ ] mlpack_preprocess_binarize: binarize features of a dataset
[ ] mlpack_preprocess_imputer: impute missing values of a dataset
[ ] mlpack_preprocess_describe: generative descriptive statistics for a dataset
[ ] mlpack_preprocess_split: split a dataset into a training and test set
[ ] mlpack_radical: RADICAL (independent components analysis)
[ ] mlpack_range_search: range search with trees
[ ] mlpack_softmax_regression: train or classify with softmax regression
[ ] mlpack_sparse_coding: sparse coding with dictionary learning

eddelbuettel commented 7 years ago

Modules: I'd say no. It is somewhat stale technology. Works well on limited scale and scope, don't think I would do larger things with it today.
RcppR6: Essentially vaporware. Everybody likes it, author never bothered with CRAN. Sad.
Naming: I wouldn't worry. I don't see us writing all those functions anytime soon. Snakecase is evil, camelCase is better, and all I personally aim for is not to clash. Hence the recent renaming to kMeans.

rcppmlpack / rcppmlpack2

API commonalities ? #12