imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/
776 stars 193 forks source link

Interface refactor #304

Open dancooke opened 6 years ago

dancooke commented 6 years ago

I would like to suggest a fairly substantial change to rangers main Forest interface. Currently, a Forest must essentially be constructed with all parameters and data, with the same interface being used for both training and prediction. This has three problems:

  1. Forest is not reusable - if I want to predict a trained Forest on multiple data sets I need to construct a new Forest for each set (and pay the price of loading the Forest each time), or manually merge all my data - which may not be desirable or feasible.
  2. It is not good from an interface perspective since it does not separate areas of concern; training and prediction are coupled. If a user just wants to predict new data on a pre-trained forest, they shouldn't need to be concerned with any parameters to do with training, but the current interface enforces that.
  3. It prohibits potential performance optimisations since Forest and Data are strongly coupled; since Forest explicitly stores the Data it will later use for training and prediction, it must store a pointer to the data and incur a virtual lookup for every data point access. However, Data has a common interface, and Forest doesn't depend on what type of data is actually used - so long as it satisfies ordering properties etc. Ideally, rather than Data being used as a polymorphic type, the underlying data would be used directly via a templateed method.

I think something along the lines of the following interface would address these points:

class Forest
{
public:
    struct Parameters { /* general parameters (num trees etc) */ };

    // Construct a new forest from provided parameters
    Forest(Parameters params);

    // Construct a new forest from a saved forest 
    Forest(std::stream& is);

    // Train the forest on the provided input data;
    // the Forest is modified (no const).
    template <typename Matrix, typename Vector>
    void train(const Matrix& X, const Vector& y);

    // Predict new data and write results to y
    template <typename Matrix, typename Vector>
    Vector& predict(const Matrix& X, Vector& y) const;

    // Write the state of the Forest to the stream
    void serialise(std::ostream& os) const;

private:
    // implementation
};

What do you think? I can have a go at implementing this if you agree.

mnwright commented 6 years ago

Generally no objections to an interface change. The current interface is mainly due to the R version. Btw., the R version is far more used (including by myself) and my major focus in development. Thus, your efforts to improve the standalone C++ version are highly appreciated!

I agree with the problems you describe (though at least 1 and 2 don't apply to the R version). As long as a new interface is working with Rcpp (I can do the changes related to that) I'm fine with a change.

mnwright commented 6 years ago

Maybe related to this: It would also be nice to be able to compute permutation variable importance for existing forests.

dlong11 commented 5 years ago

@dancooke @mnwright This sounds like a great idea. Did this ever get any traction?

mnwright commented 5 years ago

Not yet, unfortunately.

rtlprmft commented 5 years ago

A well sorted C++ API would be really helpful, as it could be linked to CERN's root package which makes I/O and plotting of results really trivial. Right now, it requires converting all data to ASCII files, running ranger, merging the prediction back to the data, converting the data back to binary to be able to create result plots.

stephematician commented 1 year ago

For what it's worth, I've been refactoring ranger as a part of my very old attempt at a multiple imputation package. (edit) The refactoring is now sitting in its own package: https://github.com/stephematician/literanger (/edit)

Some of the issues here are addressed, e.g.:

  1. The training and prediction interfaces are separated.
  2. The Forest object is re-usable and is passed to/from R as an external pointer.
  3. Forest no longer stores Data as an object; I'm thinking about updating how it is passed around (currently it is passed around as a shared resource) - the template idea is nice.

Some issues raised here are not addressed: e.g. I'm prevaricating on polymorphism (compile-time vs run-time).

I'm nowhere near the full feature set of ranger - but regardless of where I end up, my effort might be useful as a starting point for further refactoring. I switched to the cpp11 package for R as it has safer semantics than Rcpp.