Open dancooke opened 6 years ago
Generally, I have no objections to an interface change. The current interface is mainly a consequence of the R version. By the way, the R version is far more widely used (including by me) and is my main development focus, so your efforts to improve the standalone C++ version are highly appreciated!
I agree with the problems you describe (though at least 1 and 2 don't apply to the R version). As long as a new interface works with Rcpp (I can make the changes related to that), I'm fine with a change.
Maybe related to this: It would also be nice to be able to compute permutation variable importance for existing forests.
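For reference, permutation importance on an already-trained model can be sketched roughly as below. This is a simplified illustration, not ranger's implementation: it assumes a held-out data set, mean squared error as the loss, and a model exposed as a plain callable. The names `Row`, `mean_squared_error` and `permutation_importance` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

using Row = std::vector<double>;  // hypothetical: one observation's features

// Mean squared error of `predict` over all rows.
template <typename Predictor>
double mean_squared_error(const Predictor& predict,
                          const std::vector<Row>& rows,
                          const std::vector<double>& response) {
    assert(rows.size() == response.size() && !rows.empty());
    double total = 0.0;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        const double diff = predict(rows[i]) - response[i];
        total += diff * diff;
    }
    return total / static_cast<double>(rows.size());
}

// Increase in error after shuffling one feature column.
// `rows` is taken by value so the caller's data is left untouched.
template <typename Predictor>
double permutation_importance(const Predictor& predict,
                              std::vector<Row> rows,
                              const std::vector<double>& response,
                              std::size_t column,
                              std::mt19937& rng) {
    const double baseline = mean_squared_error(predict, rows, response);

    // Pull out the column, shuffle it, and write it back.
    std::vector<double> values;
    values.reserve(rows.size());
    for (const Row& row : rows) values.push_back(row.at(column));
    std::shuffle(values.begin(), values.end(), rng);
    for (std::size_t i = 0; i < rows.size(); ++i) rows[i][column] = values[i];

    return mean_squared_error(predict, rows, response) - baseline;
}
```

Since the function only needs a trained predictor and a data set, it fits a design where prediction no longer depends on how the forest was constructed.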
@dancooke @mnwright This sounds like a great idea. Did this ever get any traction?
Not yet, unfortunately.
A well-organized C++ API would be really helpful, as it could be linked to CERN's ROOT package, which makes I/O and plotting of results trivial. Right now, the workflow requires converting all data to ASCII files, running ranger, merging the predictions back into the data, and converting the data back to binary before result plots can be created.
For what it's worth, I've been refactoring ranger as a part of my very old attempt at a multiple imputation package. (edit) The refactoring is now sitting in its own package: https://github.com/stephematician/literanger (/edit)
Some of the issues here are addressed, e.g.:
Some issues raised here are not addressed: e.g. I'm prevaricating on polymorphism (compile-time vs run-time).
I'm nowhere near the full feature set of ranger - but regardless of where I end up, my effort might be useful as a starting point for further refactoring. I switched to the cpp11 package for R as it has safer semantics than Rcpp.
I would like to suggest a fairly substantial change to ranger's main `Forest` interface. Currently, a `Forest` must essentially be constructed with all parameters and data, with the same interface being used for both training and prediction. This has three problems:

1. `Forest` is not reusable - if I want to predict a trained `Forest` on multiple data sets I need to construct a new `Forest` for each set (and pay the price of loading the `Forest` each time), or manually merge all my data - which may not be desirable or feasible.
2. `Forest` and `Data` are strongly coupled; since `Forest` explicitly stores the `Data` it will later use for training and prediction, it must store a pointer to the data and incur a virtual lookup for every data point access. However, `Data` has a common interface, and `Forest` doesn't depend on what type of data is actually used - so long as it satisfies ordering properties etc. Ideally, rather than `Data` being used as a polymorphic type, the underlying data would be used directly via a `template`ed method.

I think something along the lines of the following interface would address these points:
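As a rough illustration only (this is not the interface originally posted, and `MatrixData`, `StumpForest`, `train` and `predict` are hypothetical names, not ranger's API), a decoupled design might separate training from prediction and take the data as a template parameter. The toy model below is a single threshold on column 0, standing in for a real forest:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical data type: a dense row-major matrix. Any type exposing
// get(row, col) and num_rows() could be used in its place.
class MatrixData {
public:
    MatrixData(std::vector<double> values, std::size_t num_cols)
        : values_(std::move(values)), num_cols_(num_cols) {
        assert(num_cols_ > 0 && values_.size() % num_cols_ == 0);
    }
    double get(std::size_t row, std::size_t col) const {
        return values_[row * num_cols_ + col];
    }
    std::size_t num_rows() const { return values_.size() / num_cols_; }

private:
    std::vector<double> values_;
    std::size_t num_cols_;
};

// Toy stand-in for a forest. It owns no data: train() and predict()
// each receive the data they operate on, resolved at compile time.
class StumpForest {
public:
    template <typename DataT>
    void train(const DataT& data, const std::vector<double>& response) {
        const std::size_t n = data.num_rows();
        assert(n == response.size() && n > 0);

        // Split point: mean of column 0.
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += data.get(i, 0);
        threshold_ = sum / static_cast<double>(n);

        // Leaf values: mean response on each side of the split.
        double left_sum = 0.0, right_sum = 0.0;
        std::size_t left_n = 0, right_n = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (data.get(i, 0) <= threshold_) { left_sum += response[i]; ++left_n; }
            else                              { right_sum += response[i]; ++right_n; }
        }
        left_value_ = left_n ? left_sum / static_cast<double>(left_n) : 0.0;
        right_value_ = right_n ? right_sum / static_cast<double>(right_n) : left_value_;
        trained_ = true;
    }

    template <typename DataT>
    std::vector<double> predict(const DataT& data) const {
        assert(trained_);
        std::vector<double> out;
        out.reserve(data.num_rows());
        for (std::size_t i = 0; i < data.num_rows(); ++i)
            out.push_back(data.get(i, 0) <= threshold_ ? left_value_ : right_value_);
        return out;
    }

private:
    bool trained_ = false;
    double threshold_ = 0.0;
    double left_value_ = 0.0;
    double right_value_ = 0.0;
};
```

With this shape, a model trained once can call `predict` on any number of data sets, and each `data.get(i, 0)` call is bound at compile time rather than through a virtual lookup.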
What do you think? I can have a go at implementing this if you agree.