[Feature] R package

kirillseva commented 8 years ago

Was actually surprised to see that you haven't got one already, given that MS owns Revolution Analytics.

Granted, some of the multi-node features will be hard to integrate into R, but OpenMP-backed single-node training looks promising too, and I'd like to compare it to xgboost in our production workflow.

Let me know if you guys have internal plans for building an R package or whether you're waiting for the community to pitch in.

guolinke commented 8 years ago

We plan to build an R package, and we also welcome community contributions. This feature should be finished soon, since we have received many requests for it.

BTW, can you share your ideas and requirements for the R package? Thanks.

Allardvm commented 8 years ago

Developing the package itself may not even be necessary, although it would certainly help adoption. The Julia XGBoost library uses XGBoost's convenient shared library bindings, and it would be great to have something similar to use for a Julia wrapper to LightGBM (although C++ is a bit harder to interface with from Julia).

mxbi commented 8 years ago

There's a pull request for a basic R wrapper here: #18. I've also made a Python wrapper myself; if there's interest, I can make a pull request.

But it would be great if there was some way to directly generate the binary format from e.g. a NumPy array, as right now having to write to CSV and read from it in LightGBM is a slow solution.
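To illustrate the gap between the CSV round trip and a direct binary hand-off, here is a minimal sketch in Python using only the standard library. The binary layout (two `uint32` dimensions followed by row-major `float64` values) is a hypothetical one chosen for illustration; it is not LightGBM's actual binary format, and the function names are made up.

```python
import csv
import io
import struct


def to_csv_bytes(rows):
    """Serialize rows of floats as CSV text (the slow path: format, then re-parse)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue().encode("ascii")


def to_binary(rows):
    """Pack rows into a HYPOTHETICAL layout: n_rows, n_cols (uint32), then float64 values."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [value for row in rows for value in row]
    return struct.pack(f"<II{n_rows * n_cols}d", n_rows, n_cols, *flat)


def from_binary(blob):
    """Read the hypothetical layout back into a list of rows."""
    n_rows, n_cols = struct.unpack_from("<II", blob, 0)
    flat = struct.unpack_from(f"<{n_rows * n_cols}d", blob, 8)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


data = [[0.1, 2.0, 3.5], [4.0, 5.25, 6.0]]
blob = to_binary(data)
restored = from_binary(blob)  # exact float64 round trip, no text parsing
csv_blob = to_csv_bytes(data)
```

The binary path copies the raw doubles, so values survive unchanged and no number formatting or parsing is needed on either side.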

kirillseva commented 8 years ago

@mxbi it's not very serious :) things like https://github.com/Microsoft/LightGBM/pull/18/files#diff-fea194ac8c906aa3b18f13b0672ac4e8R54 https://github.com/Microsoft/LightGBM/pull/18/files#diff-fea194ac8c906aa3b18f13b0672ac4e8R68 can make one worried :/

R has a very good interface for interacting with C++ code. All you need to do is bundle the R package and the C++ code together in one repo (unfortunately) and write a couple of Rcpp functions that take a data.frame or matrix as input, convert it to a format acceptable to LightGBM, train a model, and make it accessible for predictions. The DMLC folks would know more about making this happen and maintaining such a project than I would.

@guolinke as for the requirements, I think they're pretty standard:

be able to train a classifier by passing in native R objects (data.frames, matrices, strings and stuff)
serialize the classifier (ideally saveRDS(cls, output_file) would just work, but lightGBM::save_booster(cls, output_file) to save in your custom format is fine too)
be able to predict on a data.frame or matrix and return numeric scores
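The three requirements above (train from native objects, serialize, predict numeric scores) can be sketched as a tiny interface. This is a Python stand-in under loud assumptions: `ToyClassifier`, `train`, `save_booster`, and `load_booster` are hypothetical names, the "model" is a single threshold rather than a boosted ensemble, and JSON stands in for whatever custom format a real wrapper would use.

```python
import json
import os
import tempfile


class ToyClassifier:
    """Hypothetical stand-in for a trained booster: one threshold on one feature."""

    def __init__(self, feature, threshold):
        self.feature = feature
        self.threshold = threshold

    def predict(self, rows):
        # Return one numeric score per input row.
        return [1.0 if row[self.feature] > self.threshold else 0.0 for row in rows]


def train(rows, labels, feature=0):
    """Toy training: threshold at the midpoint of the two class means.

    Assumes the positive class has the larger feature values.
    """
    pos = [row[feature] for row, y in zip(rows, labels) if y == 1]
    neg = [row[feature] for row, y in zip(rows, labels) if y == 0]
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return ToyClassifier(feature, threshold)


def save_booster(model, path):
    """Write the model state explicitly instead of relying on generic object dumps."""
    with open(path, "w") as handle:
        json.dump({"feature": model.feature, "threshold": model.threshold}, handle)


def load_booster(path):
    with open(path) as handle:
        state = json.load(handle)
    return ToyClassifier(state["feature"], state["threshold"])


rows = [[1.0], [2.0], [8.0], [9.0]]
labels = [0, 0, 1, 1]
model = train(rows, labels)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    save_booster(model, path)
    restored = load_booster(path)

scores = restored.predict([[3.0], [7.0]])
```

An explicit save function is the safer default when the model lives behind a native handle, since a generic serializer generally cannot capture memory owned by C++ code.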

Additional brownie points go for

feature importance
partial dependency plots

I look forward to your initial package release and would love to help with R wrappers! If you're indeed that much faster than xgboost with comparable accuracy that would mean a lot to us 👍

Allardvm commented 8 years ago

Yes, transmitting data efficiently is also the first thing that comes to my mind.

In general, it would be useful to have a way of interfacing with LightGBM during training iterations (i.e. pausing after an iteration and waiting for further commands). This should be enough to develop quite efficient and customizable K-fold cross-validation tools, without having LightGBM write its model to disk (or serialize it) after each iteration and load the model and data again for the next.

@kirillseva summarizes the most important requirements quite nicely. This is also pretty much what XGBoost offers with its shared libraries.
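A rough sketch of that idea, in Python with a toy booster: each fold keeps its own in-memory model, `update()` runs exactly one iteration and returns control, and the caller scores the held-out fold between iterations. `StumpBooster` and `cross_validate` are hypothetical names; the booster fits single-split stumps to residuals on one feature and is not LightGBM's algorithm.

```python
class StumpBooster:
    """Toy squared-error booster on one feature (hypothetical, for illustration)."""

    def __init__(self, xs, ys, learning_rate=0.5):
        self.xs, self.ys = xs, ys
        self.learning_rate = learning_rate
        self.stumps = []  # each entry: (threshold, left_value, right_value)

    def predict_one(self, x):
        return sum(left if x <= t else right for t, left, right in self.stumps)

    def update(self):
        """Run exactly one boosting iteration, then hand control back to the caller."""
        residuals = [y - self.predict_one(x) for x, y in zip(self.xs, self.ys)]
        best = None
        for t in sorted(set(self.xs))[:-1]:  # needs at least two distinct x values
            left = [r for x, r in zip(self.xs, residuals) if x <= t]
            right = [r for x, r in zip(self.xs, residuals) if x > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, t, lm, rm)
        _, t, lm, rm = best
        self.stumps.append((t, self.learning_rate * lm, self.learning_rate * rm))


def mse(booster, xs, ys):
    return sum((y - booster.predict_one(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k=3, rounds=10):
    """Advance k in-memory boosters in lockstep; return mean held-out MSE per round."""
    folds = [list(range(i, len(xs), k)) for i in range(k)]
    boosters = []
    for held_out in folds:
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        boosters.append(StumpBooster([xs[i] for i in train_idx],
                                     [ys[i] for i in train_idx]))
    history = []
    for _ in range(rounds):
        scores = []
        for booster, held_out in zip(boosters, folds):
            booster.update()  # one iteration, no disk I/O, no reload of data
            scores.append(mse(booster,
                              [xs[i] for i in held_out],
                              [ys[i] for i in held_out]))
        history.append(sum(scores) / k)
    return history


xs = [float(i) for i in range(12)]
ys = [0.0 if x < 6 else 10.0 for x in xs]
history = cross_validate(xs, ys)
```

Because the caller sees a score after every round, it can decide when to stop or which round to keep without the library writing anything to disk.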

chivee commented 8 years ago

There are so many common needs between the R and Python wrappers. I'm not familiar with R, but I guess we should export our training logic through an extern C API, which can be used by most functional programming languages.

@ycdoit maybe you can share some ideas with us. @guolinke, we can start from a general Model/Data container so that it can be easily ported from other tools.
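As a small illustration of why a plain C interface travels well across languages, the sketch below calls an existing C function from Python through `ctypes`. It uses `strlen` from the C standard library purely as a stand-in, because no LightGBM C API is specified in this thread; a training library would expose its own functions in the same style. It assumes a POSIX-like system where the C library can be loaded this way.

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, CDLL(None)
# exposes the symbols already loaded into the process on POSIX systems.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

length = libc.strlen(b"LightGBM")
```

The caller only needs the function name and its C signature, which is the property that lets one shared library serve R, Python, and Julia wrappers alike.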

kirillseva commented 8 years ago

@chivee this is what you'd typically use for tying together C++ classifiers with R shell

yanyachen commented 8 years ago

Besides the features that @kirillseva already mentioned, there are some others on my mind that might be useful for modeling, and these features need not be limited to the R package.

Callback function system (for both cv and training)
Prediction with leaf index
Model Parsing to plain text: split, gain, cover
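A callback system of the kind listed above might look like the following Python sketch. Everything here is hypothetical: `train`, `early_stopping`, and `record_to` are illustrative names, and the `evaluate` argument stands in for "fit one more tree and score it on an evaluation set".

```python
def train(n_rounds, evaluate, callbacks=()):
    """Generic boosting loop that calls every callback after each round.

    A callback receives (iteration, score) and returns True to request a stop.
    """
    history = []
    for iteration in range(n_rounds):
        score = evaluate(iteration)
        history.append(score)
        # Build a list first so every callback runs, even if an earlier one asks to stop.
        if any([callback(iteration, score) for callback in callbacks]):
            break
    return history


def early_stopping(patience):
    """Stop once the score (lower is better) has not improved for `patience` rounds."""
    best = {"score": float("inf"), "iteration": -1}

    def callback(iteration, score):
        if score < best["score"]:
            best["score"], best["iteration"] = score, iteration
        return iteration - best["iteration"] >= patience

    callback.best = best
    return callback


def record_to(log):
    """Append (iteration, score) to a caller-owned list; never requests a stop."""
    def callback(iteration, score):
        log.append((iteration, score))
        return False

    return callback


fake_losses = [0.9, 0.7, 0.6, 0.65, 0.66, 0.7, 0.5]  # made-up validation losses
log = []
stopper = early_stopping(patience=2)
history = train(len(fake_losses), lambda i: fake_losses[i], [record_to(log), stopper])
```

With these made-up losses the best round is index 2, and training stops two rounds later, so the later value 0.5 is never reached. The same hook works for cross-validation by passing the mean fold score as `score`.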

I think @fakyras and @Far0n can contribute more ideas on this

Laurae2 commented 8 years ago

For R currently I have implemented:

Training + validation + prediction (using regular or distributed)
(repeated) Cross-validation + prediction
Output metric table
Feature importance + Plot
Cross-validated feature importance + Plot

Currently it works using CSV files and uses tons of arguments (and a lot of "hacky ways" of doing things), which is not a good solution, but it does its job properly. I plan to add more in the near future (param list, SVMLight format for sparse matrices, feature interaction finder...), but it is clearly not an appropriate long-term solution.

It would be great if we got a proper way to output a data.table / data.frame / matrix / dgCMatrix (sparse matrix, column compressed) to a binary format acceptable to LightGBM, in memory, and even better if via Rcpp (so plugging into the internals should be "easier").
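For readers unfamiliar with the column-compressed layout that dgCMatrix uses, here is a minimal Python sketch of the three arrays involved: non-zero values, their 0-based row indices, and column pointers (the `x`, `i`, and `p` slots in R's Matrix package). The function names are made up for illustration, and nothing here is LightGBM code.

```python
def dense_to_csc(matrix):
    """Convert a list of rows into CSC arrays: (values, row_indices, col_pointers)."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    values, row_indices, col_pointers = [], [], [0]
    for col in range(n_cols):
        for row in range(n_rows):
            value = matrix[row][col]
            if value != 0:
                values.append(value)
                row_indices.append(row)
        # Entries for this column occupy values[col_pointers[col]:col_pointers[col + 1]].
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


def csc_column(values, row_indices, col_pointers, col, n_rows):
    """Rebuild one dense column from the CSC arrays."""
    column = [0.0] * n_rows
    for k in range(col_pointers[col], col_pointers[col + 1]):
        column[row_indices[k]] = values[k]
    return column


dense = [
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 3.0],
    [4.0, 0.0, 0.0],
]
values, row_indices, col_pointers = dense_to_csc(dense)
third_column = csc_column(values, row_indices, col_pointers, 2, n_rows=3)
```

Since each column's entries are contiguous, a consumer can scan a single feature without touching the rest of the matrix.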

The fwrite vs saveRDS timing difference is very large (if not extreme). I remember exporting a 150GB data.table to CSV on a PCI-E SSD (2GB+/s) and training a very small model faster than just storing the table using saveRDS alone. But clearly the best option I/O-wise would be to keep everything in memory to benefit from RAM speed.

guolinke commented 8 years ago

@Laurae2 Good job, thank you.
We will expose interfaces for the Python/R packages soon, especially interfaces for in-memory data conversion.

dselivanov commented 8 years ago

IMHO we can work on the R package only after the core developers expose a C/C++ API for Python/R. Solutions that wrap the command-line interface are not flexible/fast/interactive.

bwilbertz commented 7 years ago

Hi everybody.

I've bundled my Rcpp bindings for LightGBM into a small R package.

Have a look at RLightGBM and enjoy testing :-)

ekerazha commented 7 years ago

bwilbertz did a good job; unfortunately, it lacks several features (e.g. early stopping with an eval set).