microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[Feature] R package #19

Closed kirillseva closed 7 years ago

kirillseva commented 8 years ago

Was actually surprised to see that you haven't got one already, given that MS owns Revolution Analytics.

Granted, some of the multi-node features will be hard to integrate into R, but OpenMP-backed single node training looks promising too and I'd like to compare it to xgboost in our production workflow.

Let me know if you guys have internal plans for building an R package or whether you're waiting for the community to pitch in.

guolinke commented 8 years ago

We plan to build an R package, and we also welcome contributions from the community. This feature will be finished soon, since we have received many requests for it.

BTW, can you share your ideas and requirements for the R package? Thanks.

Allardvm commented 8 years ago

Developing the package itself may not even be necessary, although it would certainly help adoption. The Julia XGBoost library uses XGBoost's convenient shared library bindings, and it would be great to have something similar to use for a Julia wrapper to LightGBM (although C++ is a bit harder to interface with from Julia).

mxbi commented 8 years ago

There's a pull request for a basic R wrapper here: #18. I've also made a Python wrapper myself - if there's interest I can make a pull request?

But it would be great if there was some way to directly generate the binary format from e.g. a NumPy array, as right now having to write to CSV and read from it in LightGBM is a slow solution.
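To illustrate the point about CSV versus a binary handoff, here is a minimal sketch using only the Python standard library. The binary layout (a flat row-major buffer of C doubles plus the dimensions) is an assumption chosen for illustration; it is not LightGBM's actual binary dataset format, and the function names are hypothetical.

```python
import array
import csv
import io


def to_csv_text(rows):
    """Serialize rows as CSV text, the route a command-line wrapper has to take."""
    out = io.StringIO()
    csv.writer(out).writerows(rows)
    return out.getvalue()


def from_csv_text(text):
    """Parse CSV text back into floats; every value is re-parsed from a string."""
    return [[float(cell) for cell in row] for row in csv.reader(io.StringIO(text))]


def to_binary(rows):
    """Pack rows into one contiguous row-major buffer of C doubles (assumed layout)."""
    flat = array.array("d", (value for row in rows for value in row))
    return flat.tobytes(), len(rows), len(rows[0])


def from_binary(blob, nrow, ncol):
    """Rebuild the rows from the raw buffer; no text parsing is involved."""
    flat = array.array("d")
    flat.frombytes(blob)
    return [list(flat[i * ncol:(i + 1) * ncol]) for i in range(nrow)]


rows = [[0.1, 2.5, -3.0], [4.0, 1e-9, 6.25]]
csv_text = to_csv_text(rows)
blob, nrow, ncol = to_binary(rows)
```

Both routes recover the same values here, but the binary route copies fixed-width doubles instead of formatting and re-parsing every number as text, which is where the CSV round trip tends to lose time on large data.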

kirillseva commented 8 years ago

@mxbi it's not very serious :) things like https://github.com/Microsoft/LightGBM/pull/18/files#diff-fea194ac8c906aa3b18f13b0672ac4e8R54 https://github.com/Microsoft/LightGBM/pull/18/files#diff-fea194ac8c906aa3b18f13b0672ac4e8R68 can make one worried :/

R has a very good interface for interacting with C++ code: all you need to do is bundle the R package and the C++ code together in one repo (unfortunately) and write a couple of Rcpp functions that take a data.frame or matrix as input, convert it to a format acceptable to LightGBM, train a model, and make it accessible for predictions. The DMLC folks would know more about making this happen and maintaining such a project than I would.

@guolinke as for the requirements I think it's pretty standard:

Additional brownie points go for

I look forward to your initial package release and would love to help with R wrappers! If you're indeed that much faster than xgboost with comparable accuracy that would mean a lot to us 👍

Allardvm commented 8 years ago

Yes, transmitting data efficiently is also the first thing that comes to my mind.

In general, it would be useful to have a way of interfacing with LightGBM during training iterations (i.e. pausing after an iteration and waiting for further commands). This should be enough to develop quite efficient and customizable K-fold cross-validation tools, without having LightGBM write its model to disk (or serializing) after each iteration and loading the model and data again for the next.

@kirillseva summarizes the most important requirements quite nicely. This is also pretty much what XGBoost offers with its shared libraries.
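The iteration-level interface described above can be sketched as follows. Everything here is hypothetical: `ToyBooster` is a pure-Python stand-in (least-squares boosting with one-dimensional stumps) for a native booster handle, and `update`, `predict` and `cross_validate` are illustrative names, not LightGBM functions. The point is only that once the library returns control after each iteration, a K-fold driver with early stopping needs no model files on disk.

```python
class ToyBooster:
    """Hypothetical stand-in for a native booster: 1-D stumps, squared error."""

    def __init__(self, xs, ys, learning_rate=0.5):
        self.xs, self.ys = xs, ys
        self.learning_rate = learning_rate
        self.stumps = []  # each stump is (threshold, left_value, right_value)

    def predict(self, x):
        return sum(lv if x <= t else rv for t, lv, rv in self.stumps)

    def update(self):
        """Run exactly one boosting iteration, then return control to the caller."""
        residuals = [y - self.predict(x) for x, y in zip(self.xs, self.ys)]
        best = None
        for t in sorted(set(self.xs)):
            left = [r for x, r in zip(self.xs, residuals) if x <= t]
            right = [r for x, r in zip(self.xs, residuals) if x > t]
            lv = sum(left) / len(left)
            rv = sum(right) / len(right) if right else 0.0
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, t, lv, rv)
        _, t, lv, rv = best
        self.stumps.append((t, self.learning_rate * lv, self.learning_rate * rv))


def mse(booster, xs, ys):
    return sum((booster.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def cross_validate(xs, ys, k=3, max_iter=20, patience=3):
    """Step one booster per fold in lockstep and stop when the mean score stalls."""
    boosters, held_out = [], []
    for fold in (list(range(i, len(xs), k)) for i in range(k)):
        test = set(fold)
        train = [i for i in range(len(xs)) if i not in test]
        boosters.append(ToyBooster([xs[i] for i in train], [ys[i] for i in train]))
        held_out.append(([xs[i] for i in fold], [ys[i] for i in fold]))
    history, best_iter = [], 0
    for it in range(max_iter):
        for booster in boosters:
            booster.update()  # all models stay in memory between iterations
        score = sum(mse(b, vx, vy) for b, (vx, vy) in zip(boosters, held_out)) / k
        history.append(score)
        if score < history[best_iter]:
            best_iter = it
        elif it - best_iter >= patience:
            break
    return best_iter + 1, history


xs = [float(i) for i in range(12)]
ys = [0.0 if x < 6 else 10.0 for x in xs]
best_rounds, history = cross_validate(xs, ys)
```

The driver only ever calls `update()` and `predict()`, so the same loop would work unchanged against any booster object that exposes those two operations.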

chivee commented 8 years ago

There are so many common needs between the R and Python wrappers. I'm not familiar with R, but I guess we should export our training logic through an extern C API, which can be used by most functional programming languages.

@ycdoit maybe you can share some ideas with us. @guolinke, we can start from a general Model/Data container so that it can be easily ported from other tools.
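As a rough illustration of what calling such an extern C API looks like from a host language, the sketch below uses Python's `ctypes`. The "native" function is a Python callback wrapped in a C function prototype, purely as a stand-in so the example runs without a compiled library; a real binding would load a shared library with `ctypes.CDLL` instead. The function name, its signature and the status-code convention are assumptions, not LightGBM's API.

```python
import ctypes

# Assumed C signature (hypothetical):
#   int column_means(const double* data, int nrow, int ncol, double* out);
COLUMN_MEANS = ctypes.CFUNCTYPE(
    ctypes.c_int,                     # status code: 0 on success, -1 on error
    ctypes.POINTER(ctypes.c_double),  # data: flat row-major matrix
    ctypes.c_int,                     # nrow
    ctypes.c_int,                     # ncol
    ctypes.POINTER(ctypes.c_double),  # out: caller-allocated, length ncol
)


def _column_means(data, nrow, ncol, out):
    """Stand-in for native code: works on raw pointers, as C code would."""
    if nrow <= 0 or ncol <= 0:
        return -1
    for j in range(ncol):
        out[j] = sum(data[i * ncol + j] for i in range(nrow)) / nrow
    return 0


column_means = COLUMN_MEANS(_column_means)


def matrix_column_means(rows):
    """Wrapper-side glue: flatten the matrix, call the C-style function, check status."""
    nrow, ncol = len(rows), len(rows[0])
    flat = (ctypes.c_double * (nrow * ncol))(*[v for row in rows for v in row])
    out = (ctypes.c_double * ncol)()
    status = column_means(flat, nrow, ncol, out)
    if status != 0:
        raise RuntimeError("native call failed with status %d" % status)
    return list(out)


means = matrix_column_means([[1.0, 2.0], [3.0, 6.0]])
```

Because the boundary only involves plain pointers, integers and a status code, the same function could be called from R, Python or Julia with a thin wrapper in each language.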

kirillseva commented 8 years ago

@chivee this is what you'd typically use for tying together C++ classifiers with R shell

yanyachen commented 8 years ago

Besides the features that @kirillseva already mentioned, there are some others on my mind that might be useful for modeling, and these features may not be limited to the R package.

I think @fakyras and @Far0n can contribute more ideas on this

Laurae2 commented 8 years ago

For R currently I have implemented:

Currently it works using CSV files and uses tons of arguments (and a lot of "hacky ways" of doing things) which is not a good solution but it does its job properly. I plan to add more in the near future (param list, SVMLight format for sparse matrices, feature interaction finder...), but it is clearly not an appropriate long term solution.

It would be great if we had a proper way to output a data.table / data.frame / matrix / dgCMatrix (sparse matrix, column compressed) to a binary format acceptable to LightGBM, in memory, and even better if via Rcpp (so plugging into the internals should be "easier").

fwrite vs saveRDS timing difference is very large (if not extreme). I remember exporting a 150GB data.table to CSV on a PCI-E SSD (2GB+/s) and training a very small model faster than just storing the table using saveRDS alone. But clearly the best I/O wise would be all in memory to benefit from RAM speed.
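For readers unfamiliar with the column-compressed sparse layout mentioned above, here is a minimal sketch of it. The three arrays mirror the `x` (values), `i` (zero-based row indices) and `p` (column pointers) slots of a dgCMatrix; the helper names are hypothetical, and how a native library would consume these arrays is left open.

```python
def dense_to_csc(rows):
    """Convert a dense row-major matrix to compressed sparse column arrays."""
    nrow, ncol = len(rows), len(rows[0])
    x, i, p = [], [], [0]
    for col in range(ncol):
        for row in range(nrow):
            value = rows[row][col]
            if value != 0.0:
                x.append(value)  # nonzero value
                i.append(row)    # its zero-based row index
        p.append(len(x))         # where the next column starts in x and i
    return x, i, p, (nrow, ncol)


def csc_column(x, i, p, col, nrow):
    """Expand one column back to dense form using only the three arrays."""
    dense = [0.0] * nrow
    for k in range(p[col], p[col + 1]):
        dense[i[k]] = x[k]
    return dense


rows = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [3.0, 0.0, 4.0],
]
x, i, p, shape = dense_to_csc(rows)
```

Since all entries of a column sit next to each other in `x` and `i`, a per-feature scan can walk one column without touching the rest of the matrix.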

guolinke commented 8 years ago

@Laurae2 Good job, thank you.
We will expose interfaces for the Python/R packages soon, especially interfaces for in-memory data conversion.

dselivanov commented 8 years ago

IMHO we can work on an R package only after the core developers expose a C/C++ API for Python/R. Solutions which wrap the command line interface are not flexible/fast/interactive.

bwilbertz commented 7 years ago

Hi everybody.

I've bundled my Rcpp bindings for LightGBM into a small R package.

Have a look at RLightGBM and enjoy testing :-)

ekerazha commented 7 years ago

bwilbertz did a good job, unfortunately it lacks several features (e.g. early stopping with eval set).

guolinke commented 7 years ago

The official R-package is under development now. I have finished the basic wrappers for Dataset and Booster. Currently, it can handle lower-level training (not tested), and I will finish the high-level interface soon.

However, this is my first time using R, so I would like to ask for suggestions and comments. Check out https://github.com/Microsoft/LightGBM/tree/r-package and feel free to give comments. Thanks!

dselivanov commented 7 years ago

@guolinke took a look at lightgbm_R.cpp. Why are you not using Rcpp? It is much easier to work with and maintain. License?

guolinke commented 7 years ago

@dmahugh I wanted to try Rcpp at first, but it seems to have many issues. To provide a stable package, I went back to the low-level R extension API, and working with R extensions is not so heavy either. However, the current implementation is based on R6, which also has some problems, so I may change Dataset and Booster to C++ objects.

I forgot to add a license; I will add one soon.

dselivanov commented 7 years ago

@guolinke I thought you were not using Rcpp because of its GPL v2 license compared to the MIT license of LightGBM. In fact Rcpp is very stable and heavily tested by the 800+ packages which rely on it...

guolinke commented 7 years ago

@dselivanov I was wrong. Using Rcpp requires linking to its code (header files), so it still needs a GPL license. http://softwareengineering.stackexchange.com/questions/254737/does-an-rcpp-dependent-package-require-a-gpl-license https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-FAQ.pdf

And using R extensions also requires linking to GPL code: https://stat.ethz.ch/pipermail/r-help/2011-July/283188.html

guolinke commented 7 years ago

Due to the license issue, I will write a simple ctypes-like function for R objects to avoid including R's header files. I have deleted the r-package branch for now and will add it back when this part is finished.

dselivanov commented 7 years ago

@guolinke I personally think that GPL>=2 will be ok... R itself is GPL v2, so I guess it shouldn't be a problem for people who use R.

guolinke commented 7 years ago

@dselivanov We have a strict policy on using GPL licenses... I have also made some progress on accessing raw R objects in C, and will be back soon.

guolinke commented 7 years ago

Finished almost all features except CV (will add soon). You are welcome to try it and open PRs to refine it.

guolinke commented 7 years ago

Closing this. Please open new issues if you need other features in the R-package.

I would also like to invite one or two community members to help maintain the R-package, including refining the R code/documentation, releasing the package on CRAN, and resolving issues/PRs related to R. Feel free to contact me if you would like to do it.