Closed kirillseva closed 7 years ago
We plan to build an R package, and we also welcome contributions from the community. This feature should be finished soon, since we have received many requests for it.
BTW, can you share your ideas and requirements for the R package? Thanks.
Developing the package itself may not even be necessary, although it would certainly help adoption. The Julia XGBoost library uses XGBoost's convenient shared library bindings, and it would be great to have something similar for a Julia wrapper to LightGBM (although C++ is a bit harder to interface with from Julia).
There's a pull request for a basic R wrapper here: #18. I've also made a Python wrapper myself; if there's interest I can make a pull request.
But it would be great if there was some way to directly generate the binary format from e.g. a NumPy array, as right now having to write to CSV and read from it in LightGBM is a slow solution.
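A minimal sketch of what such an in-memory handoff could look like from the Python side, using only the standard library. It assumes the native side would accept a row-major `double*` plus row and column counts; `as_c_matrix` and `column_sums` are hypothetical names for illustration and are not part of LightGBM.

```python
import array
import ctypes


def as_c_matrix(values, nrow, ncol):
    """Expose a flat row-major array.array('d') as a C double* without copying."""
    if len(values) != nrow * ncol:
        raise ValueError("buffer length does not match nrow * ncol")
    # from_buffer shares memory with `values`; nothing is written to disk or copied.
    c_buf = (ctypes.c_double * len(values)).from_buffer(values)
    ptr = ctypes.cast(c_buf, ctypes.POINTER(ctypes.c_double))
    return ptr, c_buf  # return c_buf too so the caller keeps the view alive


def column_sums(ptr, nrow, ncol):
    """Hypothetical stand-in for native code: reads the matrix through the raw pointer."""
    sums = [0.0] * ncol
    for r in range(nrow):
        for c in range(ncol):
            sums[c] += ptr[r * ncol + c]
    return sums


data = array.array("d", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])  # 3 rows x 2 columns
ptr, keepalive = as_c_matrix(data, nrow=3, ncol=2)
sums = column_sums(ptr, 3, 2)
```

Because the pointer aliases the original buffer, a later write to `data` is visible through `ptr`, which is the property that makes this cheaper than a CSV round trip.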
@mxbi it's not very serious :) things like https://github.com/Microsoft/LightGBM/pull/18/files#diff-fea194ac8c906aa3b18f13b0672ac4e8R54 https://github.com/Microsoft/LightGBM/pull/18/files#diff-fea194ac8c906aa3b18f13b0672ac4e8R68 can make one worried :/
R has a very good interface for interacting with C++ code. All you need to do is bundle the R package and the C++ code together in one repo (unfortunately), and write a couple of Rcpp functions that take a data.frame or matrix as input, convert it to a format acceptable for lightgbm, train a model and make it accessible for predictions. DMLC folks would know more about making this happen and maintaining such a project than I would.
@guolinke as for the requirements I think it's pretty standard:
(saveRDS(cls, output_file) would just work, but lightGBM::save_booster(cls, output_file) to save in your custom format is fine too)
Additional brownie points go for
I look forward to your initial package release and would love to help with R wrappers! If you're indeed that much faster than xgboost with comparable accuracy that would mean a lot to us 👍
Yes, transmitting data efficiently is also the first thing that comes to my mind.
In general, it would be useful to have a way of interfacing with LightGBM during training iterations (i.e. pausing after an iteration and waiting for further commands). This should be enough to develop quite efficient and customizable K-fold cross-validation tools, without having LightGBM write its model to disk (or serializing) after each iteration and loading the model and data again for the next.
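To make the idea concrete, here is a rough sketch of K-fold cross-validation driven one iteration at a time, with all fold models kept in memory. `ToyBooster` is a hypothetical stand-in (a tiny stump booster for squared loss), not the LightGBM API; the only assumption about the real interface is that it would offer something like `update()` for a single iteration and `predict()` in between.

```python
class ToyBooster:
    """Hypothetical booster that trains exactly one iteration per update() call."""

    def __init__(self, xs, ys, learning_rate=0.5):
        self.xs, self.ys = xs, ys  # needs at least two distinct x values
        self.learning_rate = learning_rate
        self.stumps = []  # (threshold, left_value, right_value)

    def predict(self, x):
        return sum(lv if x <= t else rv for t, lv, rv in self.stumps)

    def update(self):
        """Fit one depth-1 stump to the current residuals."""
        res = [y - self.predict(x) for x, y in zip(self.xs, self.ys)]
        best = None
        for t in sorted(set(self.xs))[:-1]:
            left = [r for x, r in zip(self.xs, res) if x <= t]
            right = [r for x, r in zip(self.xs, res) if x > t]
            lv, rv = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, t, lv, rv)
        _, t, lv, rv = best
        self.stumps.append((t, self.learning_rate * lv, self.learning_rate * rv))


def mse(booster, xs, ys):
    return sum((y - booster.predict(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


def cross_validate(xs, ys, k, max_rounds, patience):
    """Advance all fold models in lockstep and stop when the mean score stalls."""
    boosters, valid_sets = [], []
    for i in range(k):
        fold = list(range(i, len(xs), k))
        held_out = set(fold)
        train = [j for j in range(len(xs)) if j not in held_out]
        boosters.append(ToyBooster([xs[j] for j in train], [ys[j] for j in train]))
        valid_sets.append(([xs[j] for j in fold], [ys[j] for j in fold]))
    history, best_round, best_score = [], 0, float("inf")
    for rnd in range(1, max_rounds + 1):
        for b in boosters:
            b.update()  # one iteration per fold, nothing written to disk
        score = sum(mse(b, vx, vy) for b, (vx, vy) in zip(boosters, valid_sets)) / k
        history.append(score)
        if score < best_score:
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            break
    return best_round, history


xs = list(range(12))
ys = [0.0 if x < 6 else 1.0 for x in xs]
best_round, history = cross_validate(xs, ys, k=3, max_rounds=10, patience=2)
```

The point of the design is that the caller owns the loop: early stopping, custom metrics or per-iteration logging can all live in the wrapper language without reloading the model or the data.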
@kirillseva summarizes the most important requirement quite nicely. This is also pretty much what XGBoost offers with its shared libraries.
There are many common needs between the R and Python wrappers. I'm not familiar with R, but I guess we should export our training logic through an extern C API, which can be used from most programming languages.
@ycdoit maybe you can share some ideas with us. @guolinke, we can start from a general Model/Data container so that it can easily be ported from other tools.
@chivee this is what you'd typically use for tying together C++ classifiers with R shell
Besides the features that @kirillseva already mentioned, there are some others on my mind that might be useful for modeling, and these features may not be limited to the R package.
I think @fakyras and @Far0n can contribute more ideas on this
For R currently I have implemented:
Currently it works using CSV files and uses tons of arguments (and a lot of "hacky ways" of doing things) which is not a good solution but it does its job properly. I plan to add more in the near future (param list, SVMLight format for sparse matrices, feature interaction finder...), but it is clearly not an appropriate long term solution.
It would be great if we got a proper way to output a data.table / data.frame / matrix / dgCMatrix (sparse matrix, column compressed) to a binary format acceptable for LightGBM, in memory, and even better if via Rcpp (so plugging into the internals should be "easier").
fwrite vs saveRDS timing difference is very large (if not extreme). I remember exporting a 150GB data.table to CSV on a PCI-E SSD (2GB+/s) and training a very small model faster than just storing the table using saveRDS alone. But clearly the best I/O wise would be all in memory to benefit from RAM speed.
@Laurae2 Good job, thank you.
We will expose interfaces for the Python/R packages soon, especially interfaces for in-memory data conversion.
IMHO we can work on the R package only after the core developers expose a C/C++ API for Python/R. Solutions that wrap the command line interface are not flexible, fast or interactive.
Hi everybody.
I've bundled my Rcpp bindings for LightGBM into a small R package.
Have a look at RLightGBM and enjoy testing :-)
bwilbertz did a good job, unfortunately it lacks several features (e.g. early stopping with eval set).
The official R package is under development now. I have finished the basic wrappers for Dataset and Booster. Currently it can handle lower-level training (not tested), and I will finish the high-level interface soon.
However, this is my first time using R, so I would like to ask for suggestions and comments. Check out https://github.com/Microsoft/LightGBM/tree/r-package and feel free to comment. Thanks!
@guolinke I took a look at lightgbm_R.cpp. Why are you not using Rcpp? It is much easier to work with and maintain. Is it the license?
@dmahugh I wanted to try Rcpp at first, but it seemed to have many issues. To provide a stable package, I went back to the low-level R extension API, and working with R extensions is not that heavy either.
However, the current implementation is based on R6, which also has some problems, so I may change Dataset and Booster to C++ objects.
I forgot to add the license; I will add it soon.
@guolinke I thought you were not using Rcpp because of its GPL v2 license compared to the MIT license of LightGBM. In fact Rcpp is very stable and heavily tested by the 800+ packages which rely on it...
@dselivanov I was wrong. Using Rcpp requires linking to its code (header files), so it still requires the GPL license. http://softwareengineering.stackexchange.com/questions/254737/does-an-rcpp-dependent-package-require-a-gpl-license https://cran.r-project.org/web/packages/Rcpp/vignettes/Rcpp-FAQ.pdf
And using R extensions also requires linking to GPL code: https://stat.ethz.ch/pipermail/r-help/2011-July/283188.html
Due to the license issue, I will write a simple ctypes-like function for R objects to avoid including R's header files. I am deleting the r-package branch for now and will add it back when this part is finished.
@guolinke I personally think that GPL>=2 will be OK... R itself is GPL v2, so I guess it shouldn't be a problem for people who use R itself.
@dselivanov We have a strict policy on using GPL-licensed code... I have also made some progress on accessing raw R objects in C, and will be back soon.
Almost all features are finished except CV (will add soon). You are welcome to try it and open PRs to refine it.
Closing this. Please open new issues if you need other features in the R package.
I would also like to invite one or two community members to help maintain the R package, including refining the R code and documentation, releasing the package on CRAN and handling issues/PRs related to R. Feel free to contact me if you would like to do it.
Was actually surprised to see that you haven't got one already, given that MS owns Revolution Analytics.
Granted, some of the multi-node features will be hard to integrate into R, but OpenMP-backed single-node training looks promising too, and I'd like to compare it to xgboost in our production workflow.
Let me know if you guys have internal plans for building an R package or whether you're waiting for the community to pitch in.