suiji / Arborist

Scalable decision tree training and inference.

No code example #1

Closed pommedeterresautee closed 8 years ago

pommedeterresautee commented 9 years ago

The R documentation doesn't include a code example that is easy to execute. There is no demo code either.

suiji commented 9 years ago

Pommedeterresautee,

There is demo code in the "tests" subdirectory, admittedly minimal. The intent is to continue adding examples. If there is something specific you would like to see in the meantime, please let me know.

Thank you, Author

pommedeterresautee commented 9 years ago

Hi @suiji, thanks for answering.

I have seen the tests, but I must agree they are a bit minimal :-)

I am working on the R Xgboost package right now. I mainly do feature importance analysis with it, and I have added several features to the package for that purpose. For a long time I have wanted to try the same kind of analysis on random forests. Unfortunately, none of the existing R packages can handle large datasets (meaning 4-5 GB) with more than a million observations and far more than a million (binary) features. The Xgboost R package works pretty well for these datasets (someone just reported that a 200 GB dataset works without using the Hadoop interface).

I have noticed that you have parallelized your package, and I am wondering whether it can handle such sizes. If so, I would be happy to compare feature importance analysis based on RF with one based on boosting.

A basic example with real data in the documentation would therefore help a lot, so that I can quickly test large sizes with my own data.

Kind regards, Michael

suiji commented 9 years ago

Michael,

The memory footprint is about 10x the size of the data, so a 5 GB data set should be well within the reach of a modern server configuration. If the data is compressed, however, it is possible that you may exhaust memory. We have tested versions of the Arborist with on the order of 10^8 rows and 10-20 predictors, as well as 10^3 rows with 10^6 predictors, without incident. There is no a priori reason that a problem on the order of 10^6 x 10^6 should fail.
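
As a back-of-envelope illustration of the 10x rule of thumb above, here is a minimal Python sketch. It assumes the factor applies to the uncompressed size of the data. The function names and the 64 GB server figure are hypothetical, chosen only for illustration, and the factor of 10 is the approximate figure quoted above rather than a measured property of the package.

```python
def estimate_footprint_gb(data_gb, factor=10.0):
    """Rough peak-memory estimate: uncompressed data size times a rule-of-thumb factor."""
    return data_gb * factor


def fits_in_memory(data_gb, ram_gb, factor=10.0):
    """Return True if the estimated footprint does not exceed the available RAM."""
    return estimate_footprint_gb(data_gb, factor) <= ram_gb


if __name__ == "__main__":
    data_gb = 5.0   # data set size mentioned in the thread
    ram_gb = 64.0   # hypothetical server RAM, an assumption for this example
    print("estimated footprint (GB):", estimate_footprint_gb(data_gb))
    print("fits in memory:", fits_in_memory(data_gb, ram_gb))
```

For compressed input, the size after decompression is the number to plug in, which is why a compressed 5 GB file may still exhaust memory.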

The current implementation is in-memory, but longer-term goals include out-of-memory solutions, cluster support, etc. It's mainly a matter of people using the software in more diverse situations and asking for these features.

Can we continue our conversation by private e-mail, in order to collaborate on suitable examples?

Thank you, Mark Seligman

suiji commented 8 years ago

Documentation has been revised to include examples of all options and return types. Vignette projects are ongoing.

Closing, but feel free to re-open with related concerns.