imbs-hl / ranger

A Fast Implementation of Random Forests
http://imbs-hl.github.io/ranger/

Does the C++ implementation support sparse data? How to read them in? #352

Open UnixJunkie opened 6 years ago

UnixJunkie commented 6 years ago

Hello,

Very nice software. I will give it a try and may write a thin OCaml wrapper if it works well (I will also cite it).

Hence, I have several questions:

So, this is more a request for some more documentation than a real issue/bug report. Hope you don't mind.

Best regards, Francois.

UnixJunkie commented 6 years ago

Currently, I am interested in classification and regression, not survival.

UnixJunkie commented 6 years ago

Related to https://github.com/imbs-hl/ranger/issues/305 (a simple usage example with a matching example input file).

UnixJunkie commented 6 years ago

If you plan to support a sparse file format, I recommend the CSR file format. For example: https://github.com/UnixJunkie/orrandomForest/blob/master/data/Boston_test_features.csr. Each line is one sample, and each entry in a line is a column index, then ':', then the value of that feature for the current line. All entries not listed are assumed to be 0.
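To make the format concrete, here is a minimal C++ sketch that parses one such line. It is not ranger code: `SparseRow`, `parse_sparse_line`, and `to_dense` are hypothetical names, and it assumes whitespace-separated `index:value` tokens with zero-based column indices (the linked file may number columns differently).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// One sparse row: (column index, value) pairs; columns not listed are 0.
using SparseRow = std::vector<std::pair<std::size_t, double>>;

// Hypothetical helper: parse a line such as "0:1.5 3:2 7:0.25".
// Assumption: tokens are whitespace-separated, indices are zero-based.
SparseRow parse_sparse_line(const std::string& line) {
  SparseRow row;
  std::istringstream tokens(line);
  std::string token;
  while (tokens >> token) {
    const std::size_t colon = token.find(':');
    if (colon == std::string::npos || colon == 0 ||
        colon + 1 == token.size()) {
      throw std::invalid_argument("expected index:value, got '" + token + "'");
    }
    const std::size_t col = std::stoul(token.substr(0, colon));
    const double value = std::stod(token.substr(colon + 1));
    row.emplace_back(col, value);
  }
  return row;
}

// Expand a sparse row into a dense vector with num_cols entries.
std::vector<double> to_dense(const SparseRow& row, std::size_t num_cols) {
  std::vector<double> dense(num_cols, 0.0);
  for (const auto& entry : row) {
    assert(entry.first < num_cols);  // index must fit the declared width
    dense[entry.first] = entry.second;
  }
  return dense;
}
```

With these assumptions, `parse_sparse_line("0:1.5 3:2")` yields two pairs, and `to_dense` of that row with 4 columns gives 1.5, 0, 0, 2.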

mnwright commented 6 years ago

We have support for sparse data - but only in the R version. It's very easy to use, see https://github.com/imbs-hl/ranger/issues/135#issuecomment-293284786 for an example.

It's probably not that hard to include sparse data in the pure C++ version. We already have a DataSparse class using Eigen, see https://github.com/imbs-hl/ranger/blob/master/src/DataSparse.h and https://github.com/imbs-hl/ranger/blob/master/src/DataSparse.cpp. We just have to fill that with some data. Unfortunately I don't have the time for this at the moment. Feel free to create a pull request. ;)
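One way to picture "filling it with some data": read the sparse file into (row, column, value) triplets plus the matrix dimensions, which is the same information Eigen's `setFromTriplets` consumes when building a sparse matrix. The sketch below uses only the standard library and is not taken from ranger or `DataSparse`. `Triplet`, `SparseTable`, and `read_sparse` are hypothetical names, and the `index:value` line format with zero-based indices is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// One non-zero cell of the data matrix.
struct Triplet {
  std::size_t row;
  std::size_t col;
  double value;
};

// All non-zero cells plus the inferred matrix shape.
struct SparseTable {
  std::vector<Triplet> entries;
  std::size_t num_rows = 0;
  std::size_t num_cols = 0;  // largest column index seen, plus one
};

// Hypothetical reader: one sample per line, tokens of the form index:value.
SparseTable read_sparse(std::istream& in) {
  SparseTable table;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream tokens(line);
    std::string token;
    while (tokens >> token) {
      const std::size_t colon = token.find(':');
      if (colon == std::string::npos) {
        throw std::invalid_argument("expected index:value, got '" + token + "'");
      }
      const std::size_t col = std::stoul(token.substr(0, colon));
      const double value = std::stod(token.substr(colon + 1));
      table.entries.push_back({table.num_rows, col, value});
      table.num_cols = std::max(table.num_cols, col + 1);
    }
    ++table.num_rows;  // an empty line counts as an all-zero row
  }
  return table;
}
```

Inferring the column count from the largest index seen is a design shortcut; a real reader would probably take the number of columns from a header or an argument so that training and test files agree.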

Btw., these two files are under the GPL license because they are currently used only in the R version. If required, I can change them to MIT; I don't see any GPL dependencies there.

Regarding the example file, you already found #305. I have renamed that issue.

UnixJunkie commented 6 years ago

Can you point me to the code that does the data file reading for the C++ version? I guess that's where I should make changes to support a new format. I might have a look at it, but I doubt I can contribute such a big feature. My C++ is also quite rusty.

mnwright commented 6 years ago

That's here: https://github.com/imbs-hl/ranger/blob/9490a26e1d11ec57949db033c992ebf3a631a2a9/src/Data.cpp#L43