suiji / Arborist

Scalable decision tree training and inference.
Other
82 stars 14 forks source link

any progress of python wrapper #6

Closed fyears closed 7 years ago

fyears commented 8 years ago

Hi,

I just discover your project and I am interested in it. I notice that you provide the R wrapper but no Python wrapper now.

Is there any Python wrapper available now? Or is it unwritten yet? Actually I am interested in participate in this project and writing the Python wrapper, if you are still focusing on R wrapper now. :-)

suiji commented 8 years ago

There was a student scoping out the Python bridge last fall. He became busy with other commitments, though, so the project has moved back onto the TODO list. We could definitely use help with the Python wrapper.

The current plan is to implement the bridge using Cython, compiling it together with the Core code into a single shared library. This is how the R bridge is implemented - C++/Rcpp in that case - and we would like to follow the same model for all supported front ends, if possible. The bridge is a collection of fairly boring subroutines whose only task is to massage front-end data structures to and from their Core counterparts. We keep a lean front-end interface, as well, whose main task is argument checking.

Let me know if that sounds interesting, and thank you for the help.

fyears commented 8 years ago

That's exactly what I am thinking: a lean front-end and a language-unrelated back-end.

I believe the python bridge could be done via cffi (C++ interface to C interface needed though), and I think it is also nice to follow the interface style of scikit-learn.

I will have a deeper investigate on the project these days, and hopefully I would be able to get my hands dirty ASAP (because I am more familiar with Python than C++). :-)

suiji commented 8 years ago

CFFI is a possibility. The interface to the Core code employs STL vectors, so probably leans more toward C++ than it does toward C. We had at one point planned to use CFFI but Cython seemed to make more sense at the time. We can certainly revisit this. My only real concern is that the interface compiles with the Core into a single shared object, and does not make demands of the Core implementation. If you can do this while honoring the scikit-learn interface, then so much the better.

fyears commented 8 years ago

So now I am actually trying to do it (slowly though). How about calling it PyBorist. :-) Here are some of my thoughts and questions, could you provide some suggestions please?

I have walked through some of your source code (core and RCpp), and now I want to implement python regression interface firstly. Of course the most important functions are train and predict. So in my brain the whole picture is something like:

numpy array X and y
    -> use numpy cffi/ctypes interface to cast X and y into c style array
    -> use <vector> to cast the c style array to cpp vector
    -> call train::Init() and train::Regression() from cpp
    -> some cpp object as the trained model, storing all information 
        #TODO what and where in cpp and rcpp? what does it store?
    -> the python object ArboristModel, pointing to that cpp model object

ArboristModel and X_new
    -> use numpy cffi/ctypes interface to cast X_new into c style array
    -> use <vector> to cast the c style array to cpp vector
    -> call predict::Regression() from cpp, 
        using X_new_cpp and the cpp model inside ArboristModel
    -> cast the output vector from cpp vector to c array to numpy array

Is that sufficiently enough? But obviously your R interface does something more than that...

suiji commented 8 years ago

On 05/03/2016 10:15 PM, fyears wrote:

So now I am actually trying to do it (slowly though). How about calling it |PyBorist|. :-) Here are some of my thoughts and questions, could you provide some suggestions please?

Let's take this offline, in future, but here are some quick responses.
In fact, Pyborist was the name we proposed to use.

I have walked through some of your source code (core and RCpp), and now I want to implement python regression interface firstly. Of course the most important functions are |train| and |predict|. So in my brain the whole picture is something like:

numpy array X and y -> use numpy cffi/ctypes interface to cast X and y into c style array -> use to cast the c style array to cpp vector -> call train::Init() and train::Regression() from cpp All sounds doable within Python bridge, Why not just use Cython, which is supposed to be able to handle C++ directly.? -> some cpp object as the trained model, storing all information #TODO what and where in cpp and rcpp? what does it store? Core builds several vectors, some consisting of packed structures and some containing basic types. See class ForestNode, in particular. In particular, Rcpp simply wraps Core-defined data structures when there is no equivalent basic type.
-> the python object ArboristModel, pointing to that cpp model object
Rborist (i.e., the R implementation) stores the trained model in memory allocated by the running R session. That is, Core fills in vectors supplied by the front end, so there really isn't a c++ model object.
It's up to the respective front end to wrap together the components.
ArboristModel and X_new -> use numpy cffi/ctypes interface to cast X_new into c style array -> use to cast the c style array to cpp vector -> call predict::Regression() from cpp, using X_new_cpp and the cpp model inside ArboristModel Sounds analagous to what you want to do for training. Again, though, the "ArboristModel" would need to be allocated and populated from the Python front end as a collection of packed and basic vectors. -> cast the output vector from cpp vector to c array to numpy array Same argument as above: Python supplies the raw vectors filled in by the Core, not the other way around.

Is that sufficiently enough? But obviously your R interface does something more than that...

The R interface does some argument checking and computes argument defaults. There is a piece that breaks apart the R "data.frame"; this piece should be done in the Rcpp code, ultimately. In Python there will be the analagous problem of processing the Panda data-frames.

Thanks for your help. Please feel free to follow off-line.

— You are receiving this because you commented. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-216746278

suiji commented 8 years ago

Did it again! Thunderbird displays the quoted material in the sent mail, but the Github version does not display it. Repeating the last mail, with quotes intact:

numpy array X and y -> use numpy cffi/ctypes interface to cast X and y into c style array -> use to cast the c style array to cpp vector -> call train::Init() and train::Regression() from cpp

All sounds doable within Python bridge, Why not just use Cython, which is supposed to be able to handle C++ directly.? In any case, the bridge and core should be compiled together into a shared object invoked from a running front-end executable.

-> some cpp object as the trained model, storing all information

TODO what and where in cpp and rcpp? what does it store?

-> the python object ArboristModel, pointing to that cpp model object

Core builds several vectors, some consisting of packed structures and some containing basic types, that is, types supported by the front end. This will vary, of course, with the front end. See class ForestNode, for example. In particular, when there is no equivalent front-end type, Rcpp casts via its 'Xptr' mechanism, exposing it to R as an externally-defined class. We need something analagous from Python and the bridge.

-> the python object ArboristModel, pointing to that cpp model object

Rborist (i.e., the R implementation) stores the trained model in memory allocated by the running R session. That is, Core fills in vectors supplied by the front end, so there really isn't a c++ model object. It's up to the respective front end to wrap together the components.

ArboristModel and X_new

By 'X_new', I assume you mean separately-tested data?

-> use numpy cffi/ctypes interface to cast X_new into c style array -> use to cast the c style array to cpp vector -> call predict::Regression() from cpp, using X_new_cpp and the cpp model inside ArboristModel

Sounds analagous to what you want to do for training. Again, though, the "ArboristModel" would need to be allocated and populated from the Python front end as a collection of packed and basic vectors.

-> cast the output vector from cpp vector to c array to numpy array

Same argument as above: Python supplies the raw vectors filled in by the Core, not the other way around.

Is that sufficiently enough? But obviously your R interface does something more than that...

The R interface does some argument checking and computes default values. There is a piece that breaks apart the R "data.frame"; this piece will eventually be implemented by the bridge code instead. In Python there will be the analagous problem of processing the Panda data-frames.

Thanks for your help. Please feel free to follow off-line.

fyears commented 8 years ago
  1. So how to contact you off-line? info # suiji.org?
  2. Yes, cython is the best and similar to RCpp AFAIK, we should use it. ), and currently I am just trying to map the cpp class to python class as cython suggests. And yes I know the R interface does some checking but I am concentrate on the "legal input" now.
  3. Currently I start working at my fork (far from completeness and not runnable now...). But if you would like to set up a github organization account and assign me the privilege to manipulate the repo, that will be great. (i am not quite sure whether @suiji is your personal github account though.)
  4. Would you agree that we release the python bridge under MPL or BSD or MIT license instead of GPL?
suiji commented 8 years ago

Yes, please start a thread on info@suiji.org, and I will cc to an internal address.

MPL-2 is fine for the bridge, as is GPL. I am not as familiar with the other licenses.

On 05/04/2016 02:47 PM, fyears wrote:

1.

So how to contact you off-line? |info # suiji.org|?

2.

Yes, |cython| is the best and similar to |RCpp| AFAIK, we should
use it. ), and currently I am just trying to map the cpp class to
python class as |cython| suggests. And yes I know the R interface
does some checking but I am concentrate on the "legal input" now.

3.

Currently I start working at my fork
<https://github.com/fyears/Arborist> (far from completeness and
not runnable now...). But if you would like to set up a github
organization account and assign me the privilege to manipulate the
repo, that will be great. (i am not quite sure whether @suiji
<https://github.com/suiji> is your personal github account though.)

4.

Would you agree that we release the python bridge under MPL or BSD
or MIT license instead of GPL?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-217013753

ctbrown commented 8 years ago

Please, please keep discussion relevant to this thread online and not offline. There are those of us following the discussion, though not actively participating.

FWIW, Cython make more sense to me.

C-

On Thu, May 5, 2016 at 9:24 AM, suiji notifications@github.com wrote:

Yes, please start a thread on info@suiji.org, and I will cc to an internal address.

MPL-2 is fine for the bridge, as is GPL. I am not as familiar with the other licenses.

On 05/04/2016 02:47 PM, fyears wrote:

1.

So how to contact you off-line? |info # suiji.org|?

2.

Yes, |cython| is the best and similar to |RCpp| AFAIK, we should use it. ), and currently I am just trying to map the cpp class to python class as |cython| suggests. And yes I know the R interface does some checking but I am concentrate on the "legal input" now.

3.

Currently I start working at my fork https://github.com/fyears/Arborist (far from completeness and not runnable now...). But if you would like to set up a github organization account and assign me the privilege to manipulate the repo, that will be great. (i am not quite sure whether @suiji https://github.com/suiji is your personal github account though.)

4.

Would you agree that we release the python bridge under MPL or BSD or MIT license instead of GPL?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-217013753

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-217200977

suiji commented 8 years ago

O.k., will do.

Thanks.

On 05/05/2016 10:29 AM, Christopher Brown wrote:

Please, please keep discussion relevant to this thread online and not offline. There are those of us following the discussion, though not actively participating.

FWIW, Cython make more sense to me.

C-

On Thu, May 5, 2016 at 9:24 AM, suiji notifications@github.com wrote:

Yes, please start a thread on info@suiji.org, and I will cc to an internal address.

MPL-2 is fine for the bridge, as is GPL. I am not as familiar with the other licenses.

On 05/04/2016 02:47 PM, fyears wrote:

1.

So how to contact you off-line? |info # suiji.org|?

2.

Yes, |cython| is the best and similar to |RCpp| AFAIK, we should use it. ), and currently I am just trying to map the cpp class to python class as |cython| suggests. And yes I know the R interface does some checking but I am concentrate on the "legal input" now.

3.

Currently I start working at my fork https://github.com/fyears/Arborist (far from completeness and not runnable now...). But if you would like to set up a github organization account and assign me the privilege to manipulate the repo, that will be great. (i am not quite sure whether @suiji https://github.com/suiji is your personal github account though.)

4.

Would you agree that we release the python bridge under MPL or BSD or MIT license instead of GPL?

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-217013753

— You are receiving this because you are subscribed to this thread. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-217200977

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/suiji/Arborist/issues/6#issuecomment-217218122

fyears commented 8 years ago

Hi,

The current (20160519) progress:

Currently I could feed numpy arrays (X, y) into train() to generate something like trained_info and feed trained_info and X_test into predict(). But the final y_pred is kind of weird (always zero). I am still debugging. So I am not make pull request right now.

I have some questions:

  1. In C++ level, how is the _feNum, which is mathematically the numeric X matrix, represented? I mean, such that

    #python or r
    X = np.array([[1,2,3],
                 [4,5,6]])

    Is feNum equivalent as [1, 2, 3, 4, 5, 6] (row-based) or [1, 4, 2, 5, 3, 6] (column-based)?

suiji commented 8 years ago

It sounds like you are making good progress.

Concerning the 'feNum' array, storage should be row-major, as in native C/C++. This may be counter to how Python and other front ends hold the data, but it is suitable for the sorts of things the Core code must do.

Regards, mls

On 2016-05-19 20:40, fyears wrote:

Hi,

The current (20160519) progress:

Currently I could feed numpy arrays (X, y) into train() to generate something like trained_info and feed trained_info and X_test into predict(). But the final y_pred is kind of weird (always zero). I am still debugging. So I am not make pull request right now.

I have some questions:

*

In C++ level, how is the _feNum, which is mathematically the numeric X matrix, represented? I mean, such that

python or r

X = np.array([[1,2,3], [4,5,6]])

Is feNum equivalent as 1, 2, 3, 4, 5, 6 or 1, 4, 2, 5, 3, 6?

You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub [1]

*

Links:

[1] https://github.com/suiji/Arborist/issues/6#issuecomment-220510037

fyears commented 8 years ago

@ctbrown @suiji

Hi,

Just to let you know that it's possible to run the Python version now! :smile:

If you are interested, please look at my repo and:

pip install numpy
pip install 'cython>=0.24'

git clone https://github.com/fyears/Arborist.git
cd Arborist/ArboristBridgePy/
python setup.py install

Then open python and

from pyborist import PyboristClassifier, PyboristRegressor

So there you go.

Some issues:

  1. In train.cc and predict.cc, min() and max() functions are not defined in Windows msvc. But Linux g++ meets no problem. (why?) If install in Windows, I have to add #include<algorithm> or some macros manually before running python setup.py install.
  2. Windows msvc openmp will raise error C3016: index variable in OpenMP 'for' statement must have signed integral type. So openmp is not enabled in Windows. To fix this, maybe any unsigned int variables in openmp for loop should be changed to something like long. Currently 36 code snippets have #pragma.
  3. No validation now. No quantiles regression now.
  4. I release the pyborist in MIT License because I think it is more flexible and friendly. I strongly recommend you switch the license of AboristCore from MPL to MIT License. Why? Compared with MPL, MIT is much simpler, shorter, and is more friendly to commercial usage (though I do not have this personal plan). As long as both of you agree you could simply change the license because right now only you two are contributors to core source code.

Best wishes.

suiji commented 8 years ago

Fyears,

Thanks very much for your contribution. We will review the changes but, since they are confined to the Bridge, there should not be a problem with the pull.

Let me see whether including causes any trouble in the Core. If not, I will make the corresponding change there. I'm also curious about OpenMP's not liking the unsigned type: could simply be my own ignorance, of course.

Do you want your real name included in our acknowledgments, or do you prefer to remain anonymous?

Regards, mls

On 2016-05-21 21:38, fyears wrote:

@ctbrown [1] @suiji [2]

Hi,

Just to let you know that it's possible to run the Python version now! 😄

If you are interested, please look at my repo [3] and:

pip install numpy pip install 'cython>=0.24'

git clone https://github.com/fyears/Arborist.git cd Arborist/ArboristBridgePy/ python setup.py install

Then open python and

from pyborist import PyboristClassifier, PyboristRegressor

So there you go.

Some issues:

*

In train.cc and predict.cc, min() and max() functions are not defined in Windows msvc. But Linux g++ meets no problem. (why?) If install in Windows, I have to add #include or some macros manually before running python setup.py install. *

Windows msvc openmp will raise error C3016: index variable in OpenMP 'for' statement must have signed integral type. So openmp is not enabled in Windows. To fix this, maybe any unsigned int variables in openmp for loop should be changed to something like long. Currently 36 code snippets have #pragma. *

No validation now. No quantiles regression now. *

I release the pyborist in MIT License because I think it is more flexible and friendly. I strongly recommend you switch the license of AboristCore from MPL to MIT License. Why? Compared with MPL, MIT is much simpler, shorter, and is more friendly to commercial usage (though I do not have this personal plan). As long as both of you agree you could simply change the license because right now only you two are contributors to core source code.

Best wishes.

You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub [4]

*

Links:

[1] https://github.com/ctbrown [2] https://github.com/suiji [3] https://github.com/fyears/Arborist/tree/master/ArboristBridgePy [4] https://github.com/suiji/Arborist/issues/6#issuecomment-220813938

fyears commented 8 years ago
  1. Please play around it. I can just ensure that default parameters (and/or valid) meet no problems... And I successfully ran the 100k dataset from benchm-ml with default parameters.

I will attempt to run it with some smaller examples. If you have been able to run the flight-delay data sets successfully, though, then we are likely to be in good shape.

  1. I don't know where do min and max from while using gcc. Anyways I have not changed anything in core code in my github repo. I think directly really defining the functions or using macros is sufficient.

I changed to the core files to include , then recast the calls as std::min() and std::max() to be clear about which ones are being invoked. It remains to be seen how this builds with msvc. I should have an msvc development system available some time in the next month, but don't have anything handy right now. I do expect to push another round of changes to Github in the next week to ten days, however, and these changes will be included.

  1. I did some searches, and found that unsigned type in for loop is supported after openmp v3.0. But the latest msvc only support openmp v2.0. Thus you may be interested to cast something from unsigned int to long. No other way to work around it, and trust me that cygwin support for Python sucks...

Those changes have also been included in the Core, and are subject to the same delayed verification on msvc mentioned above.

You had suggested moving the Core to an MIT license. Believe it or not, this is out of my hands. The code was open-sourced following an agreement between a former business associate and me. The written agreement was that it be release under MPL-2 license. My unilaterally altering the terms of the license could be viewed as bad faith.

suiji commented 8 years ago

I have been traveling all week, and have not had an opportunity to review your changes yet. You should expect some feedback over the long weekend.

One quick question: You mentioned that validation was not ready yet. Did you mean that the trained forest cannot be tested on the out-of-bag samples?

fyears commented 8 years ago

The no_validate parameter has no effects. In other words, Rborist.default.R line #171 ~ #197 have NOT been ported (yet).

Oh one more thing, I treated all the inputs as numeric (usual way in scikit-learn) because a matrix is just a numeric matrix and no way to distinguish the factor columns from others. A considered workaround in the future is adding a new specific parameter factor_columns_indices = np.array([]) or even accepting pandas.DataFrame.

suiji commented 8 years ago

The no_validate parameter has no effects. In other words, Rborist.default.R line #171 ~ #197 have NOT been ported (yet).

Yes, this makes sense: reasonable stopping point.

Oh one more thing, I treated all the inputs as numeric (usual way in scikit-learn) because a matrix is just a numeric matrix and no way to distinguish the factor columns from others. A considered workaround in the future is adding a new specific parameter factor_columns_indices = np.array([]) or even accepting pandas.DataFrame.

We have planned all along to accommodate the Pandas DataFrame feature. This was mentioned in the PyData talk. It is also pointed out in an earlier message in this thread. It is perfectly reasonable to emulate Scikit-Learn as an initial implementation, but we would ultimately like to provide full functionality irrespective of the language front end.

suiji commented 8 years ago

Your initial release looks pretty good. I just have a couple of questions: i) Is it possible to stay closer to the format currently used for R, namely: Directory "FrontEnd" would have pure Python code. Directory "Shared" would have compiled code.

ii) The call-back pre-defines a random-variate generator: In future, will it be possible for a user running, say, NumPy, to inherit the generator already in use for the session? - i.e., as Rborist now does for R?

Regards, mls

fyears commented 8 years ago
  1. Sure, but maybe now we don't need to do that. Currently I am following the usual python project structure. Yeah of course it might be possible and logical to group all the pxd and pyx files into some folder like ArboristBridge/pyborist/corelib. I am not doing so now because I prefer a flatter project structure and currently only an extra skl.py is needed. Well, in the future, if more files are introduced, we will difinitely split the files into different folders.

Makes sense. We can make more plans as the work progresses.

(BTW, since r packages also have the guildlines that folder R is for .r source code and folder src is for c/cpp source code, you may consider just following the r style project structure......Frontend/Shared folders and an extract "rebuild-project-structure" script is kind of unusual IMO. )

It is no doubt unusual from the perspective of a single language. We are maintaining a multiple-language project, however, in which each spin is built by compiling a common core together with selected elements from a front-end bridge. Since this pattern will likely be repeated in every front end we support, I believe that the FrontEnd/Shared model for the respective bridges is highly intuitive. In particular, this is the sort of approach I have encountered in nearly every compiler project I worked on over a span of 20+ years: various companies, various languages. I am not wedded to the style, but have found it simple and useful.

  1. The key point (main difficulty) is that Python doesn't has convenient way to directly be written in c/cpp, while r has super-powered rcpp. Moreover numpy is not "core of python". Further it also needs extra cares while compiling the python package. Currently it's simple: build cpp library, build python/cython library, link them, done. But with a python-enables callback.cc, we need to compile the callback.cc with special arguments, then cpp library and link callback.o, then python/cython, then link.

Yes, this was my understanding of the Python ecosystem, as well: i.e., Numpy and Pandas are not core. For more R-like functionality, though, we will probably need to support them.

BTW: Your pull request is accepting. Thank you.

suiji commented 7 years ago

Merging to commence a new thread.