suiji / Arborist

Scalable decision tree training and inference.

State of Py-lang support #32

Open suiji opened 7 years ago

suiji commented 7 years ago

Rudimentary Python support appears to be available. Thanks go out to @mjuarezm for independently testing the code and, especially, to @fyears for the working implementation. Kudos also to Chetan Mishra, who first studied the problem.

The working implementation is built from an older version of the Core code, so it suffers from some important deficiencies:

- Sparse data is not supported. One-hot encodings, in particular, will run more slowly and consume much more memory than under the R counterpart.

- Restaging employs an older greedy algorithm. High-dimensional data sets will run noticeably slower in the Py versions, especially in classification.

- More recent versions of the Core have cut down significantly on irregular memory accesses, so the Py versions are not expected to scale as well.

A peppier version of Pyborist, then, awaits some refactoring.

suiji commented 5 years ago

With the recent rewrite of the R bridge, the task of providing a full-fledged Python interface has been greatly simplified. The existing Pyborist bridge has been tagged and archived, and Hawk is building a new bridge interfacing with version 0.2-0 of the Core.

arainboldt commented 4 years ago

Hi,

I would like to experiment with the Pyborist package. Is there a viable bridge available? I don't see any contents in the Pyborist directory of this repo.

Best,

Andrew

suiji commented 4 years ago

Pyborist is not yet ready for prime time. The author plans to publish on a separate account, which we will link once available.

For what it's worth, the core interface to the bridge, i.e., the C++ API, is complete. We don't anticipate making changes unless the Pyborist maintainers encounter problems.

The author plans to publish Pyborist and deFramePy as two separate packages, the former depending on the latter. We plan to do something similar with Rborist, namely, ship a separate deFrameR package on which it depends. Anyway, you can expect deFramePy to appear first, with Pyborist following shortly thereafter.

Obviously, we want this to happen sooner rather than later. The project had been dormant for at least two years, and the original authors have moved on; only relatively recently has someone stepped up with an interest in writing a new bridge. Freezing the C++ API, though, has been a boon to that effort.

dorian821 commented 4 years ago

Awesome! Glad to hear that the project has resumed. I'm interested in using this implementation for permuted random forests. Given your experience, do you think there are any ways to optimize the forest generation (in terms of runtime) for producing relative feature importances, as opposed to prediction accuracy?

suiji commented 4 years ago

Thank you for chiming in. It's nice to know there are people who still want the Python version.

For permutation-based feature importance, I assume you mean the slow process of retraining once per predictor, with a different shuffled column each time. If so, then you probably want to economize on training time.

There are at least a few ways to do this, some of which might be described in the vignette. Several involve building smaller trees. The number of levels can be constrained, for example, by specifying a suitable 'nLevel' value. The relative information threshold can be capped using 'minInfo', which specifies a ratio below which splitting will not take place. Finally, 'minNode' sets a minimum on the size of nodes undergoing splitting. You probably do not want to use 'maxLeaf' for this purpose, though: while it does result in smaller trees, it is a post-training procedure.
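By way of illustration, here is a minimal R sketch of the smaller-tree approach, using the Rborist entry point and the parameters named above; the data and the particular values are made up for the example:

```r
library(Rborist)

# Toy regression data, purely for illustration.
x <- data.frame(matrix(rnorm(1000 * 20), nrow = 1000))
y <- rnorm(1000)

# Smaller trees: cap depth, require a minimum information ratio to split,
# and stop splitting nodes below a minimum size.
rb <- Rborist(x, y,
              nLevel  = 6,     # maximum number of levels per tree
              minInfo = 0.01,  # ratio below which splitting does not take place
              minNode = 10)    # smallest node eligible for splitting
```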

Another way to train more quickly is to do less of it. Feature sampling can be reduced by specifying a lower probability of sampling, via 'predProb', or by specifying a smaller 'predFixed' value (known by other packages as 'mtry'). The Ranger package, for example, employs a default 'mtry' value equal to the square root of the number of predictors - even in the regression case - yet remains both fast and accurate. Observation sampling, similarly, can be capped by specifying a smaller value for 'nSamp'.
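And a sketch of the reduced-sampling approach, in the same hedged spirit (values again illustrative):

```r
library(Rborist)

x <- data.frame(matrix(rnorm(1000 * 20), nrow = 1000))
y <- rnorm(1000)

# Less training work per tree: fewer candidate predictors per split
# ('mtry'-style) and fewer observations sampled per tree.
rb <- Rborist(x, y,
              predFixed = floor(sqrt(ncol(x))),  # fixed predictor count per split
              nSamp     = 500)                   # observations sampled per tree
```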

dorian821 commented 4 years ago

Hi, thanks for the suggestions and the quick reply. Yes, I'm interested in that (currently) extremely slow process of training hundreds to thousands of random forests with a permuted response, as opposed to iteratively permuting features. The process is very slow with the packages available for Python, and I'm hoping that by digging a bit deeper and stripping down the process we might be able to speed it up. Your suggestions using the exposed parameters are good, and that's as far as I've gotten, but I'm curious whether any parts of the underlying tree-building process might be trimmed down to help speed things up.

suiji commented 4 years ago

If you are training over the same observations many times, you probably want to preformat the data first. This gets the observations into a normal form once, right from the start, rather than recomputing the same form at each training invocation. This is what we advise users of caret to do, for example.
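A minimal sketch of that workflow in R, assuming Rborist's PreFormat() helper; the permuted-response loop is illustrative, matching the use case above:

```r
library(Rborist)

x <- data.frame(matrix(rnorm(1000 * 20), nrow = 1000))
y <- rnorm(1000)

# Normalize the observations once, up front.
pf <- PreFormat(x)

# Reuse the preformatted data across many trainings, here with a
# freshly permuted response each time.
forests <- lapply(1:100, function(i) Rborist(pf, sample(y)))
```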

Beyond that, performance tends to be a function of your data: regression vs. classification; factor vs. numerical vs. mixed observations; wide vs. tall vs. square design matrix. You may be able to save some time by simplifying the splitting function, but, at least as the observation count grows, the most expensive operation is repartitioning the data: ultimately, you are dealing with irregular memory access.

suiji commented 7 months ago

The inter-language bridge seems to be maturing. We're open to suggestions for a Python interface look-and-feel.