suiji / Arborist

Scalable decision tree training and inference.

Does specifying regMono really ensure monotonic decrease/increase? #50

Open csetzkorn opened 4 years ago

csetzkorn commented 4 years ago

I have one numeric IV and I used -1 for this IV in the vector for regMono. I then plotted the DV's profile for different IV values, keeping everything else fixed. Unfortunately, the values do not always decrease monotonically for every value of the IV. I cannot really create a reproducible example as the data is confidential. Is there actually a guarantee that using -1 in regMono for an IV ensures a monotonic decrease all the time?
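
(For illustration only, a synthetic stand-in for the kind of profile check described above, not the confidential data. The exact argument names, in particular that regMono is given per predictor column and that predict() returns point predictions in a yPred component, are assumptions to verify against the current Rborist documentation.)

```r
library(Rborist)

set.seed(1)
n  <- 500
x1 <- runif(n)                       # the numeric IV to be constrained (decreasing)
x2 <- runif(n)
y  <- 2 - 1.5 * x1 + 0.5 * x2 + rnorm(n, sd = 0.1)
X  <- data.frame(x1 = x1, x2 = x2)

## regMono aligned with the columns of X: -1 requests a monotone decrease
## in x1, 0 leaves x2 unconstrained.
fit <- Rborist(X, y, regMono = c(-1, 0))

## Profile the fitted DV over x1, holding x2 fixed at its median.
grid <- data.frame(x1 = seq(0, 1, length.out = 50), x2 = median(x2))
prof <- predict(fit, grid)$yPred     # yPred component assumed for regression
plot(grid$x1, prof, type = "l", xlab = "x1", ylab = "predicted y")
all(diff(prof) <= 0)                 # FALSE would reproduce the behavior reported here
```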

suiji commented 4 years ago

In short, no, this is not true isotonic regression. I believe that even the simple Iris example you provided illustrates this. The constraint is only enforced locally: when trial splits are evaluated for the constrained variable, those splits violating the constraint are rejected, but splits of other variables are not checked against that constraint.
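
A minimal sketch of that local check (illustrative R only, not the Arborist C++ internals, which operate on pre-ordered observation sets):

```r
## Accept or reject a trial split of a constrained predictor, given the
## responses falling to the left/right children; 'constraint' is -1, 0 or +1.
local_split_ok <- function(y_left, y_right, constraint) {
  if (constraint == 0) return(TRUE)        # unconstrained predictor
  delta <- mean(y_right) - mean(y_left)    # right child holds the larger predictor values
  sign(delta) * constraint >= 0            # reject splits that reverse the requested trend
}

## Only trial splits of the *constrained* variable pass through this check;
## splits of other variables are not re-checked against it, which is why the
## fitted surface can still move non-monotonically overall.
```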

It would be fairly straightforward to provide a post-splitting pass that rechecks all constraints over all splits. This would enforce the constraints globally, but would yield suboptimal splits, as good splits satisfying all constraints would already have been overlooked. The optimal approach may in fact be to check the entire set of constraints at each trial split. This can be done, but is likely to be very slow, requiring a trial repartitioning of the data for each constraint.

Have you tried giving extra weight to the constrained predictors? This is, admittedly, just a heuristic, but should result in those predictors' having more influence in the model. This may not help, though, if the predictors are working at cross purposes.
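
If it helps, the weighting heuristic might look something like the following; the predWeight argument name and its semantics (relative weighting of predictors as splitting candidates) are assumptions to verify against the current documentation.

```r
## Hypothetical example: up-weight the constrained predictor x1 so it is
## offered as a splitting candidate more often than x2 (argument name assumed).
fit_w <- Rborist(X, y, regMono = c(-1, 0), predWeight = c(4, 1))
```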

suiji commented 4 years ago

The putative optimal approach need not be as slow as first suggested. If, instead of just finding a "best" local split, we provided a set of splits ordered by quality, it would then be possible to enforce the global constraints after splitting. This would still entail generating trial repartitions, but would not be as costly as attempting to enforce all constraints locally.

csetzkorn commented 4 years ago

OK thanks. I also have this issue now:

https://stackoverflow.com/questions/57887801/upgrade-r-version-for-machine-learning-server-for-windows

So I possibly have to find another package/solution anyway.

Just wondering, if monotonicity cannot be ensured, what is the advantage of Rborist over, for example:

https://cran.r-project.org/web/packages/quantregForest/quantregForest.pdf

I was trying to use quantile regression and log-transforming my dependent variable. This kind of ensures monotonicity. Unfortunately, I end up with singular matrices, so quantile regression does not work/converge ...

Sorry this ended up being a conversation rather than an issue thread ...

suiji commented 4 years ago

The way we estimate quantiles is similar to Meinshausen's, possibly even the same. Ranger is doing something similar, according to their website. The primary advantage over quantregForest ("QRF"), then, is speed, as QRF employs the randomForest package. Ranger also has a reputation for speed, although Rborist tends to be faster as observation count increases. Further, we retain the potential to do "true" quantile regression, as outlined by Athey, Wager et al. in their paper on Generalized Random Forests, as well as to support PRIM, BART, CART and a host of other recursive partitioning algorithms. So both speed and extensibility are important calling cards.
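
For readers landing here, the Meinshausen idea can be sketched generically as follows; this is an illustration of the approach, not Rborist's actual implementation. Each tree drops a new point into a terminal node, the training responses sharing that node are weighted, and quantiles are read off the resulting weighted empirical distribution.

```r
## Generic Meinshausen-style quantile estimate for one new observation.
## leaf_train: n_train x n_tree matrix of terminal-node ids for the training rows
## leaf_new  : length-n_tree vector of terminal-node ids for the new point
quantile_from_forest <- function(leaf_train, leaf_new, y_train,
                                 probs = c(0.1, 0.5, 0.9)) {
  n_tree <- ncol(leaf_train)
  w <- numeric(length(y_train))
  for (t in seq_len(n_tree)) {
    in_leaf <- leaf_train[, t] == leaf_new[t]
    w[in_leaf] <- w[in_leaf] + 1 / sum(in_leaf)  # each tree spreads unit weight over its leaf
  }
  w   <- w / n_tree                              # average weights across trees
  ord <- order(y_train)
  cw  <- cumsum(w[ord])                          # weighted empirical CDF
  sapply(probs, function(p) y_train[ord][which(cw >= p)[1]])
}
```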

We should no doubt attempt to support true isotonic regression, but there are many other enhancements in the queue. It would probably be a few weeks' work for someone familiar with the code. Unfortunately, yours is about the only interest we've had in monotonic constraints since the feature's introduction in 2016.

Both the GB and XGBoost tools also support monotonic constraints, albeit for gradient-boosted trees. I do not know whether either package enforces the constraints globally. Your best bet may be to try BART, which introduced monotonic constraints last year. Also worth checking out is XBART, a faster BART implementation, which may or may not yet feature monotonicity.

Issue threads seem like a perfectly valid place to have these discussions.

csetzkorn commented 4 years ago

I am surprised that I am the only one who is interested in monotonic constraints (-: Thanks for all your input!

flippercy commented 4 years ago

Monotonicity is super important to me, too. Surprisingly, Rborist is the only available R package that supports this feature in a random forest model. It would be great if this feature could be enforced globally!

chancejohnstone commented 3 years ago

What happens when a non-monotone split is rejected for a monotone-constrained feature?

suiji commented 3 years ago

The split is "awarded" to the whichever of the remaining predictors under consideration, if any, maximize the information criterion.

suiji commented 3 years ago

Because of the way the observations are ordered during a splitting survey, it is not always possible to draw a conclusion about the monotonicity of, say, variable 'a' while attempting to split a different variable, say 'b'. If 'a' and 'b' are highly correlated in either direction, then monotonicity in one of the variables can be enforced while attempting to split the other. In general, though, the way we constrain for monotonicity is local.

Chancellor Johnstone examines some of these issues in his recent dissertation (Iowa State University, 2020). In particular, there is recent work suggesting improvements are possible. Stay tuned.

csetzkorn commented 3 years ago

Just FYI, there is this package, which I have used a bit:

https://cran.r-project.org/web/packages/qrnn/qrnn.pdf

It is an ANN with monotonicity constraints. Might help someone else?