suiji / Arborist

Scalable decision tree training and inference.

Feature request: add control over split point position #24

Closed: talgalili closed this issue 7 years ago

talgalili commented 7 years ago

Based on my work in https://arxiv.org/abs/1611.04561, I learned that the choice of split point within a numerical explanatory variable can have an impact on the prediction of future observations. For example, for the following data: `x = c(1,2,3,4); y = factor(c(1,1,0,0))`, the split point could be placed anywhere between x=2 and x=3. Arborist chooses to place it at 2.5 (which, I believe, is the best point). However, it would be nice to have a switch that lets the user place the split at the left-most location (i.e., x=2). This would enable more research on the benefit of each choice.

For example, this was recently added to partykit (see here: https://github.com/talgalili/partykit)

Any chance to add this feature?

Thanks.

suiji commented 7 years ago

The splitting criterion attempts to approximate a mean rank. For example, if the two observations bounding an optimal split have ranks, say, 7 and 13, then the splitting criterion employs the value of the predictor at rank 10. If the mean "rank" is fractional, Arborist interpolates between the values at the two adjacent ranks: 0.5 * (value at rank i + value at rank i+1).

We can generalize the role of 0.5, above, to values in [0.0, 1.0]. In the above example, a value of 0.16 would select a cut close to rank 8, while a value of 0.25 would select a value interpolated between ranks 8 and 9. Such a scheme would be very easy to implement.
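A minimal sketch of that scheme in R (illustrative only, not Arborist's actual implementation; `cutValue` and its arguments are names invented here):

```r
# Impute a fractional "rank" between the two ranks bounding the split, then
# interpolate a cut value from the predictor values at the adjacent integer
# ranks. 'scale' generalizes the fixed 0.5.
cutValue <- function(values, loRank, hiRank, scale = 0.5) {
  imputed <- loRank + scale * (hiRank - loRank)  # fractional rank in [loRank, hiRank]
  lo <- floor(imputed)
  frac <- imputed - lo
  # values[r + 1] holds the predictor value at 0-based rank r:
  (1 - frac) * values[lo + 1] + frac * values[lo + 2]
}

values <- 0:20                 # toy predictor: the value at rank r is simply r
cutValue(values, 7, 13, 0.5)   # imputed rank 10 (current behavior)
cutValue(values, 7, 13, 0.16)  # imputed rank 7.96, close to rank 8
cutValue(values, 7, 13, 0.25)  # imputed rank 8.5, between ranks 8 and 9
```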

Would that help?

talgalili commented 7 years ago

I am not sure I get it. Say we have the following data: `x = c(1, 2, 10, 11); y = c(1, 1, 0, 0)`

So a middle split will be at x=6 (i.e., for x>6 -> y=0). Is this what your method does? (I believe it is, from attempts I've made before.) The question is: can you add a feature to choose the split to be at the left rank (hence, starting from x>2 -> y=0)? Such a parameter would help people like me run experiments comparing the two methods.

Thanks.

suiji commented 7 years ago

> I am not sure I get it. Say we have the following data: `x = c(1, 2, 10, 11); y = c(1, 1, 0, 0)`. So a middle split will be at x=6 (i.e., for x>6 -> y=0). Is this what your method does? (I believe it is, from attempts I've made before.)

Not quite. The splitting method does not employ predictor values, but rather their ranks. Splitting values are later interpolated from a central rank.

In this example, if the corresponding ranks of 'x' were 0, 1, 2 and 3, then the central rank would lie between 1 and 2, so, indeed, the value of 6 would be chosen. This might not be the case, though, if the ranks over which the split is evaluated are not so well behaved. What might not have been clear is that the ranks in question are ranks with respect to the full data set, not with respect to the sample over which the split is evaluated.
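To illustrate the distinction (a toy sketch, not Arborist internals; `xFull` and `xSample` are invented here):

```r
# Ranks are taken with respect to the full data set, not the bagged sample.
xFull   <- c(1, 2, 3, 5, 8, 10, 11)  # full training column, sorted for clarity
xSample <- c(1, 2, 10, 11)           # bagged sample under evaluation

globalRank <- match(xSample, xFull) - 1  # 0-based global ranks: 0, 1, 5, 6

# The split falls between sample values 2 and 10, i.e. global ranks 1 and 5.
# Central rank = 0.5 * (1 + 5) = 3; the full-data value at rank 3 is:
xFull[3 + 1]  # 5, rather than the sample midpoint 6
```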

> The question is: can you add a feature to choose the split to be at the left rank (hence, starting from x>2 -> y=0)? Such a parameter would help people like me run experiments comparing the two methods.

The approach I suggested in the earlier response would provide an option, call it "splitScale", which allows the split to be placed anywhere between the two bounding ranks. Current behavior is equivalent to "splitScale = 0.5", and would be the default. The behavior you want to test could be achieved by setting "splitScale=0.0". All I'm suggesting is that we extend your idea to allow the split to be placed anywhere in the interval via a scaling factor (i.e., quantile).
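In terms of the `cutValue` sketch above (again purely illustrative; "splitScale" is just the working name proposed in this thread, not a released option):

```r
# Reusing the cutValue() sketch from the earlier comment, on the example
# x = c(1, 2, 10, 11), where the split separates ranks 1 and 2:
x <- c(1, 2, 10, 11)
cutValue(x, 1, 2, scale = 0.5)  # current default: midpoint -> 6
cutValue(x, 1, 2, scale = 0.0)  # "splitScale = 0.0": left bounding value -> 2
```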

talgalili commented 7 years ago

Got it, thanks. I'd be happy with your proposed solution. :) Tal

suiji commented 7 years ago

This should be ready soon. The new option will probably be called "splitQuantile", to help clarify what is actually happening.

In fact, I could probably have saved some confusion by pointing out, somewhere, that the splits are derived by trying to interpolate a median from among the samples, as opposed to the "p_SB" estimator described in the preprint. The gripe I have about "p_SB" is that it does not make use of information - already in hand - concerning the distribution of the splitting predictor. If it is still possible to edit the preprint, please point out that the Arborist uses "p_median", or some such.

I will attempt to make some of this clear in the documentation for "splitQuantile".

talgalili commented 7 years ago

Interesting. I suspect that since you don't have a way to add unsupervised information (at least none that I've seen), your method's estimated median point is practically p_SB (i.e., the midpoint of the range of possible splits given the data). The midpoint coincides with the median/mean if the distribution in that region of the data is unimodal around the midpoint (or simply uniform). Using the "real" median point would require knowledge of the underlying distribution of X, which, in my paper, I claim would be a very good solution. But practical cases where this information is available are not common.

Regardless of what I wrote above, I'd be very happy with your proposed splitQuantile parameter, and would gladly keep tracking this thread until it is added.

With regards, Tal

suiji commented 7 years ago

> I suspect that since you don't have a way to add unsupervised information (at least none that I've seen), your method's estimated median point is practically p_SB (i.e., the midpoint of the range of possible splits given the data).

Not certain I agree with this. Let's use your example of `x <- c(1, 2, 10, 11)`:

If the respective ranks were, say, (0, 1, 11, 12), then a split between values 2 and 10 would impute an intermediate rank equal to 6. The cut point will be assigned whatever value corresponds to a rank of 6. This could be any value in the interval (2.0, 10.0), depending on how 'x' is distributed. In any case, the Arborist simply looks up the value taken at that rank. Further, if the imputed "rank" is fractional, a value can be interpolated. The derived cut point can therefore differ significantly from the midpoint value of 6. Hence the characterization of the splitting method as essentially "p_SB" does not appear to be correct.
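For instance (a toy data set invented for illustration, not Arborist internals):

```r
# With 13 distinct values, 1, 2, 10 and 11 take global ranks 0, 1, 11 and 12:
xFull <- c(1, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6.5, 9, 10, 11)  # sorted

imputed <- 0.5 * (1 + 11)  # central rank between global ranks 1 and 11 -> 6
xFull[imputed + 1]         # value at rank 6 is 4.5, far from the midpoint 6
```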

Although the Arborist does not add unsupervised information, it does apply global information about the predictor's distribution in order to derive a splitting criterion for a localized region.

> I'd be very happy with your proposed splitQuantile parameter, and would gladly keep tracking this thread until it is added.

It has been checked in to GitHub.

talgalili commented 7 years ago

I think I understand you better now. If I do, you are saying that if we had X1 and X2, we might use X1 for the first split in some tree, and then, when (say) making a decision on a split based on X2, we might only have a subset of X2 with a subset of its ranks; in such a case you would use the middle rank. In that case, what you are doing is what I demonstrate in the paper as using the ECDF to estimate the CDF for the split-point interpolation. Based on my simulations, this solution is biased: it does better for splits that are closer to the median of the distribution of X2 and worse for splits that are near the extreme quantiles of X2. Either way, your solution is still much better than p_L or p_R - so it is a reasonable heuristic :)

Tal

suiji commented 7 years ago

> I think I understand you better now. If I do, you are saying that if we had X1 and X2, we might use X1 for the first split in some tree, and then, when (say) making a decision on a split based on X2, we might only have a subset of X2 with a subset of its ranks; in such a case you would use the middle rank.

Yes, this sounds like the same idea. I probably come at the problem from a different perspective, thinking of splitting as an operation on indexed subsets of the bagged samples. The operation of splitting the root set incorporates all ranks of the ECDF, and so is "global": this corresponds to the operation leading to your result on predictor X1. Subsequent splitting operations incorporate only subsets of the bagged samples, and become progressively more "local" further down the tree: X2, X3, &c.

The shortcoming of p_L, p_R and p_SB, IMHO, is that all three derive a cut value by restricting attention to local information. Even p_median, if restricted simply to consideration of local ranks, is likely no better - as you have already pointed out. Arborist, OTOH, employs global ranks (i.e., ranks over the original, pre-bagging values) when evaluating these local cut points, toward the goal of better exploiting the sample distribution of the predictor. Perhaps the Arborist method should be denoted "p_mgr", for "mean global rank".

> In that case, what you are doing is what I demonstrate in the paper as using the ECDF to estimate the CDF for the split-point interpolation. Based on my simulations, this solution is biased: it does better for splits that are closer to the median of the distribution of X2 and worse for splits that are near the extreme quantiles of X2.

Do we have the same method, then? I have not made it all the way through the preprint, and have not read all that you have done. The Arborist has been using p_mgr for about one year, maybe more, but there has not been exhaustive testing of p_mgr versus p_SB.

> Either way, your solution is still much better than p_L or p_R - so it is a reasonable heuristic :)

Yes, a heuristic. Unlike your paper, I have not approached it from a theoretical perspective.

I plan to reimplement "splitQuantile" as a vector, so that predictors may be treated individually. This should not take too long. Technically, we should probably call it "splitSubQuantile", as the mechanism actually selects quantiles within a range of ranks, but that would be a little much.
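Hypothetical per-predictor usage from the R front end (a sketch only: the vector form and the argument name follow this thread, and the released interface may differ, so check the package documentation):

```r
library(Rborist)

# One quantile per predictor: default midpoint behavior everywhere except
# predictor 2, which splits at the left bounding rank.
sq <- rep(0.5, ncol(x))
sq[2] <- 0.0
fit <- Rborist(x, y, splitQuantile = sq)  # argument name per this thread
```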

suiji commented 7 years ago

"splitQuantile" is now implemented as a vector, so that these constraints may be imposed on a per-predictor basis.

This thread is being closed, but please feel free to reopen if there is more to discuss on the topic.