vonjd / OneR

This R package implements the One Rule (OneR) Machine Learning classification algorithm with enhancements for sophisticated handling of numeric data and missing values together with extensive diagnostic functions.

monotonic predictors #15

Open ggrothendieck opened 1 year ago

ggrothendieck commented 1 year ago

In some applications the relationship between a predictor and the target is monotonic, and it does not make sense to have more than one split on such a predictor. I suggest adding an option that limits the model to a single split per predictor.

vonjd commented 1 year ago

Thank you for your suggestion. I am not quite sure whether I understand it correctly. Could you provide a concrete example? Thank you again

ggrothendieck commented 1 year ago

In the following, demand should be increasing in Time, but the model tries to fit the 0:

library(OneR)
BOD[5, 2] <- 0
OneR(demand ~ Time, BOD)

## Call:
## OneR.formula(formula = demand ~ Time, data = BOD)
##
## Rules:
## If Time = (0.994,2.2] then demand = (7.92,11.9]
## If Time = (2.2,3.4]   then demand = (15.8,19.8]
## If Time = (3.4,4.6]   then demand = (15.8,19.8]
## If Time = (4.6,5.8]   then demand = (-0.0198,3.96] <----------------------------
## If Time = (5.8,7.01]  then demand = (15.8,19.8]
##
## Accuracy:
## 6 of 6 instances classified correctly (100%)

Although there are examples of non-monotonic dose-response curves, usually a higher dose leads to a higher response.

Another example is valuation. Suppose we want to predict the price of a house based on the number of bedrooms and other predictors. More bedrooms should lead to a higher valuation.

Usually it is sufficient to guarantee monotonicity without specifying the direction; if the direction turns out to be opposite from what is expected, we can re-examine our assumptions or reject that predictor. A simple way to ensure monotonicity is to allow only one split on each predictor, which I assume would be easy to implement.
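For instance (a hypothetical sketch, not OneR code; oneSplitRule is just an illustrative name), a rule with a single cut point on one predictor is monotone by construction:

# Hypothetical helper, not part of OneR: a single cut point on one
# predictor yields a two-level rule that is monotone by construction.
oneSplitRule <- function(x, cut, low, high) ifelse(x > cut, high, low)

# e.g. an arbitrary cut at Time = 4 on the (modified) BOD data
oneSplitRule(BOD$Time, cut = 4, low = "low demand", high = "high demand")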

vonjd commented 1 year ago

Ok, if I understand you correctly, your question is about the number of splits of the predictors. This should be quite easy to achieve with the bin() function and its nbins argument: specify the number of bins before you use the OneR() function on the resulting data frame (please also consult the documentation of bin() for an example). So, if you only want one split per predictor, you should set nbins = 2.
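For example, with the data from your post (a quick sketch just to show what bin() returns; OneR() is then called on the binned data frame as usual):

library(OneR)
BOD[5, 2] <- 0                 # same modification as in your example
BOD2 <- bin(BOD, nbins = 2)    # every numeric column is cut into two bins
levels(BOD2$Time)              # "(0.994,4]" "(4,7.01]": two equal-length intervals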

Please try this and come back to tell me if it solved your problem. Thank you

ggrothendieck commented 1 year ago

That splits each column into two bins, but it doesn't seem to do so optimally. If we try the previous example, it gets 2 predictions wrong, whereas 0 wrong would be possible with different split points.

BOD[5, 2] <- 0
BOD2 <- bin(BOD, 2)
OneR(demand ~ Time, BOD2)
##
## Call:
## OneR.formula(formula = demand ~ Time, data = BOD2)
##
## Rules:
## If Time = (0.994,4] then demand = (9.9,19.8]
## If Time = (4,7.01]  then demand = (-0.0198,9.9]
##
## Accuracy:
## 4 of 6 instances classified correctly (66.67%)  <-------------------------------

ggrothendieck commented 1 year ago

Just a follow-up to the last post. We compute all possible single increasing splits and find that the split Time > 5 predicting demand > 19 gives 0 wrong predictions.

BOD[5,2] <- 0

# For a candidate split "Time > Time[i]" predicting "demand > demand[j]",
# return the number of misclassified rows; print the split if it gets
# every row right.
noWrong <- function(i, j) {
  x <- BOD$Time > BOD$Time[i]
  y <- BOD$demand > BOD$demand[j]
  if (all(x == y)) cat("i=", i, "j=", j,
    "Time >", BOD$Time[i], "demand >", BOD$demand[j],
    "x=y=", paste(+x, collapse = ""), "\n")
  sum(x != y)
}

BOD
##   Time demand
##  1    1    8.3
##  2    2   10.3
##  3    3   19.0
##  4    4   16.0
##  5    5    0.0
##  6    7   19.8

outer(1:5, 1:5, Vectorize(noWrong))
##       [,1] [,2] [,3] [,4] [,5]
##  [1,]    1    2    4    3    2
##  [2,]    2    1    3    2    3
##  [3,]    3    2    2    3    4
##  [4,]    4    3    1    2    5
##  [5,]    3    2    0    1    4

noWrong(5, 3)
## i= 5 j= 3 Time > 5 demand > 19 x=y= 000001 
## [1] 0
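
So the option I have in mind could, for each predictor, simply pick the single cut point with the fewest misclassifications. A minimal sketch of that idea (bestSingleSplit is just an illustrative name, not an OneR function):

# Hypothetical sketch, not OneR code: for a numeric predictor x and a
# logical target y, try every observed value as the single cut point
# and keep the one that misclassifies the fewest rows.
bestSingleSplit <- function(x, y) {
  cuts <- sort(unique(x))
  errs <- sapply(cuts, function(ct) sum((x > ct) != y))
  list(cut = cuts[which.min(errs)], errors = min(errs))
}

bestSingleSplit(BOD$Time, BOD$demand > 19)  # cut = 5 with 0 errors, matching the result above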