RWeka classifier hyperparameters

topepo / caret

caret (Classification And Regression Training) R package that contains misc functions for training and plotting classification and regression models

http://topepo.github.io/caret/index.html

1.62k stars 632 forks source link

RWeka classifier hyperparameters #468

Closed dashaub closed 8 years ago

dashaub commented 8 years ago

Is there a reason that some of the RWeka classifiers only have one specified hyperparameter? The LMT model has multiple available values for hyperparemeters for grid/random search while others do not. For example, J48 has C = 0.25, PART has threshold = 0.25 and pruned = "yes", and JRip has NumOpt = 1:len.

library(caret)
data(iris)
library(RWeka)

# No other hyperparameter values, no random search
getModelInfo("J48")

# No other hyperparameter values, no random search
getModelInfo("PART")

# This needs sampling for random search
getModelInfo("JRip")
# Also it appears NumOpt it now named O in RWeka
RWeka::WOW("JRip")

# Other values of these hyperparameters run when called directly through RWeka

# J48 called through RWeka
j48c1 <- J48(Species ~ ., data = iris, control = Weka_control(C = 0.2))
j48c2 <- J48(Species ~ ., data = iris, control = Weka_control(C = 0.25))

# JRip called through RWeka
jripc1 <- JRip(Species ~ ., data = iris, control = Weka_control(O = 3))
jripc2 <- JRip(Species ~ ., data = iris, control = Weka_control(O = 100))

# PART called through RWeka
partc1 <- PART(Species ~ ., data = iris, control = Weka_control(threshold = 0.25, pruned = "yes"))
partc2 <- PART(Species ~ ., data = iris, control = Weka_control(threshold = 0.2, pruned = "no"))
partc3 <- PART(Species ~ ., data = iris, control = Weka_control(threshold = 0.25, pruned = "yes"))
partc4 <- PART(Species ~ ., data = iris, control = Weka_control(threshold = 0.2, pruned = "no"))

I can do some testing on what are decent values for these hyperparameters and put together a PR if necessary.

topepo commented 8 years ago

I don't think that the confidence values really does anything (useful). Feel free to make a PR though; I'd be happy to have more parameters for these models.

dashaub commented 8 years ago

It looks like it might do something on the spam dataset:

library(kernlab)
data(spam)
set.seed(34)
possiblecValue <- seq(from = 0.001, to = 0.5, length.out = 100000)
numModels <- 50
cValue <- sample(possiblecValue, numModels)
pctCorrect <- MAE <- Kappa <- rep(0, numModels)
for(i in 1:numModels){
  print(i)
  j48Mod <- J48(type ~ ., data = spam, control = Weka_control(C = cValue[i]))
  evaluation <- evaluate_Weka_classifier(j48Mod)$details
  pctCorrect[i] <- evaluation["pctCorrect"]
  Kappa[i] <- evaluation["kappa"]
  MAE[i] <- evaluation["meanAbsoluteError"]
}

plot(cValue, pctCorrect)
plot(cValue, MAE)
plot(cValue, Kappa)

On iris it looks like there is merely an important hard threshold where the models change:

set.seed(34)
possiblecValue <- seq(from = 0.001, to = 0.5, length.out = 100000)
numModels <- 500
cValue <- sample(possiblecValue, numModels)
pctCorrect <- MAE <- Kappa <- rep(0, numModels)
for(i in 1:numModels){
  print(i)
  j48Mod <- J48(Species ~ ., data = iris, control = Weka_control(C = cValue[i]))
  evaluation <- evaluate_Weka_classifier(j48Mod)$details
  pctCorrect[i] <- evaluation["pctCorrect"]
  Kappa[i] <- evaluation["kappa"]
  MAE[i] <- evaluation["meanAbsoluteError"]
}

plot(cValue, pctCorrect)
plot(cValue, MAE)
plot(cValue, Kappa)

This behavior might be related to how it treats three-class vs two-class classification problems. I'll look into it more.

dashaub commented 8 years ago

I've taken a look at the tuning parameters for these models; you are correct that most of them don't do anything for model performance (e.g. change output formatting, collapse the tree nodes when possible, print debug information, etc). There are a couple additional tunable parameters that do change the model's predictions. The useful/error free ranges sometimes depend on the dataset, e.g. 1:nrow(trainData) or 1:floor(nrow(trainData) / 2). How would I incorporate these dynamic ranges into the grid function used in caret?

dashaub commented 8 years ago

D'oh! x and y are right there in the function header for grid. Ok, I'll work on putting this together.

dashaub commented 8 years ago

How do you feel about hardcoded upper limits on parameters (such as nrounds for "xgboost")? The NumOpt is currently set dynamically by 1:len and a new NumFolds parameter could be set similarly, but realistically the optimal value on a couple of test sets (iris, spam, and one simulated with twoClassSim()) is <= 25. Large values for NumOpt also take much longer to fit.

Finally, the current method will produce lots of accuracy NA values when NumOpt >= nrow(trainingData), e.g. train(Species ~ ., data = iris, method = "JRip", tuneLength = 1000 or when a training CV fold has fewer observations than NumOpt; when fitting many models like this on a smaller dataset, only a small fraction of the fitted models will actually train properly.

topepo commented 8 years ago

What model are we talking about? It would help of you mapped them out, e.g.

J48: confidence value C with range (0, .5]
PART: threshold [?, .25], pruned ["yes", "no"]

etc.

In general, there's no issue with having hardcoded limits when they make sense. I've tried to keep the ranges when search = "grid" to be a little narrow in comparison to when search = "random".

dashaub commented 8 years ago

This was in reference to JRip. You're right that most of the tuning parameters on J48 and PART don't seem to do much. I'm not done testing, but I'm seeing the greatest variability of tuning parameters on model accuracy with JRip.

dashaub commented 8 years ago

Here are some results on the spam dataset in "kernlab" for the tuning parameters. All results display 10-fold CV Kappa vs the relevant tuning parameter with all others set to default. In general, it looks like there are two types of behaviors: responses that are insensitive (or too noisy to determine the pattern from a single CV procedure) or tuning parameters that have a high upper limit on allowing the model to run but have poor model performance in those regions and probably need a much lower, possibly hardcoded limit. I'm investigating this latter case in more detail on different datasets. It would be nice to set these hardcoded ranges "wide" enough to allow optimal models on many datasets while also not setting it too wide so that the density of explored values is lower and we're wasting lots of training iterations for most datasets in poor-performing regions. Of course users could use grid search, but these look like a case study for where random search is useful. j48--c This appears quite insensitive and noisy from the CV point estimates. j48--m Like many of these models, the M parameter here has a running upper bound around nrow(trainData) / 2 but in practice we might want to hardcode this to something lower like min(nrow(trainData) / 2, 100)

jrip--n Similar to above M for J48. The models fail for values greater than around nrow(trainData) / 6. Realistically values greater than 100 probably aren't good on most datasets. I'm following up on this with more datasets. jrip--o Similar to above C for JRip. Larger values of O increase runtime (should be a linear increase). An upper hardcoded limit like 50 or 100 could make sense here. I'm following up on this with more datasets. jrip--f Models fail around nrow(trainData) / 2. An upper hardcoded limit like 50 or 100 could make sense here. I'm following up on this with more datasets. part--m Models rail around nrow(trainData) / 2 but realistically should be probably capped much lower. I'm following up on this with more datasets. part--c Another insensitive/noisy parameter

Next step is trying this on another, much larger(e.g. 10^5 rows) training set simulated from twoClassSim() and a smaller set like iris()

dashaub commented 8 years ago

Furthermore, setting up sampling for many of these on a log scale could make sense.

topepo commented 8 years ago

-M (minimum number of instances). Quinlan's models always set this really really low (to me). I would suggest using a range of 3 to 50. That's daily consistent with gbm, rpart and some other models.
-C (confidence value) [0, .5] is a good wide range but I wouldn't expect much (as you see).

I'd never seen -F ("number of folds used for REP"). I don't even know what that means.

dashaub commented 8 years ago

The iris and BreastCancer(from "mlbench") dataset perform well with -M set to 1, so setting this to a wider 1:50 could make sense for some datasets--probably small ones.
Agreed on -C--That one has an easy range to choose
The description for -F in WOW("JRip") isn't that great. The best I can gather is it has to do with reduced error pruning and setting the number of folds in the IREP algorithm: http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/JRip.html This description also appears relevant, but I haven't read deeply into the description of these algorithms: https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/JRip#Synopsis For reference, the "mlr" package also sets up -F as a tuning parameter. It doesn't appear to do much, outside of when the value is large, in which case is kills model performance. It could make sense to put this in a tight range like 1:50. See additional example graphs below.

Here are some results on additional datasets.

part--twoclasssim--m The other models failed on a large 100k row synthetic dataset generated from twoClassSim(), so this is the only result I have on this dataset. part--breastcancer--m part--iris--m Setting -F to 1:50 would work well on all four datasets tested.

j48--breastcancer--m j48--iris--m Using -M set to 1:50 looks to capture any worthwhile model on the spam, iris, and BreastCancer datasets. This also looks like a decent range for -F and -N jrip--breastcancer--f jrip--iris--f jrip--breastcancer--n jrip--iris--n The -O parameter looks noisy with a possible very slight upward trend between [1, 50]. Since larger values take longer to fit, a small range like 1:25 seems sensible. jrip--breastcancer--o jrip--iris--o

Finally, in addition to the hardcoded limits suggested above, some of these parameters should have their ranges restricted based on the dimensions of the data to protect from "corner cases" where small datasets are fed in and there aren't enough observations to support some of the larger tuning parameter values. This could be achieved through something like min(50, round(nrow(trainData)/2)), but I suspect the exact formula would depend on 1) the class balance and 2) the number of classes.

It's difficult to setup a tuning range that simultaneously 1) includes all plausible parameter values 2) is as small as possible (excludes many unplausible models and wasted computation) and 3) does not include any models that entirely fail to fit. Automating the search through the range is of course the point of "caret", so I'm inclined to believe including all plausible ranges takes the highest priority. Thoughts? I'm thinking since the min solution above will catch many of these cases it is probably ok without engineering a complicated tuneGrid based on class balance, number of classes, etc. For my personal workflow, I care a lot about failed models on large datasets (where fitting is expensive and I'm not trying as many models) whereas on small datasets I'd be fine to set a huge tuneLength since fitting is cheap and I can't rely on the models having the luxury of easily learning the function space through a large number of observations. Thoughts on this?

dashaub commented 8 years ago

Followup:

evaluate_Weka_classifier() looks to have a really suspicious default seeding and sampling method for CV: it appears to use very few unique seeds. If doing this again, writing bespoke fitting and CV looks preferable, but for an exploratory analysis of the ranges, hopefully this isn't an issue.
The "RWeka" learners don't seem to scale well to large datasts--both in terms of memory usage and runtime--so most of the tuning range considerations seem only relevant for small to mediums (e.g. nrow(trainData) < 100000) data.

dashaub commented 8 years ago

I put together PR #477 with this. The models with the new tune grids decent better predictive performance improvements over the original tune grids for both random and grid search.

#Current caret
library(caret)
set.seed(5)
numModels <- 100
a <- twoClassSim(100, linearVars = 3)
ctrlR <- trainControl(method = "repeatedcv",
                     number = 10, repeats = 5,
                     search = "random", verboseIter = TRUE)
ctrlG <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "grid", verboseIter = TRUE) # Fit PART with random and grid search
set.seed(5)
partModR <- train(Class ~ ., data = a, method = "PART",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
partModG <- train(Class ~ ., data = a, method = "PART",
                  tuneLength = numModels, trControl = ctrlG) # Fit J48 with random and grid search
set.seed(5)
j48ModR <- train(Class ~ ., data = a, method = "J48",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
j48ModG <- train(Class ~ ., data = a, method = "J48",
                  tuneLength = numModels, trControl = ctrlG)

# Fit JRip with random and grid search
set.seed(5)
jripModR <- train(Class ~ ., data = a, method = "JRip",
                 tuneLength = numModels, trControl = ctrlR)
set.seed(5)
jripModG <- train(Class ~ ., data = a, method = "JRip",
                 tuneLength = numModels, trControl = ctrlG)

# These were fit with the new caret changes
set.seed(5)
numModels <- 100
a <- twoClassSim(100, linearVars = 3)
ctrlR <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "random", verboseIter = TRUE) ctrlG <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "grid", verboseIter = TRUE)
set.seed(5)
partModR2 <- train(Class ~ ., data = a, method = "PART",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
partModG2 <- train(Class ~ ., data = a, method = "PART",
                  tuneLength = numModels, trControl = ctrlG) # Fit J48 with random and grid search
set.seed(5)
j48ModR2 <- train(Class ~ ., data = a, method = "J48",
                 tuneLength = numModels, trControl = ctrlR)
set.seed(5)
j48ModG2 <- train(Class ~ ., data = a, method = "J48",
                 tuneLength = numModels, trControl = ctrlG)

# Fit JRip with random and grid search
set.seed(5)
jripModR2 <- train(Class ~ ., data = a, method = "JRip",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
jripModG2 <- train(Class ~ ., data = a, method = "JRip",
                  tuneLength = numModels, trControl = ctrlG)

And the results

> getTrainPerf(partModG)
  TrainAccuracy TrainKappa method
1     0.7291919  0.4495748   PART
> getTrainPerf(partModR)
  TrainAccuracy TrainKappa method
1     0.7291919  0.4495748   PART
> getTrainPerf(j48ModG)
  TrainAccuracy TrainKappa method
1         0.736   0.468541    J48
> getTrainPerf(j48ModR)
  TrainAccuracy TrainKappa method
1         0.736   0.468541    J48
> getTrainPerf(jripModG)
  TrainAccuracy TrainKappa method
1     0.7867273  0.5672454   JRip
> getTrainPerf(jripModR)
  TrainAccuracy TrainKappa method
1     0.7867273  0.5672454   JRip

> getTrainPerf(partModG2)
  TrainAccuracy TrainKappa method
1     0.7308283  0.4530283   PART
> getTrainPerf(partModR2)
  TrainAccuracy TrainKappa method
1     0.7308283  0.4530283   PART
> getTrainPerf(j48ModG2)
  TrainAccuracy TrainKappa method
1     0.7559394  0.5051134    J48
> getTrainPerf(j48ModR2)
  TrainAccuracy TrainKappa method
1     0.7956162  0.5740688    J48
> getTrainPerf(jripModG2)
  TrainAccuracy TrainKappa method
1     0.7988485  0.5935932   JRip
> getTrainPerf(jripModR2)
  TrainAccuracy TrainKappa method
1      0.795596  0.5822646   JRip