I don't think that the confidence value really does anything (useful). Feel free to make a PR, though; I'd be happy to have more parameters for these models.
It looks like it might do something on the spam dataset:
library(RWeka)
library(kernlab)
data(spam)
set.seed(34)
possiblecValue <- seq(from = 0.001, to = 0.5, length.out = 100000)
numModels <- 50
cValue <- sample(possiblecValue, numModels)
pctCorrect <- MAE <- Kappa <- rep(0, numModels)
for (i in 1:numModels) {
  print(i)
  # Fit J48 with the sampled confidence value and record the evaluation summary
  j48Mod <- J48(type ~ ., data = spam, control = Weka_control(C = cValue[i]))
  evaluation <- evaluate_Weka_classifier(j48Mod)$details
  pctCorrect[i] <- evaluation["pctCorrect"]
  Kappa[i] <- evaluation["kappa"]
  MAE[i] <- evaluation["meanAbsoluteError"]
}
plot(cValue, pctCorrect)
plot(cValue, MAE)
plot(cValue, Kappa)
On iris, it looks like there is simply a hard threshold where the models change:
set.seed(34)
possiblecValue <- seq(from = 0.001, to = 0.5, length.out = 100000)
numModels <- 500
cValue <- sample(possiblecValue, numModels)
pctCorrect <- MAE <- Kappa <- rep(0, numModels)
for (i in 1:numModels) {
  print(i)
  j48Mod <- J48(Species ~ ., data = iris, control = Weka_control(C = cValue[i]))
  evaluation <- evaluate_Weka_classifier(j48Mod)$details
  pctCorrect[i] <- evaluation["pctCorrect"]
  Kappa[i] <- evaluation["kappa"]
  MAE[i] <- evaluation["meanAbsoluteError"]
}
plot(cValue, pctCorrect)
plot(cValue, MAE)
plot(cValue, Kappa)
This behavior might be related to how it treats three-class vs two-class classification problems. I'll look into it more.
I've taken a look at the tuning parameters for these models; you are correct that most of them don't do anything for model performance (e.g. they change output formatting, collapse tree nodes when possible, print debug information, etc.). There are a couple of additional tunable parameters that do change the model's predictions. The useful/error-free ranges sometimes depend on the dataset, e.g. 1:nrow(trainData) or 1:floor(nrow(trainData) / 2). How would I incorporate these dynamic ranges into the grid function used in caret?
D'oh! x and y are right there in the function header for grid. Ok, I'll work on putting this together.
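For concreteness, here is a minimal sketch (not the actual PR code) of a grid function that derives a data-dependent upper bound from x; the NumOpt column name and the cap of 50 are illustrative assumptions:
grid <- function(x, y, len = NULL, search = "grid") {
  # Cap the candidate values using the number of training rows (assumed hard cap of 50)
  upper <- max(2, min(50, floor(nrow(x) / 2)))
  if (is.null(len)) len <- 3
  data.frame(NumOpt = unique(round(seq(1, upper, length.out = len))))
}
# e.g. grid(x = iris[, 1:4], y = iris$Species, len = 10)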
How do you feel about hardcoded upper limits on parameters (such as nrounds for "xgboost")? The NumOpt parameter is currently set dynamically by 1:len, and a new NumFolds parameter could be set similarly, but realistically the optimal value on a couple of test sets (iris, spam, and one simulated with twoClassSim()) is <= 25. Large values for NumOpt also take much longer to fit. Finally, the current method will produce lots of NA accuracy values when NumOpt >= nrow(trainingData), e.g. train(Species ~ ., data = iris, method = "JRip", tuneLength = 1000), or when a training CV fold has fewer observations than NumOpt; when fitting many models like this on a smaller dataset, only a small fraction of the fitted models will actually train properly.
What model are we talking about? It would help if you mapped them out, e.g.:
J48: confidence value C with range (0, .5]
PART: threshold [?, .25], pruned ["yes", "no"]
etc.
In general, there's no issue with having hardcoded limits when they make sense. I've tried to keep the ranges when search = "grid" a little narrower than when search = "random".
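For illustration only (not caret's actual internals), a grid function can hand back a deliberately narrower set for search = "grid" and a wider draw for search = "random"; the bounds below are assumptions:
grid <- function(x, y, len = NULL, search = "grid") {
  if (is.null(len)) len <- 3
  if (search == "grid") {
    # narrower, evenly spaced values for grid search
    data.frame(C = seq(0.01, 0.25, length.out = len))
  } else {
    # wider range sampled for random search
    data.frame(C = runif(len, min = 0.001, max = 0.5))
  }
}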
This was in reference to JRip. You're right that most of the tuning parameters on J48 and PART don't seem to do much. I'm not done testing, but I'm seeing the greatest effect of the tuning parameters on model accuracy with JRip.
Here are some results on the spam dataset in "kernlab" for the tuning parameters. All results display 10-fold CV Kappa vs the relevant tuning parameter with all others set to their defaults. In general, there seem to be two types of behavior: parameters that are insensitive (or too noisy to show a pattern from a single CV procedure), and parameters that allow the model to run up to a high upper limit but perform poorly in those regions and probably need a much lower, possibly hardcoded limit. I'm investigating the latter case in more detail on different datasets. It would be nice to set these hardcoded ranges wide enough to allow optimal models on many datasets, but not so wide that the density of explored values drops and we waste lots of training iterations in poor-performing regions on most datasets. Of course users could use grid search, but these look like a case study for where random search is useful.
This appears quite insensitive and noisy from the CV point estimates.
Like many of these models, the M parameter here has a running upper bound around nrow(trainData) / 2, but in practice we might want to hardcode this to something lower like min(nrow(trainData) / 2, 100).
Similar to M for J48 above. The models fail for values greater than around nrow(trainData) / 6. Realistically, values greater than 100 probably aren't good on most datasets. I'm following up on this with more datasets.
Similar to C for JRip above. Larger values of O increase runtime (the increase should be roughly linear). An upper hardcoded limit like 50 or 100 could make sense here. I'm following up on this with more datasets.
Models fail around nrow(trainData) / 2. An upper hardcoded limit like 50 or 100 could make sense here. I'm following up on this with more datasets.
Models fail around nrow(trainData) / 2 but realistically should probably be capped much lower. I'm following up on this with more datasets.
Another insensitive/noisy parameter.
Next step is trying this on another, much larger (e.g. 10^5 rows) training set simulated from twoClassSim() and a smaller set like iris. Furthermore, setting up sampling for many of these on a log scale could make sense.
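As a rough sketch of what log-scale sampling could look like for an integer-valued parameter (the [1, 100] range is only an assumption):
set.seed(1)
len <- 25
# Sample uniformly on the log scale so small values are explored as densely as large ones
candidates <- sort(unique(round(exp(runif(len, min = log(1), max = log(100))))))
candidates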
-M (minimum number of instances): Quinlan's models always set this really, really low (to me). I would suggest using a range of 3 to 50. That's fairly consistent with gbm, rpart, and some other models.
-C (confidence value): [0, .5] is a good wide range, but I wouldn't expect much (as you see).
I'd never seen -F ("number of folds used for REP"). I don't even know what that means.
iris and BreastCancer (from "mlbench") perform well with -M set to 1, so setting this to a wider 1:50 could make sense for some datasets, probably small ones.
-C: that one has an easy range to choose.
The description of -F in WOW("JRip") isn't that great. The best I can gather is that it has to do with reduced error pruning and setting the number of folds in the IREP algorithm:
http://weka.sourceforge.net/doc.dev/weka/classifiers/rules/JRip.html
This description also appears relevant, but I haven't read deeply into the description of these algorithms:
https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/JRip#Synopsis
For reference, the "mlr" package also sets up -F as a tuning parameter. It doesn't appear to do much, except when the value is large, in which case it kills model performance. It could make sense to put this in a tight range like 1:50. See additional example graphs below.
Here are some results on additional datasets.
The other models failed on a large 100k-row synthetic dataset generated from twoClassSim(), so this is the only result I have on this dataset.
Setting -F to 1:50 would work well on all four datasets tested.
Using -M set to 1:50 looks to capture any worthwhile model on the spam, iris, and BreastCancer datasets. This also looks like a decent range for -F and -N.
The -O parameter looks noisy with a possible very slight upward trend between [1, 50]. Since larger values take longer to fit, a small range like 1:25 seems sensible.
Finally, in addition to the hardcoded limits suggested above, some of these parameters should have their ranges restricted based on the dimensions of the data to protect from "corner cases" where small datasets are fed in and there aren't enough observations to support some of the larger tuning parameter values. This could be achieved through something like min(50, round(nrow(trainData)/2)), but I suspect the exact formula would depend on 1) the class balance and 2) the number of classes.
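A minimal sketch of that guard; the smallest-class term is only an assumption about how class balance could enter the formula:
upper_bound <- function(x, y, hard_cap = 50) {
  # hard cap, row-count cap, and smallest class size (class-balance assumption)
  min(hard_cap, floor(nrow(x) / 2), min(table(y)))
}
# upper_bound(iris[, 1:4], iris$Species) returns 50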
It's difficult to set up a tuning range that simultaneously 1) includes all plausible parameter values, 2) is as small as possible (excludes many implausible models and wasted computation), and 3) does not include any models that fail entirely to fit. Automating the search through the range is of course the point of "caret", so I'm inclined to believe that including all plausible ranges takes the highest priority. Thoughts? I'm thinking that since the min solution above will catch many of these cases, it is probably ok without engineering a complicated tuneGrid based on class balance, number of classes, etc. For my personal workflow, I care a lot about failed models on large datasets (where fitting is expensive and I'm not trying as many models), whereas on small datasets I'd be fine setting a huge tuneLength since fitting is cheap and I can't rely on the models having the luxury of easily learning the function space from a large number of observations. Thoughts on this?
Followup: evaluate_Weka_classifier() looks to have a really suspicious default seeding and sampling method for CV: it appears to use very few unique seeds. If doing this again, writing bespoke fitting and CV code looks preferable, but for an exploratory analysis of the ranges on smaller (nrow(trainData) < 100000) data, hopefully this isn't an issue.
I put together PR #477 with this. The models with the new tune grids show decent predictive performance improvements over the original tune grids for both random and grid search.
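As an aside on the seeding point above, evaluate_Weka_classifier() does accept explicit numFolds and seed arguments, so a more controlled CV run could look roughly like this:
library(RWeka)
j48Mod <- J48(Species ~ ., data = iris)
# 10-fold CV on the Weka side with an explicit seed rather than the default
evaluate_Weka_classifier(j48Mod, numFolds = 10, seed = 123)$details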
# Current caret
library(caret)
set.seed(5)
numModels <- 100
a <- twoClassSim(100, linearVars = 3)
ctrlR <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "random", verboseIter = TRUE)
ctrlG <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "grid", verboseIter = TRUE)
# Fit PART with random and grid search
set.seed(5)
partModR <- train(Class ~ ., data = a, method = "PART",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
partModG <- train(Class ~ ., data = a, method = "PART",
                  tuneLength = numModels, trControl = ctrlG)
# Fit J48 with random and grid search
set.seed(5)
j48ModR <- train(Class ~ ., data = a, method = "J48",
                 tuneLength = numModels, trControl = ctrlR)
set.seed(5)
j48ModG <- train(Class ~ ., data = a, method = "J48",
                 tuneLength = numModels, trControl = ctrlG)
# Fit JRip with random and grid search
set.seed(5)
jripModR <- train(Class ~ ., data = a, method = "JRip",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
jripModG <- train(Class ~ ., data = a, method = "JRip",
                  tuneLength = numModels, trControl = ctrlG)
# These were fit with the new caret changes
set.seed(5)
numModels <- 100
a <- twoClassSim(100, linearVars = 3)
ctrlR <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "random", verboseIter = TRUE)
ctrlG <- trainControl(method = "repeatedcv",
                      number = 10, repeats = 5,
                      search = "grid", verboseIter = TRUE)
# Fit PART with random and grid search
set.seed(5)
partModR2 <- train(Class ~ ., data = a, method = "PART",
                   tuneLength = numModels, trControl = ctrlR)
set.seed(5)
partModG2 <- train(Class ~ ., data = a, method = "PART",
                   tuneLength = numModels, trControl = ctrlG)
# Fit J48 with random and grid search
set.seed(5)
j48ModR2 <- train(Class ~ ., data = a, method = "J48",
                  tuneLength = numModels, trControl = ctrlR)
set.seed(5)
j48ModG2 <- train(Class ~ ., data = a, method = "J48",
                  tuneLength = numModels, trControl = ctrlG)
# Fit JRip with random and grid search
set.seed(5)
jripModR2 <- train(Class ~ ., data = a, method = "JRip",
                   tuneLength = numModels, trControl = ctrlR)
set.seed(5)
jripModG2 <- train(Class ~ ., data = a, method = "JRip",
                   tuneLength = numModels, trControl = ctrlG)
And the results:
> getTrainPerf(partModG)
TrainAccuracy TrainKappa method
1 0.7291919 0.4495748 PART
> getTrainPerf(partModR)
TrainAccuracy TrainKappa method
1 0.7291919 0.4495748 PART
> getTrainPerf(j48ModG)
TrainAccuracy TrainKappa method
1 0.736 0.468541 J48
> getTrainPerf(j48ModR)
TrainAccuracy TrainKappa method
1 0.736 0.468541 J48
> getTrainPerf(jripModG)
TrainAccuracy TrainKappa method
1 0.7867273 0.5672454 JRip
> getTrainPerf(jripModR)
TrainAccuracy TrainKappa method
1 0.7867273 0.5672454 JRip
> getTrainPerf(partModG2)
TrainAccuracy TrainKappa method
1 0.7308283 0.4530283 PART
> getTrainPerf(partModR2)
TrainAccuracy TrainKappa method
1 0.7308283 0.4530283 PART
> getTrainPerf(j48ModG2)
TrainAccuracy TrainKappa method
1 0.7559394 0.5051134 J48
> getTrainPerf(j48ModR2)
TrainAccuracy TrainKappa method
1 0.7956162 0.5740688 J48
> getTrainPerf(jripModG2)
TrainAccuracy TrainKappa method
1 0.7988485 0.5935932 JRip
> getTrainPerf(jripModR2)
TrainAccuracy TrainKappa method
1 0.795596 0.5822646 JRip
Is there a reason that some of the RWeka classifiers only have one specified hyperparameter? The LMT model has multiple available values for hyperparameters for grid/random search while others do not. For example, J48 has C = 0.25, PART has threshold = 0.25 and pruned = "yes", and JRip has NumOpt = 1:len. I can do some testing on what decent values for these hyperparameters are and put together a PR if necessary.
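For anyone following along, the tuning parameters caret currently registers for these methods can be listed directly with getModelInfo():
library(caret)
# Each element's "parameters" slot lists the tunable parameters caret exposes
getModelInfo("J48")$J48$parameters
getModelInfo("PART")$PART$parameters
getModelInfo("JRip")$JRip$parameters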