Closed MariaErdmann closed 4 years ago
Classification learners (and regression if there is the same bug in both)
lrn = makeLearner("classif.avNNet", par.vals = list(repeats = 5))
mod = train(lrn, binaryclass.task)
lrn = makeLearner("classif.avNNet", par.vals = list(bag = FALSE))
mod = train(lrn, binaryclass.task)
# Problem is this part in the learner's file line 36 - 40
# ind = stri_detect_regex(nms, "repeats") --> produces a vector with one TRUE and rest is FALSE
# if (sum(ind)>0) --> is TRUE
# repeats = .learner$par.vals[[ind]] --> one [] is sufficient
# ind = stri_detect_regex(nms, "bag")
# if (sum(ind)>0)
# bag = .learner$par.vals[[ind]]
# the same problem with the regr-learner
lrn = makeLearner("classif.binomial")
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
mod2 = train(lrn2, binaryclass.task)
# Problem is here: we need to add model into the signature of the train function
lrn = makeLearner("classif.blackboost")
mod = train(lrn, binaryclass.task)
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
mod2 = train(lrn2, task)
# There are 2 problems here
# 1) the dots passed to
# tc = learnerArgsToControl(party::ctree_control, teststat, testtype, mincriterion, maxdepth,
# savesplitstats, ...) in line 51 of the learner
# 2) are some parameters in the learner that cannot be passed to blackboost, including minsplit,
# minbucket, stump, nresample, maxusrrogate, mtry
lrn = makeLearner("classif.bst")
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
mod = train(lrn2, binaryclass.task)
# maxcompete is missing in train function's singature and in the control.tree list
# Moreover, I am not sure if all these rpart tree control parameters are passed
# The bst code does not indicate to do so
[x] classif.cvglmnet and classif.glmnet and regr.glmnet
lrn = makeLearner("classif.cvglmnet")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
mod = train(lrn2, binaryclass.task)
# SimpleError in glmnet(x, y, weights = weights, offset = offset, lambda = lambda, ...):
# unused argument (factory = FALSE)
# Problem is that we use
glmnet::glmnet.control()
# and there factory is missing. However, you can find it in the formals
formals(glmnet::glmnet.control)
# probably we can use formals(glmnet::glmnet.control) in line 56 to start with
# but some more adaptations need to be done here
[x] regr.glmboost and classif.glmboost (merged) The default for parameter 'center' is wrong. It should be 'TRUE' not 'FALSE'
lrn = makeLearner("regr.glmboost")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod1 = train(lrn, task)
set.seed(2); mod2 = train(lrn2,task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))
defs$center = TRUE lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod1 = train(lrn, task) set.seed(2); mod2 = train(lrn2,task) p1 = predict(mod1, task) p2 = predict(mod2, task) all.equal(performance(p1), performance(p2))
- [ ] classif.gausspr and regr.gausspr
task = binaryclass.task lrn = makeLearner("classif.gausspr") defs = getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, task)
kernlab::gausspr(getTaskFormula(task), data = getTaskData(task), scaled = TRUE)
kernlab::gausspr(getTaskFormula(task), data = getTaskData(task))
- [ ] classif.h20.deeplearning and regr.h2o.deeplearning
lrn = makeLearner("classif.h2o.deeplearning") rin = makeResampleInstance("Holdout", task = binaryclass.task) set.seed(1); m1 = train(lrn, task) defs <- getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, task)
defs$hidden_dropout_ratios <- 2 lrn2 <- setHyperPars2(lrn, defs) mod = train(lrn2, task)
- [x] classif.hdrda (merged)
lrn = makeLearner("classif.hdrda") rin = makeResampleInstance("Holdout", task = binaryclass.task) set.seed(1); m1 = train(lrn, binaryclass.task) defs <- getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, binaryclass.task)
- [ ] classif.multinom
lrn = makeLearner("classif.multinom", par.vals = list(rang = 0.3)) mod = train(lrn, binaryclass.task)
- [ ] classif.earth
lrn = makeLearner("classif.earth") defs = getDefaults(lrn$par.set) defs$pmethod
lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, binaryclass.task)
- [x] classif.nodeHarvest (merged)
lrn = makeLearner("classif.nodeHarvest") set.seed(1); m1 = train(lrn, binaryclass.task) lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set)) set.seed(1); m2 = train(lrn2, binaryclass.task) p1 = predict(m1, binaryclass.task) p2 = predict(m2, binaryclass.task) all.equal(performance(p1), performance(p2))
- [ ] classif.penalized.fusedlasso
lrn = makeLearner("classif.penalized.fusedlasso") mod = train(lrn, binaryclass.task) defs = getDefaults(lrn$par.set) defs$model
lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, binaryclass.task)
- [ ] classif.rknn
lrn = makeLearner("classif.rknn") defs = getDefaults(lrn$par.set) set.seed(1); m1 = train(lrn, binaryclass.task) lrn2 = setHyperPars2(lrn, defs) set.seed(1); m2 = train(lrn, binaryclass.task) p1 = predict(m1, binaryclass.task) p2 = predict(m2, binaryclass.task) all.equal(performance(p1), performance(p2))
rin = makeResampleInstance("Holdout", task = binaryclass.task) set.seed(1); m1b = train(lrn, binaryclass.task, subset = rin$train.inds[[1]]) set.seed(1); m2b = train(lrn2, binaryclass.task, subset = rin$train.inds[[1]]) p1b = predict(m1b, binaryclass.task, subset = rin$test.inds[[1]]) p2b = predict(m2b, binaryclass.task, subset = rin$test.inds[[1]]) all.equal(performance(p1b), performance(p2b))
**Regression learners**
- [ ] regr.bartMachine
I do not have an idea what the problem is
````task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.bartMachine")
defs = getDefaults(lrn$par.set)
set.seed(1); mod = train(lrn, task)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod2 = train(lrn2, task)
d = getTaskData(task, target.extra = TRUE)
pars = list(X = d$data, y = d$target)
pars = c(pars, lrn2$par.vals)
p = do.call(bartMachine::build_bart_machine, pars)
task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.crs")
set.seed(1); mod1 = train(lrn, task)
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
set.seed(1); mod2 = train(lrn2, task)
# if degree is set to default = 3 (which is also the default of the crs help)
crs::crs(formula = getTaskFormula(task), data = getTaskData(task), degree = 3)
# we get the same error
# but we need to pass a vector of degrees with as many entries as there are
# numeric predictors in the data (in this example: 2 numeric covariates)
crs::crs(formula = getTaskFormula(task), data = getTaskData(task), degree = c(3,3))
We have the wrong default for parameter k
. (mlr's default 1
vs. FNN::knn.reg
's default: 3
)
if we change this everything is fine
lrn = makeLearner("regr.fnn")
set.seed(1); mod1 = train(lrn, task = task)
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod2 = train(lrn2, task = task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))
# compare defaults
defs$k
?FNN::knn.reg
lambda
or beta_epsilon
h2o.glm
does not use its default value.is not a parameter in mlr and is used inconsistently within
h2o 3)
h2ouses a
seed` parameter. mlr does not have this param, so we cannot fix the 'seed' to get reproducible results.
task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.h2o.glm")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod1 = train(lrn, task)
set.seed(1); mod2 = train(lrn2, task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))
formals(h2o::h2o.glm)$lambda mod1$learner.model@allparameters$lambda mod2$learner.model@allparameters$lambda formals(h2o::h2o.glm)$beta_epsilon mod1$learner.model@allparameters$beta_epsilon # default not used by h2o mod2$learner.model@allparameters$beta_epsilon
mod1$learner.model@allparameters$max_active_predictors mod1$learner.model@allparameters$solver
mod2$learner.model@allparameters$max_active_predictors mod2$learner.model@allparameters$solver
mod2$learner.model@allparameters$seed mod1$learner.model@allparameters$seed
- [ ] regr.h2o.randomForest
We have similar problems as in regr.h2o.glm here: The `seed` param is also "missing" here.
Moreover, there is a precision problem for param `sample_rate`
task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)]) lrn = makeLearner("regr.h2o.randomForest") defs = getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod1 = train(lrn, task) set.seed(2); mod2 = train(lrn2, task) sr1 = mod1$learner.model@allparameters$sample_rate sr2 = mod2$learner.model@allparameters$sample_rate sr1 sr2 all.equal(sr1, sr2) sprintf("%.20f", sr1) sprintf("%.20f", sr2) mod1$learner.model@allparameters$seed mod2$learner.model@allparameters$seed
- [x] regr.kknn
The default for the `kernel` param ist `optimal`. It is also missing in the list of possible values in the param set.
- [x] regr.laGP (merged)
Wrong default value for param `g`. It is NOT `1/1000`, but `1/10000`
- [x] regr.LiblineaRL2L1SVR
Wrong default value for param `epsilon`. It is NOT `0.1`, but `0.01` (NO!)
UPDATE (20.01.2017):
When trying to fix the learner I checked the manual once more and found that the default for `epsilon` is set correctly. The difference in the performance of the default vs. the learner without params in the posted example, does only happen if one has less than 6 attributes in the dataset. And this is not a mlr issue (I also played around with the original algo where the same is happening)
- [ ] regr.nnet
Two problems here:
1) The error message occurs since the `linout` argument is passed directly to the `nnet`-call in the train function. If we want to change the default for `linout` in mlr, we need to pass `linout = TRUE` in the `par.vals`section and delete it from the `nnet`-calls
2) There is no default for size (as far as I can see) in `nnet()`. We should remove the default of 3 then.
task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)]) lrn = makeLearner("regr.nnet") defs = getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod1 = train(lrn, task) set.seed(1); mod2 = train(lrn2, task) p1 = predict(mod1, task) p2 = predict(mod2, task) all.equal(performance(p1), performance(p2))
defs$linout <- NULL defs$size <- NULL lrn2 = setHyperPars2(lrn, defs) mod2 = train(lrn2, task) p2 = predict(mod2, task) all.equal(performance(p1), performance(p2))
- [x] regr.nodeHarvest (merged)
The default for `maxinter` is wrong. It is NOT `1`, but `2`.
- [ ] regr.pcr
I think, the methods passed here, do not exist for pcr. For pcr there are only two options possible: "svdpc" (the default), and "model.frame".
However, "model.frame" may not be suitable in mlr context, since this returns an object of class 'data.frame' and then predict doesn't work any more
By the way, I couldn't find a test for pcr.
lrn = makeLearner("regr.pcr", par.vals = list(method = "widekernelpls")) mod = train(lrn, task)
- [ ] regr.rvm
The default for param `alpha` is wrong: the method for formula calls the method for matrix at some point. And the method for matrix has the following default for alpha: `alpha = ncol(as.matrix(x))`.
Thus, we shouldn't have a default for lambda.
task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)]) lrn = makeLearner("regr.rvm") defs = getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod1 = train(lrn, task) set.seed(1); mod2 = train(lrn2, task)
defs$kernel defs$degree <- NULL defs$scale <- NULL defs$offset <- NULL defs$order <- NULL lrn2 = setHyperPars2(lrn, defs) mod2 = train(lrn2, task) p1 = predict(mod1, task) p2 = predict(mod2, task) all.equal(performance(p1), performance(p2))
defs$alpha <- NULL lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod2 = train(lrn2, task) p2 = predict(mod2, task) all.equal(performance(p1), performance(p2))
- [ ] regr.xgboost
The default for param `lambda` is wrong: It is `1` and not `0`.
Checking all the parameters of xgboost I saw that mlr is missing some parameters (see here:
[http://xgboost.readthedocs.io/en/latest/parameter.html#parameters-in-r-package](url)). Hopefully, they now document only the parameters that come with the CRAN version and not all parameters (including those that are only available for the devel version).
So these are the parameters mlr misses: `max_delta_step`, `tree_method`, `sketch_eps`, `update`, `refresh_leaf`, `process_type`, `tweedie_variance_power`. And these two parameters for `booster = dart_booster`: `sample_type`,`one_drop`
**Survival learners**
- [x] surv.cforest
and all other cforest learners. There is no bug here, just recognized, that the cforest docu says
> `fraction` fraction of number of observations to draw without replacement (only relevant if replace = FALSE)
So we should probably set `requires = quote(replace == FALSE)` in the param set for `fraction`
- [ ] surv.cv.CoxBoost
lrn = makeLearner("surv.cv.CoxBoost", par.vals = list(K = 12)) mod = train(lrn, surv.task)
Problem: to determine the optimal number of steps, cv.CoxBooost is evaluated with the parameters passed in par.vals. The result of the evalution becomes parameter `stepno` and parameter `maxstepno` is set to NULL. Then the `CoxBoost` function is called with all the params defined in par.vals + `stepno`. But function `CoxBoost` does not have the same parameters as `cv.CoxBoost` i.e. `K` is not a parameter in `CoxBoost`.
Moreover, the default value for `upload.x` is `TRUE` not `FALSE`.
Furthermore, there are a couple of parameters missing (or set to NULL) that I am not certain about if we want to have them:
- `unpen.index` (vector of length p.unpen with indices of mandatory covariates, where parameter estimation should be performed unpenalized.) is always set to NULL.
- `sf.scheme` (scheme for changing step sizes via stepsize.factor)
- `pendistmat`: Might be useful since
> According to this information penalty changes due to stepsize.factor < 1 are propagated
But then we will probably need other alues then NULL for p.unpen and parameter `connected.index`, which is also missing.
- `x.is.01`
- [x] surv.cvglmnet
Problem is that we use `glmnet::glmnet.control()` in line 52. Let's take a look at the docu
> If called with no arguments, glmnet.control() returns a list with the current settings of these parameters. Any arguments included in the call sets those parameters to the new values, and then silently returns. The values set are persistent for the duration of the R session.
Thus, what happens is that we do NOT replace all parameters of `glmnet.control`, but only those that are displayed as being the current setting. For example is `factory` not part of this current setting, hence it is not possible to change the parameter. Moreover, this will throw an error.
lrn = makeLearner("surv.cvglmnet", par.vals = list(factory = FALSE)) mod = train(lrn, surv.task)
We only need to change line 52 `saved.ctrl = glmnet::glmnet.control()` to `saved.ctrl = formals(glmnet::glmnet.control)`
We should check which learners are also affect bythis (I think saw this now 2-3 times).
**Cluster analysis**
- [ ] measure db cannot handle missing values
But we have no note to inform the user and `clusterSim::index.DB` does not have this either.
task = subsetTask(noclass.task, subset = c(1:70), features = getTaskFeatureNames(noclass.task)) rin = makeResampleInstance("Holdout", task) lrn = makeLearner("cluster.dbscan") set.seed(1); mod = train(lrn, task, subset = rin$train.inds[[1]]) p1 = predict(mod, task, subset = rin$test.inds[[1]]) p1$data
performance(p1, task = task)
p1$data = p1$data[-(nrow(p1$data)-1),]
performance(p1, task = task)
- [x] cluster.cmeans
lrn = makeLearner("cluster.cmeans") defs = getDefaults(getParamSet(lrn)) lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod2 = train(lrn2, task)
There is no param called `nstart`. We need to remove this param.
Furthermore, it is possible to pass a `reltol` param as a control param. This one is missing.
**Multilabel.classification**
- [x] multilabel.cforest
task = subsetTask(multilabel.task, subset = c(1:70)) lrn = makeLearner("multilabel.cforest") defs = getDefaults(getParamSet(lrn)) lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod2= train(lrn2, task)
There are no params `pvalue`, `minprob` and `randomsplits` in the controls of cforest, according to the error message. But actaully these parameters exists. See here
party::cforest_control()
But the `cforest_control()` function calls `ctree_control` with ... (where the pvalue for instance lives in) and `ctree_control` does not have such a parameter.
Since this is not the first time I stumbled over the controsl of cforest I would suggest to discuess whether we should use `ctree_control` rather than `cforest_control`. Furthermore, why don't we use what cforest suggests: `cforest_unbiased()`
I remember that it has to do with reproducability.
- [x] multilabel.randomForestSRC (merged)
task = subsetTask(multilabel.task, subset = c(1:70)) lrn = makeLearner("multilabel.randomForestSRC") defs = getDefaults(getParamSet(lrn)) defs lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod = train(lrn, task) set.seed(1); mod = train(lrn2, task)
- default `nodesize = NULL` not `1`, special.vals = list( NULL) needs to be added
- default `nodedepth = NULL` not `-1, special.vals = list( NULL) needs to be added
- default `splitrule = NULL`not `gingi, special.vals = list( NULL) needs to be added
- default `membership = FALSE` not `TRUE`
Sounds good, thanks for doing this!
Finished the list!
Now, I and whoever feels like it can fix a learner in a PR. I tried to check all params whenever I encountered a bug,but please, check all the parameters once more. Four eyes see more than two :-) And please, just set a check mark when you fix a learner, such that only one person works on a learner.
According to our monday meetup the following should be applied for parameters which have a default that is NULL
in the algorithms help page and whenever we don't have a default for that parameter or we have another default than NULL
for feasability in the past: Then set the param to NULL
and use special.vals = list( NULL)
.
I'm working now on the (cv)glmnet learners
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Wrt to #1034 and #711 I implemented the following test and looked at all the errors. Since the test throws quite a lot of errors I will use this issue to make a list with all the errors that I get and which need to be fixed in order to get the test run through. We thought it would be better to fix the problems in seperate PRs. What do you think?
The test is not finished yet since I do not know yet how to handle default parameters that have a require statement. But I considered this case when starting debugging the first errors.