mlr-org / mlr

Machine Learning in R
https://mlr.mlr-org.com

Learner bug list: Test learners with defaults against no params #1370

Closed. MariaErdmann closed this issue 4 years ago.

MariaErdmann commented 7 years ago

With regard to #1034 and #711 I implemented the following test and looked at all the errors. Since the test throws quite a lot of errors, I will use this issue to make a list of all the errors that I get and which need to be fixed in order to get the test to run through. We thought it would be better to fix the problems in separate PRs. What do you think?

The test is not finished yet, since I do not know how to handle default parameters that have a `requires` statement. But I kept this case in mind when debugging the first errors.

testThatLearnerHasCorrectDefaults = function(lrn, task) {

  rin = makeResampleInstance("Holdout", task = task)
  set.seed(1); m1 = train(lrn, task, subset = rin$train.inds[[1]])
  lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
  set.seed(1); m2 = train(lrn2, task, subset = rin$train.inds[[1]])

  p1 = predict(m1, task, subset = rin$test.inds[[1]])
  p2 = predict(m2, task, subset = rin$test.inds[[1]])
  expect_equal(performance(p1), performance(p2))
}
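One possible way to handle the `requires` case, as a rough, untested sketch (`filterFeasibleDefaults` is a hypothetical helper; it assumes the `requires` expressions only reference other parameters present in the defaults):

# drop defaults whose requires expression is not satisfied by the other
# defaults, so that setHyperPars2 only sees feasible settings
filterFeasibleDefaults = function(par.set, defs) {
  keep = vapply(names(defs), function(pn) {
    req = par.set$pars[[pn]]$requires
    if (is.null(req)) return(TRUE)
    isTRUE(tryCatch(eval(req, envir = defs), error = function(e) FALSE))
  }, logical(1L))
  defs[keep]
}
# usage in the test:
# lrn2 = setHyperPars2(lrn, filterFeasibleDefaults(lrn$par.set, getDefaults(lrn$par.set)))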
MariaErdmann commented 7 years ago

Classification learners (and regression if there is the same bug in both)

lrn = makeLearner("classif.avNNet", par.vals = list(repeats = 5))
mod = train(lrn, binaryclass.task)
lrn = makeLearner("classif.avNNet", par.vals = list(bag = FALSE))
mod = train(lrn, binaryclass.task)

# Problem is this part in the learner's file line 36 - 40
# ind = stri_detect_regex(nms, "repeats") --> produces a vector with one TRUE and rest is FALSE
# if (sum(ind)>0) --> is TRUE
# repeats = .learner$par.vals[[ind]] --> one [] is sufficient
# ind = stri_detect_regex(nms, "bag")
# if (sum(ind)>0)
#  bag = .learner$par.vals[[ind]]
# the same problem with the regr-learner
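A minimal illustration of the indexing bug (hypothetical par.vals, just to show the `[[` behavior):

pv = list(size = 3, repeats = 5)
ind = stringi::stri_detect_regex(names(pv), "repeats") # c(FALSE, TRUE)
# pv[[ind]] fails: `[[` treats a length-2 index vector as recursive indexing
pv[ind]          # single bracket returns list(repeats = 5)
pv[[which(ind)]] # or extract the value directly: 5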
lrn = makeLearner("classif.binomial")
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
mod2 = train(lrn2, binaryclass.task)

# Problem is here: we need to add model into the signature of the train function
lrn = makeLearner("classif.blackboost")
mod = train(lrn, binaryclass.task)
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
mod2 = train(lrn2, task)

# There are 2 problems here:
# 1) the dots passed to
# tc = learnerArgsToControl(party::ctree_control, teststat, testtype, mincriterion, maxdepth,
# savesplitstats, ...) in line 51 of the learner
# 2) there are some parameters in the learner that cannot be passed to blackboost, including minsplit,
# minbucket, stump, nresample, maxsurrogate, mtry
lrn = makeLearner("classif.bst")
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
mod = train(lrn2, binaryclass.task)

# maxcompete is missing in the train function's signature and in the control.tree list
# Moreover, I am not sure if all these rpart tree control parameters are passed;
# the bst code does not indicate that they are

Changing the default of 'center' to TRUE fixes the problem:

defs$center = TRUE
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod1 = train(lrn, task)
set.seed(2); mod2 = train(lrn2, task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))


- [ ] classif.gausspr and regr.gausspr

task = binaryclass.task
lrn = makeLearner("classif.gausspr")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
mod = train(lrn2, task)

This is not an mlr problem, I think. It is rather a bug in gausspr: I guess scaled is passed to model.frame.default via the dots when we specify scaled = TRUE explicitly.

this throws the same error

kernlab::gausspr(getTaskFormula(task), data = getTaskData(task), scaled = TRUE)

this throws no error

kernlab::gausspr(getTaskFormula(task), data = getTaskData(task))


- [ ] classif.h2o.deeplearning and regr.h2o.deeplearning

lrn = makeLearner("classif.h2o.deeplearning") rin = makeResampleInstance("Holdout", task = binaryclass.task) set.seed(1); m1 = train(lrn, task) defs <- getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, task)

Illegal argument(s) for DeepLearning model:

DeepLearning_model_R_1480669568313_227. Details: ERRR on field:

_hidden_dropout_ratios: Must have 2 hidden layer dropout ratios.

"Solution":

1) we need to set a require statement for hidden_dropout_ratios:

hidden_dropout_ratios needs activation to be c("TanhWithDropout", "RectifierWithDropout",

"MaxoutWithDropout")

however, this does not solve the problem since the default of deeplearning seems to be broken

the right default for hidden_dropout_ratios is not 0.5 but should rather be rep(0.5, length(hidden))

we have the same default as they have

removing the ratios from the paramset or setting it to 2 solves the "problem"

defs$hidden_dropout_ratios <- 2 lrn2 <- setHyperPars2(lrn, defs) mod = train(lrn2, task)
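A sketch of the proposed requires statement for the param set (assuming the usual ParamHelpers constructor; the bounds are my guess and should be checked):

makeNumericVectorLearnerParam(id = "hidden_dropout_ratios", lower = 0, upper = 1,
  requires = quote(activation %in% c("TanhWithDropout", "RectifierWithDropout", "MaxoutWithDropout")))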


- [x] classif.hdrda (merged)

lrn = makeLearner("classif.hdrda") rin = makeResampleInstance("Holdout", task = binaryclass.task) set.seed(1); m1 = train(lrn, binaryclass.task) defs <- getDefaults(lrn$par.set) lrn2 = setHyperPars2(lrn, defs) mod = train(lrn2, binaryclass.task)

Error in regdiscrim_estimates(x = x, y = y, cov = FALSE, prior = prior, :
  unused argument (projected = FALSE)

Solution: 'projected' is a logical indicating whether newdata have already been projected to a q-dimensional subspace. It is only used in the predict function, so we need to add 'when = "predict"' in line 12 for

makeLogicalLearnerParam(id = "projected", default = FALSE)


- [ ] classif.multinom

lrn = makeLearner("classif.multinom", par.vals = list(rang = 0.3)) mod = train(lrn, binaryclass.task)

I don't know if there is a solution here, since this is a bug in nnet::multinom


- [ ] classif.earth

lrn = makeLearner("classif.earth") defs = getDefaults(lrn$par.set) defs$pmethod

In their help a default for pmethod is not specified, but from the code we can infer that the method used when nothing is passed is "backward", since it is the first method mentioned and they use argument matching:

pmethod <- match.arg1(pmethod, "pmethod")

Should we have a default in this case? Moreover, the 'nfold' parameter needs a requires statement for pmethod = "cv".

lrn2 = setHyperPars2(lrn, defs)
mod = train(lrn2, binaryclass.task)
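If we do keep a default, the param set entries could look roughly like this (the `values` list is read off earth's docs and should be double-checked):

makeDiscreteLearnerParam(id = "pmethod", default = "backward",
  values = c("backward", "none", "exhaustive", "forward", "seqrep", "cv"))
makeIntegerLearnerParam(id = "nfold", lower = 0L, default = 0L,
  requires = quote(pmethod == "cv"))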


- [x] classif.nodeHarvest (merged)

lrn = makeLearner("classif.nodeHarvest") set.seed(1); m1 = train(lrn, binaryclass.task) lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set)) set.seed(1); m2 = train(lrn2, binaryclass.task) p1 = predict(m1, binaryclass.task) p2 = predict(m2, binaryclass.task) all.equal(performance(p1), performance(p2))

Solution:

The correct default value for 'maxinter' is 2, but we had set the default to '1'.


- [ ] classif.penalized.fusedlasso

lrn = makeLearner("classif.penalized.fusedlasso") mod = train(lrn, binaryclass.task) defs = getDefaults(lrn$par.set) defs$model

`defs$model` does not make sense, since this is a classif learner. The values for `model` and its default are wrong.

lrn2 = setHyperPars2(lrn, defs)
mod = train(lrn2, binaryclass.task)

In the train function the model argument is fixed to 'logistic', so I guess we need to delete the param 'model', since for classif there is only one reasonable possibility ('logistic'), and leave the train function as it is.


- [ ] classif.rknn

lrn = makeLearner("classif.rknn") defs = getDefaults(lrn$par.set) set.seed(1); m1 = train(lrn, binaryclass.task) lrn2 = setHyperPars2(lrn, defs) set.seed(1); m2 = train(lrn, binaryclass.task) p1 = predict(m1, binaryclass.task) p2 = predict(m2, binaryclass.task) all.equal(performance(p1), performance(p2))

But subsetting leads to a difference in performance:

rin = makeResampleInstance("Holdout", task = binaryclass.task) set.seed(1); m1b = train(lrn, binaryclass.task, subset = rin$train.inds[[1]]) set.seed(1); m2b = train(lrn2, binaryclass.task, subset = rin$train.inds[[1]]) p1b = predict(m1b, binaryclass.task, subset = rin$test.inds[[1]]) p2b = predict(m2b, binaryclass.task, subset = rin$test.inds[[1]]) all.equal(performance(p1b), performance(p2b))


**Regression learners**

- [ ] regr.bartMachine

I have no idea what the problem is:
task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.bartMachine")
defs = getDefaults(lrn$par.set)
set.seed(1); mod = train(lrn, task)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod2 = train(lrn2, task)

d = getTaskData(task, target.extra = TRUE)
pars = list(X = d$data, y = d$target)
pars = c(pars, lrn2$par.vals)
p = do.call(bartMachine::build_bart_machine, pars)

- [ ] regr.crs

task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.crs")
set.seed(1); mod1 = train(lrn, task)
lrn2 = setHyperPars2(lrn, getDefaults(lrn$par.set))
set.seed(1); mod2 = train(lrn2, task)

# if degree is set to default = 3 (which is also the default of the crs help)
crs::crs(formula = getTaskFormula(task), data = getTaskData(task), degree = 3)
# we get the same error
# but we need to pass a vector of degrees with as many entries as there are
# numeric predictors in the data (in this example: 2 numeric covariates)
crs::crs(formula = getTaskFormula(task), data = getTaskData(task), degree = c(3,3))

- [ ] regr.fnn

We have the wrong default for parameter `k` (mlr's default: 1 vs. FNN::knn.reg's default: 3). If we change this, everything is fine.

lrn = makeLearner("regr.fnn")
set.seed(1); mod1 = train(lrn, task = task)
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod2 = train(lrn2, task = task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))

# compare defaults
defs$k
?FNN::knn.reg

- [ ] regr.h2o.glm

`lambda` is something other than the default value h2o.glm offers. For the default see here:

formals(h2o::h2o.glm)$lambda
mod1$learner.model@allparameters$lambda
mod2$learner.model@allparameters$lambda
formals(h2o::h2o.glm)$beta_epsilon
mod1$learner.model@allparameters$beta_epsilon # default not used by h2o
mod2$learner.model@allparameters$beta_epsilon

'max_active_predictors' is not a param in mlr, but it is different for the two models:

mod1$learner.model@allparameters$max_active_predictors
mod1$learner.model@allparameters$solver

The help page of h2o tells the user that max_active_predictors is set to 7000 when solver = "IRLSM" is used. Well, the solver is "IRLSM", but my max_active_predictors is 100000000:

mod2$learner.model@allparameters$max_active_predictors
mod2$learner.model@allparameters$solver

h2o internally uses a seed which is NULL per default, so it differs between the two models. mlr has no param 'seed' with which we could fix this for reproducibility:

mod2$learner.model@allparameters$seed
mod1$learner.model@allparameters$seed
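One option would be to expose h2o's seed as an untunable param, roughly like this (hypothetical addition; the id matches h2o's argument name):

makeIntegerLearnerParam(id = "seed", tunable = FALSE)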


- [ ] regr.h2o.randomForest
We have similar problems as in regr.h2o.glm here: The `seed` param is also "missing" here.
Moreover, there is a precision problem for param `sample_rate`

task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.h2o.randomForest")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod1 = train(lrn, task)
set.seed(2); mod2 = train(lrn2, task)
sr1 = mod1$learner.model@allparameters$sample_rate
sr2 = mod2$learner.model@allparameters$sample_rate
sr1
sr2
all.equal(sr1, sr2)
sprintf("%.20f", sr1)
sprintf("%.20f", sr2)
mod1$learner.model@allparameters$seed
mod2$learner.model@allparameters$seed


- [x] regr.kknn
The default for the `kernel` param is `"optimal"`, and this value is also missing from the list of possible values in the param set.

- [x] regr.laGP (merged)
Wrong default value for param `g`. It is NOT `1/1000`, but `1/10000`

- [x] regr.LiblineaRL2L1SVR
Wrong default value for param `epsilon`. It is NOT `0.1`, but `0.01` (NO!)
UPDATE (20.01.2017):
When trying to fix the learner I checked the manual once more and found that the default for `epsilon` is set correctly. The difference in performance between the defaults and the learner without params in the posted example only happens if the dataset has fewer than 6 attributes. And this is not an mlr issue (I also played around with the original algorithm, where the same thing happens).

- [ ] regr.nnet
Two problems here:
1) The error message occurs because the `linout` argument is passed directly to the `nnet` call in the train function. If we want to change the default for `linout` in mlr, we need to pass `linout = TRUE` in the `par.vals` section and delete it from the `nnet` calls.
2) There is no default for `size` (as far as I can see) in `nnet()`. We should remove the default of 3 then.

task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.nnet")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod1 = train(lrn, task)
set.seed(1); mod2 = train(lrn2, task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))

defs$linout <- NULL
defs$size <- NULL
lrn2 = setHyperPars2(lrn, defs)
mod2 = train(lrn2, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))


- [x] regr.nodeHarvest (merged)
The default for `maxinter` is wrong. It is NOT `1`, but `2`.

- [ ] regr.pcr
I think the methods passed here do not exist for pcr. For pcr only two options are possible: "svdpc" (the default) and "model.frame".
However, "model.frame" may not be suitable in the mlr context, since it returns an object of class 'data.frame' and then predict doesn't work any more.
By the way, I couldn't find a test for pcr.

lrn = makeLearner("regr.pcr", par.vals = list(method = "widekernelpls")) mod = train(lrn, task)


- [ ] regr.rvm
The default for param `alpha` is wrong: the method for formula calls the method for matrix at some point, and the method for matrix has the following default for alpha: `alpha = ncol(as.matrix(x))`.
Thus, we shouldn't have a default for `alpha`.

task = subsetTask(regr.task, subset = c(1:70), features = getTaskFeatureNames(regr.task)[c(1, 3)])
lrn = makeLearner("regr.rvm")
defs = getDefaults(lrn$par.set)
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod1 = train(lrn, task)
set.seed(1); mod2 = train(lrn2, task)

Since the default kernel is 'rbfdot', we need to set all default params for other kernels to NULL:

defs$kernel
defs$degree <- NULL
defs$scale <- NULL
defs$offset <- NULL
defs$order <- NULL
lrn2 = setHyperPars2(lrn, defs)
mod2 = train(lrn2, task)
p1 = predict(mod1, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))

But there is still a difference, because of the default value of param `alpha`:

defs$alpha <- NULL
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod2 = train(lrn2, task)
p2 = predict(mod2, task)
all.equal(performance(p1), performance(p2))


- [ ] regr.xgboost
The default for param `lambda` is wrong: It is `1` and not `0`.
Checking all the parameters of xgboost I saw that mlr is missing some parameters (see here:
http://xgboost.readthedocs.io/en/latest/parameter.html#parameters-in-r-package). Hopefully, they now document only the parameters that come with the CRAN version and not all parameters (including those that are only available for the devel version).
So these are the parameters mlr misses: `max_delta_step`, `tree_method`, `sketch_eps`, `update`, `refresh_leaf`, `process_type`, `tweedie_variance_power`. And these two parameters for `booster = dart_booster`: `sample_type`, `one_drop`.

**Survival learners**

- [x] surv.cforest
and all other cforest learners. There is no bug here; I just noticed that the cforest docu says

> `fraction`  fraction of number of observations to draw without replacement (only relevant if replace = FALSE)

So we should probably set `requires = quote(replace == FALSE)` in the param set for `fraction`
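Sketched like this (0.632 is cforest_control's documented default, but please double-check):

makeNumericLearnerParam(id = "fraction", lower = 0, upper = 1, default = 0.632,
  requires = quote(replace == FALSE))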

- [ ] surv.cv.CoxBoost

lrn = makeLearner("surv.cv.CoxBoost", par.vals = list(K = 12)) mod = train(lrn, surv.task)

 Problem: to determine the optimal number of steps, cv.CoxBoost is evaluated with the parameters passed in par.vals. The result of the evaluation becomes parameter `stepno`, and parameter `maxstepno` is set to NULL. Then the `CoxBoost` function is called with all the params defined in par.vals plus `stepno`. But function `CoxBoost` does not have the same parameters as `cv.CoxBoost`, i.e. `K` is not a parameter in `CoxBoost` (a possible fix is sketched below, after the list of missing parameters).

Moreover, the default value for `upload.x` is `TRUE` not `FALSE`.

Furthermore, there are a couple of parameters missing (or set to NULL) that I am not certain about if we want to have them:
  - `unpen.index` (vector of length p.unpen with indices of mandatory covariates,  where parameter estimation should be performed unpenalized.) is always set to NULL.
  - `sf.scheme` (scheme for changing step sizes via stepsize.factor)
  - `pendistmat`: Might be useful since 

> According to this information penalty changes due to stepsize.factor < 1 are propagated

  But then we will probably need other values than NULL for p.unpen and parameter `connected.index`, which is also missing.

  - `x.is.01`
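A rough, untested sketch of a fix for the `K` problem: after cross-validation, forward only the arguments that `CoxBoost::CoxBoost` actually accepts (`optimal.stepno`, `time`, `status` and `x` are placeholders for values the train function already has):

pars = c(lrn$par.vals, list(stepno = optimal.stepno))
# keep only arguments that CoxBoost::CoxBoost actually accepts
ok = intersect(names(pars), names(formals(CoxBoost::CoxBoost)))
mod = do.call(CoxBoost::CoxBoost, c(list(time = time, status = status, x = x), pars[ok]))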

- [x]  surv.cvglmnet
Problem is that we use `glmnet::glmnet.control()` in line 52. Let's take a look at the docu

> If called with no arguments, glmnet.control() returns a list with the current settings of these parameters. Any arguments included in the call sets those parameters to the new values, and then silently returns. The values set are persistent for the duration of the R session.

Thus, what happens is that we do NOT replace all parameters of `glmnet.control`, but only those that are displayed as being the current setting. For example, `factory` is not part of this current setting, hence it is not possible to change that parameter. Moreover, this will throw an error:

lrn = makeLearner("surv.cvglmnet", par.vals = list(factory = FALSE)) mod = train(lrn, surv.task)


 We only need to change line 52 from `saved.ctrl = glmnet::glmnet.control()` to `saved.ctrl = formals(glmnet::glmnet.control)`.
 We should check which other learners are also affected by this (I think I have seen this 2-3 times now).
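A quick way to see the difference (run in a fresh session):

names(glmnet::glmnet.control())        # only the currently tracked settings
names(formals(glmnet::glmnet.control)) # all arguments, including factory
"factory" %in% names(glmnet::glmnet.control())        # FALSE, per the behaviour above
"factory" %in% names(formals(glmnet::glmnet.control)) # TRUE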

**Cluster analysis**

- [ ] measure db cannot handle missing values
 But we have no note informing the user, and `clusterSim::index.DB` does not have one either.

task = subsetTask(noclass.task, subset = c(1:70), features = getTaskFeatureNames(noclass.task))
rin = makeResampleInstance("Holdout", task)
lrn = makeLearner("cluster.dbscan")
set.seed(1); mod = train(lrn, task, subset = rin$train.inds[[1]])
p1 = predict(mod, task, subset = rin$test.inds[[1]])
p1$data

The second to last line has an NA value for the response:

performance(p1, task = task)

Let's remove the value

p1$data = p1$data[-(nrow(p1$data)-1),]

Now we can calculate the db measure

performance(p1, task = task)


-  [x] cluster.cmeans

lrn = makeLearner("cluster.cmeans") defs = getDefaults(getParamSet(lrn)) lrn2 = setHyperPars2(lrn, defs) set.seed(1); mod2 = train(lrn2, task)

There is no param called `nstart`. We need to remove this param.
Furthermore, it is possible to pass a `reltol` param as a control param. This one is missing.

**Multilabel classification**

- [x] multilabel.cforest

task = subsetTask(multilabel.task, subset = c(1:70))
lrn = makeLearner("multilabel.cforest")
defs = getDefaults(getParamSet(lrn))
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod2 = train(lrn2, task)

According to the error message, there are no params `pvalue`, `minprob` and `randomsplits` in the controls of cforest. But actually these parameters exist. See here:

party::cforest_control()

But the `cforest_control()` function calls `ctree_control` with `...` (which is where, e.g., `pvalue` lives), and `ctree_control` does not have such a parameter.
Since this is not the first time I stumbled over the controls of cforest, I would suggest discussing whether we should use `ctree_control` rather than `cforest_control`. Furthermore, why don't we use what cforest suggests, `cforest_unbiased()`?
I remember that it has to do with reproducibility.
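To see the mismatch directly, compare the formal arguments of the two control constructors:

names(formals(party::cforest_control))
names(formals(party::ctree_control))
# params such as pvalue are forwarded via ... in cforest_control and then
# rejected by ctree_control, as described above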

- [x] multilabel.randomForestSRC (merged)

task = subsetTask(multilabel.task, subset = c(1:70))
lrn = makeLearner("multilabel.randomForestSRC")
defs = getDefaults(getParamSet(lrn))
defs
lrn2 = setHyperPars2(lrn, defs)
set.seed(1); mod = train(lrn, task)
set.seed(1); mod = train(lrn2, task)


- default is `nodesize = NULL`, not `1`; `special.vals = list(NULL)` needs to be added
- default is `nodedepth = NULL`, not `-1`; `special.vals = list(NULL)` needs to be added
- default is `splitrule = NULL`, not `"gini"`; `special.vals = list(NULL)` needs to be added
- default is `membership = FALSE`, not `TRUE`
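Sketched for `nodesize` (hypothetical; the other two NULL defaults follow the same pattern):

makeIntegerLearnerParam(id = "nodesize", lower = 1L, default = NULL,
  special.vals = list(NULL))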
larskotthoff commented 7 years ago

Sounds good, thanks for doing this!

MariaErdmann commented 7 years ago

Finished the list! Now I, and whoever feels like it, can fix a learner in a PR. I tried to check all params whenever I encountered a bug, but please check all the parameters once more. Four eyes see more than two :-) And please set a check mark when you fix a learner, so that only one person works on a learner. According to our Monday meetup, the following should be applied to parameters whose default is NULL in the algorithm's help page, whenever we either have no default for that parameter or have, for feasibility reasons, used a default other than NULL in the past: set the param's default to NULL and use `special.vals = list(NULL)`.

Coorsaa commented 7 years ago

I'm working now on the (cv)glmnet learners

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.