tidymodels / model-implementation-principles

recommendations for creating R modeling packages
https://tidymodels.github.io/model-implementation-principles/
41 stars 4 forks source link

Check for integer values for arguments #8

Open EmilHvitfeldt opened 5 years ago

EmilHvitfeldt commented 5 years ago

I don't know if this fits within this repo so please bear with me.

I have been thinking about what happens when we check the type of the arguments we pass into our models, and for the most part that task is fairly easy. But I feel we have overlooked the case when we are concerned with argument such as times, epochs and mtry that require integers. Often the approach is to round down the double silently which I think is a little dangerous.

Examples

library(ranger)

ranger(
  mpg ~ ., 
  data = mtcars, 
  mtry = 3.8, 
  num.trees = 20.2, 
  importance = 'impurity'
)
#> Ranger result
#> 
#> Call:
#>  ranger(mpg ~ ., data = mtcars, mtry = 3.8, num.trees = 20.2,      importance = "impurity") 
#> 
#> Type:                             Regression 
#> Number of trees:                  20 
#> Sample size:                      32 
#> Number of independent variables:  10 
#> Mtry:                             3 
#> Target node size:                 5 
#> Variable importance mode:         impurity 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       5.142068 
#> R squared (OOB):                  0.8584392

library(parsnip)

rand_forest(mode = "regression", mtry = 2.3) %>%
  fit(mpg ~ ., 
      data = mtcars,
      engine = "ranger")
#> parsnip model object
#> 
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(formula = formula, data = data, mtry = 2.3, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
#> 
#> Type:                             Regression 
#> Number of trees:                  500 
#> Sample size:                      32 
#> Number of independent variables:  10 
#> Mtry:                             2 
#> Target node size:                 5 
#> Variable importance mode:         none 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       6.186759 
#> R squared (OOB):                  0.829679

I feel like this problem is already being looked at in the vctrs package with vec_cast but might be something you should consider when implementing models in R.

DavisVaughan commented 5 years ago

I like this as a general principle for writing functions, and vctrs's lossy cast warning functions are pretty nice for this. What's nice about it is that integerish things like the numeric 2 don't trigger the warning but 2.2 would. I personally don't think this is specific enough to be a model principle, but could see it living as a general principle in the functions chapter of the next version of advanced R (but I have no control over that). Maybe Max has a different opinion though because it is a useful point overall.

alexpghayes commented 5 years ago

Certainly I think we should have a section about input validation, potentially listing some easy tests to do to avoid mistakes like this.