Breaking Backward Compatibility

Not an issue but a running list of things that we might change in a single release that would break backward compatibility (which I have almost always avoided).

Suggestions and comments are welcome...

change how preProcess options are passed from train (via the preProcess argument to train as opposed to being in trainControl)
move metric argument in sbf, rfe and other feature selection routines to their control objects
add a parameter range module to each model. Right now this is baked into the grid module
move function modules out of list and into an environment?
pass ... to the grid function
refactor options in the control functions into lists of arguments? For example, resampling currently has these options:

trainControl(method, number, repeats, p, initialWindow, horizon, fixedWindow, ...)

It might be cleaner to change this to something like:

trainControl(resample = list(method, number, repeats, p, initialWindow, horizon, fixedWindow))

Overall, it would look like:

trainControl(resample = list(method = "boot", 
                             number = ifelse(grepl("cv", method), 10, 25),
                             repeats = ifelse(grepl("cv", method), 1, number), 
                             p = 0.75, 
                             initialWindow = NULL,
                             horizon = 1,
                             fixedWindow = TRUE, 
                             verboseIter = FALSE, 
                             returnData = TRUE),
             search = "grid",
             returnResamp = "final", 
             savePredictions = FALSE, 
             classProbs = FALSE,
             functions = list(summary= defaultSummary, 
                              selection = "best"),
             preProcOptions = list(thresh = 0.95, 
                                   ICAcomp = 3, 
                                   k = 5), 
             sampling = NULL,
             index = list(model = NULL, holdout = NULL, final = NULL), 
             timingSamps = 0,
             predictionBounds = rep(FALSE, 2), 
             seeds = NA, 
             adaptive = list(min = 5,
                             alpha = 0.05, 
                             method = "gls", 
                             complete = TRUE), 
             trim = FALSE,
             allowParallel = TRUE)

Breaking API is really bad unless necessary for some major reason. A lot of this could be accomplished without breaking any backwards compatibility.

Moving metric from sbf() and rfe() to the control function would break consistency with its location in train(). Alternatively, backwards compatibility could be maintained by adding this to the control functions and still keeping it where it is. Duplication like this isn't so great, however.
A parameter range for each model is very interesting. Being able to easily extract these ranges for a given model would be useful and would greatly facilitate constructing custom tune grids. Would this ultimately be intended to replace the code generating the tune grid? Each model's grid code contains a lot of logic for including or sampling appropriate combinations of tuning parameters, so replacing it would seem very difficult.
I would imagine method, number, and repeats are some of the most frequently used options in trainControl(). While grouping them together into a list (like adaptive) would add consistency, it is more verbose. At that point, it encourages the user to define the list outside of trainControl() instead of inline (much like defining trainControl() outside of train() instead of inline), so that's not all bad. How about a backwards compatible change? The arguments number, repeats, etc. remain and are used unless a list is passed to the argument resample.
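The backwards compatible option in the last point could be sketched roughly as follows. This is only an illustration in base R: train_control_sketch is a hypothetical stand-in for trainControl(), it handles just four of the resampling options, and the rule that entries in resample win over the flat arguments is an assumption, not anything caret does today.

```r
# Hypothetical stand-in for trainControl(); base R only, not caret's code.
train_control_sketch <- function(method = "boot",
                                 number = ifelse(grepl("cv", method), 10, 25),
                                 repeats = ifelse(grepl("cv", method), 1, number),
                                 p = 0.75,
                                 resample = NULL) {
  if (!is.null(resample)) {
    known <- c("method", "number", "repeats", "p")
    unknown <- setdiff(names(resample), known)
    if (length(unknown) > 0) {
      stop("unknown resampling option(s): ", paste(unknown, collapse = ", "))
    }
    # Assumption: entries in `resample` take precedence over the flat
    # arguments. They are assigned before the lazy defaults are evaluated,
    # so a default such as `number` still sees the method given in the list.
    for (nm in names(resample)) {
      assign(nm, resample[[nm]])
    }
  }
  list(resample = list(method = method, number = number,
                       repeats = repeats, p = p))
}

# Old-style and new-style calls resolve to the same settings.
old_style <- train_control_sketch(method = "cv", number = 5)
new_style <- train_control_sketch(resample = list(method = "cv", number = 5))
stopifnot(identical(old_style, new_style))
```

Existing code that passes method, number, and so on directly keeps working, while new code can build the list separately and pass it in.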

I'm not so concerned with backwards compatibility. Considering the future users that benefit from the changes is better. Constantly worrying about backwards compatibility leads to too many controls in the code and makes debugging worse. Besides, nowadays you have the checkpoint package. This is what I started using for production code. This keeps packages to the same version as when I finished the project. No more worrying about something breaking because someone made a change to a package.

Adding an (adjustable?) parameter range to each model would be handy, especially in combination with the ... option in grid. That saves me, and probably a lot of others, from building custom models just to extend the grid parameters. Combining it with the existing adaptive resampling or with Bayesian optimization might lead to some interesting parameter tuning results.
Adding ... will lead to a reduction in the number of models available in caret, which again helps future maintenance.
Using a list of arguments cleans up trainControl and shows which pieces belong together. It will probably lead to people defining these parts outside trainControl, but that is probably not a bad thing. It is what I tend to do when I have these kinds of possibilities.
I would move the preProcess options in train to trainControl. Then all preprocessing is defined in one place.
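
A rough sketch of how a per-model parameter range and a grid function that accepts ... might fit together. The names knn_range and knn_grid are hypothetical, the default bounds are made up for illustration, and the (x, y, len, search) arguments only mirror the shape of the grid module in a custom model. Neither the range module nor the ... pass-through exists in caret as written here.

```r
# Hypothetical "range" module: default bounds that the caller can override.
knn_range <- function(...) {
  defaults <- list(k = c(lower = 1, upper = 25))  # illustrative bounds only
  modifyList(defaults, list(...))
}

# Hypothetical grid module that forwards `...` to the range module.
knn_grid <- function(x, y, len = 3, search = "grid", ...) {
  bounds <- knn_range(...)$k
  lower <- bounds[["lower"]]
  upper <- bounds[["upper"]]
  if (search == "grid") {
    # Evenly spaced candidate values across the range.
    k <- round(seq(lower, upper, length.out = len))
  } else {
    # Random search: draw integer values uniformly from the range.
    k <- lower + sample.int(upper - lower + 1, size = len, replace = TRUE) - 1
  }
  data.frame(k = sort(unique(k)))
}

# Default range, then a wider range supplied through `...`.
default_grid <- knn_grid(x = NULL, y = NULL, len = 3)
wider_grid <- knn_grid(x = NULL, y = NULL, len = 3, k = c(lower = 5, upper = 45))
```

With something like this, extending the grid becomes a matter of passing different bounds rather than writing a custom model.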

All in all it sounds like a lot of work. Might be a good business case for the R consortium.

topepo / caret

Breaking Backward Compatibility #508