Stata Crossvalidation

crossvalidation

The crossvalidate package includes several commands and a Mata library that provide a range of possible cross-validation techniques that can be used with any Stata estimation command returning results in e(). For the majority of users and use cases, the prefix commands (see xv and xvloo) should handle your needs. If, however, you need to implement something a bit different from generic use cases, the package also includes lower level commands that can save you time from having to code the entire cross-validation process. These commands are named after the four-steps found in all cross-validation work: splitit, fitit, predictit, and validateit. There are also a few utility commands that take care of the metaprogramming tasks needed to allow these commands to be applied to the correct fold/split of the data.

Lastly, we implemented the majority of validation metrics that can be found in the R package yardstick in our Mata library so you don't have to. However, if you want to implement your own validation metric that is possible and easy to do (see information below which specifies the function signature to use for your Mata function that will implement your metric) and easy to use with the existing tools (i.e., pass the name of your function as an argument to the metric or monitors options of either the prefix commands or validateit and it will handle the rest).

Examples:

// Load example dataset
sysuse auto.dta, clear

// Simple train/test (TT) split
xv 0.8, pstub(ttpred) metric(mse): reg price mpg length

// Simple train/validation/test (TVT) split
xv 0.6 0.2, pstub(tvtpred) metric(mse) monitors(mape smape): reg price mpg length

// Leave-One-Out cross valiation with a train/test split
xvloo 0.8, pstub(ttloopred) metric(mse): reg price mpg length

// LOO TVT split
xvloo 0.6 0.2, pstub(tvtloopred) metric(mse): reg price mpg length

// K-Fold TT split
xv 0.8, pstub(ttkfpred) metric(mae) kfold(5): reg price mpg length if !mi(rep78)

// K-Fold TVT split
xv 0.6 0.2, pstub(tvtkfpred) metric(mbe) kfold(3): reg price mpg length, vce(rob)

// Clustered K-Fold TT Split
xv 0.8, metric(phl) uid(rep78) kfold(4) display retain: reg price mpg length, vce(rob)

TODO

[ ] Standardize language in help files
[x] Finish writing test cases for ADO commands
[x] Adding optional arguments to continuous metrics where possible
[ ] Adding optional arguments to binary metrics where possible
[ ] Adding optional arguments to multiclass metrics where possible

libxv

Metrics/Monitors

Method Signature

The program will allow users to define their own metrics/monitors that are not contained in libcrossvalidate. In order to do this, users must implement a specific method/function signature:

real scalar metric(string scalar pred, string scalar obs, 
                   string scalar touse, | transmorphic matrix opts)

The function must return a real valued scalar and take three arguments. The three arguments are used to access the data that would be used to compute the metrics/monitors and to provide a method to pass optional arguments to the underlying functions if supported.

Data access

Within the function body, we recommend using the following pattern to access the data needed to compute any metrics/monitors:

real colvector yhat, y
yhat = st_data(., pred, touse)
y = st_data(., obs, touse)

The programs in the cross validate package will handle the construction of the variables and passing them to the function name that users pass to the programs.

Building the library

Once we are ready to build the Mata library we should do the following using an instance of Stata 15.

// Clear everything out of Mata
mata: mata clear 

// Define all of the Mata functions in memory
run crossvalidate.mata

// If the library is already built, use this instead:
lmbuild libxv, replace dir(`"`c(pwd)'"')

Prefix Commands

xv

xv # [#], MEtric(string asis) [seed(integer) Uid(varlist) TPoint(string asis)
             SPLit(string asis) KFold(integer) RESults(string asis) 
             fitnm(string asis) Classes(integer) PStub(string asis) noall 
             MOnitors(string asis) DISplay RETain valnm(string asis)
             PMethod(string asis) POpts(string asis)] : 
             estimation command ...

Syntax and options

# - The proportion of the data set to allocate to the training set. _Note if the training allocation is set to unity, it will automatically trigger the noall option.
[#] - The proportion of the data set to allocate to the validation set.
metric(string asis) - the name of a function from libxv or a user-defined function. Optional arguments for the metric function can be passed as a matrix.
For example, metric(ccc((1))) or metric(ccc(("y"))). See help documentation for cases where options are supported, as well as the required format for those options.
kfold(integer) - An optional parameter that implements K-Fold cross-validation with the number of folds specified by the user. The default is 1, which is the equivalent of simple train/test or train/validation/test splits.
seed(integer) - to set the pseudo-random number generator seed
uid(varlist) - a variable list for clustered sampling/splitting. Note, the variable list must be specified the order from highest to lowest level of nesting of the data; cross-classified sampling is not currently supported.
tpoint(string asis) - a numeric, td(), tc(), or tC() value.
split(string asis) - a new variable name that will store the split identifiers in the data. If a value is passed to the split, pstub, or results options it will trigger the retain option to be turned on. The default value in the case where the retain option is on and no value is passed to split is _xvsplit.
results(string asis) - a stubname for storing estimation results. The name provided cannot end in a number. If a value is passed to the split, pstub, or results options it will trigger the retain option to be turned on. The default value in the case where the retain option is on and no value is passed to results is xvres.
noall - suppresses fitting the model to the entire training set. Note: this is only applicable to cases where kfold > 1 (which includes xvloo).
fitnm(string asis) - is used to name the collection storing the results; the default is xvfit and only applies to users of Stata 17 and above.
classes(integer) - is used to specify the number of classes for classification models; default is 0 which is used to denote non-classification models. Additionally, if using a binary outcome, remember that there are two (2) classes.
threshold(real) - positive outcome threshold for classification in binary outcome models; default is 0.5.
pstub(string asis) - a new variable name for predicted values. If a value is passed to the split, pstub, or results options it will trigger the retain option to be turned on. The default value in the case where the retain option is on and no value is passed to pstub is _xvpred. If the variable _xvpred already exists a suffix based on the timestamp when the command is executed is used.
monitors(string asis) - zero or more function names from libxv or user-defined functions. Optional arguments for the monitors function can be passed as a matrix. For example, monitors(ccc((1)) mae(("y")) rmse((1, 2, 3))).
See help documentation for cases where options are supported, as well as the required format for those options.
valnm(string asis) - is used to name the collection storing the validation results; default is xvval and only applies to users of Stata 17 and above.
pmethod(string asis) - the method (statistic) to predict with the out-of-sample/held-out data. Defaults to xb when classes == 0 and pr in all other cases.
popts(string asis) - an option that allows users to pass options to the predict command in addition to the prediction statistic/method.
display - display estimation and validation results in the results pane; default is off
retain - retains the split and pstub variables and stored estimation results after execution. If this option is specified without arguments passed to the split, pstub, or results options, the default names are used for them.

xvloo

xvloo # [#], MEtric(string asis) [ seed(integer) Uid(varlist) TPoint(string asis)
             SPLit(string asis) RESults(string asis) fitnm(string asis) 
             Classes(integer) PStub(string asis) noall MOnitors(string asis) 
             DISplay RETain valnm(string asis) PMethod(string asis)
             POpts(string asis) ] : 
             estimation command ...

Syntax and options

# - The proportion of the data set to allocate to the training set. _Note if the training allocation is set to unity, it will automatically trigger the noall option.
[#] - The proportion of the data set to allocate to the validation set.
metric(string asis) - the name of a function from libxv or a user-defined function. Optional arguments for the metric function can be passed as a matrix.
For example, metric(ccc((1))) or metric(ccc(("y"))). See help documentation for cases where options are supported, as well as the required format for those options.
seed(integer) - to set the pseudo-random number generator seed
uid(varlist) - a variable list for clustered sampling/splitting. Note, the variable list must be specified the order from highest to lowest level of nesting of the data; cross-classified sampling is not currently supported.
tpoint(string asis) - a numeric, td(), tc(), or tC() value.
split(string asis) - a new variable name that will store the split identifiers in the data. If a value is passed to the split, pstub, or results options it will trigger the retain option to be turned on. The default value in the case where the retain option is on and no value is passed to split is _xvsplit.
results(string asis) - a stubname for storing estimation results. The name provided cannot end in a number. If a value is passed to the split, pstub, or results options it will trigger the retain option to be turned on. The default value in the case where the retain option is on and no value is passed to results is xvres.
noall - suppresses fitting the model to the entire training set. Note: this is only applicable to cases where kfold > 1 (which includes xvloo).
fitnm(string asis) - is used to name the collection storing the results; the default is xvfit and only applies to users of Stata 17 and above.
classes(integer) - is used to specify the number of classes for classification models; default is 0 which is used to denote non-classification models. Additionally, if using a binary outcome, remember that there are two (2) classes.
threshold(real) - positive outcome threshold for classification in binary outcome models; default is 0.5.
pstub(string asis) - a new variable name for predicted values. If a value is passed to the split, pstub, or results options it will trigger the retain option to be turned on. The default value in the case where the retain option is on and no value is passed to pstub is _xvpred. If the variable _xvpred already exists a suffix based on the timestamp when the command is executed is used.
monitors(string asis) - zero or more function names from libxv or user-defined functions. Optional arguments for the monitors function can be passed as a matrix. For example, monitors(ccc((1)) mae(("y")) rmse((1, 2, 3))).
See help documentation for cases where options are supported, as well as the required format for those options.
valnm(string asis) - is used to name the collection storing the validation results; default is xvval and only applies to users of Stata 17 and above.
pmethod(string asis) - the method (statistic) to predict with the out-of-sample/held-out data. Defaults to xb when classes == 0 and pr in all other cases.
popts(string asis) - an option that allows users to pass options to the predict command in addition to the prediction statistic/method.
display - display estimation and validation results in the results pane; default is off
retain - retains the split and pstub variables and stored estimation results after execution. If this option is specified without arguments passed to the split, pstub, or results options, the default names are used for them.

Phase Specific Commands

splitit

splitit # [#] [if] [in] [, Uid(varlist) TPoint(string asis) KFold(integer 1) 
                           SPLit(string asis) loo ]

Syntax and options

# [#] - At least one numeric value in [0, 1]. A single value is used for train/test splits. Two values are used for train/validate/test splits. The sum of the two values must be <= 1.
uid(varlist) - A user specified varlist used to identify units when splitting the data. When this is populated all records associated with the unit identifier will be added to the train/validate/test split. If a time point is also specified, the time point threshold should retain only cases up to the specified time.
tpoint(string asis) - A user specified time value that will be used to split the data into train/validation/test sets. Requires the data to be -xt/tsset-. If a panel variable is defined in -xtset- it will be used to ensure the split includes entire records prior to the time period used for the split. If -uid- is also specified, it must include the panel variable.
kfold(integer 1) - An optional parameter to trigger the use of K-Fold crossvalidation. The number of folds specified will be used to generate the splits.
split(string asis) - An option to specify the name of the variable that will store the identifiers for each of the splits and folds.
loo - An option used when splitting a dataset for use in leave-one-out cross-validation.

fitit

fitit anything(name = cmd), SPLit(passthru) RESults(string asis) 
            [ KFold(integer 1) noall DISplay NAme(string asis) ]

Syntax and options

cmd is the estimation command the user wishes to fit to the data
split(passthru) - specifies the name of the variable used to identify the train/validate/test or KFold splits in the dataset.
results(string asis) - A name to use to store estimation results estimates store.
kfold(integer 1) - An option used to determine if the model needs to be fitted over k subsets of the data.
noall - Is an option to suppress fitting the model to the entire training set when the number of folds in the training set is > 1.
display - An option to show estimation results from all of the models fitted to each of the folds.
name(string asis) - is used to name the collection storing the results; the default is xvfit and only applies to users of Stata 17 and above.

predictit

predictit [anything(name = cmd)], PStub(string asis) 
            [ SPLit(passthru) Classes(integer 0) 
              KFold(integer 1) THReshold(passthru) 
              MODifin(string asis) KFIfin(string asis) noall 
              PMethod(string asis) POpts(string asis) ]

Syntax and options

cmd is the estimation command the user wishes to fit to the data
pstub(string asis) - A variable name to use to store the predicted values following model fitting.
split(passthru) - specifies the name of the variable used to identify the train/validate/test or KFold splits in the dataset.
classes(integer 0) - An option used to determine whether the model is a regression or classification task. This is subsequently passed to the classify program.
kfold(integer 1) - An option used to determine the appropriate subset of data to use for predictions.
threshold(passthru) - An option that is passed to the classify program for predicting class membership in classification tasks.
modifin(string asis) - the modified if expression used to generate the out of sample predictions.
kfifin(string asis) - the modified if expression used to generate the out of sample predictions for the full training sample when using K-Fold cross-validation.
noall - suppresses prediction on the entire training sample when using K-Fold cross-validation.
pmethod(string asis) - the method (statistic) to predict with the out-of-sample/held-out data. Defaults to xb when classes == 0 and pr in all other cases.
popts(string asis) - an option that allows users to pass options to the predict command in addition to the prediction statistic/method.

validateit

validateit, MEtric(string asis) PStub(string asis) SPLit(string asis) 
          [ Obs(string asis) MOnitors(string asis) DISplay KFold(integer 1) 
            noall loo NAme(string asis) ]

Syntax and options

metric(string asis) - specifies the name of the Mata function to use as the validation metric. Options can be passed to the metric, see the help for libxv or validateit to identify which metrics support this and to see the requirements for specifying options.
pstub(string asis) - A variable name to use to store the predicted values following model fitting.
split(passthru) - specifies the name of the variable used to identify the train/validate/test or KFold splits in the dataset.
obs(string asis) - the name of the dependent variable from the model.
monitors(string asis) - this can be a list of functions used to evaluate the model performance on the out of sample data. Options can be passed if the function supports it. See help documentation for additional info.
display - an option to print the monitor and metric values to the console.
kfold(integer 1) - An option used to compute validation metrics and monitors for each fold, as well as the entire training set if the noall option is not passed.
noall - suppresses prediction on the entire training sample when using K-Fold cross-validation.
loo - Is an option used specifically for leave-one-out cross-validation. It computes validation metrics for the entire training split based on the predicted and observed values from each of the folds. If the noall option is omitted it will also compute the validation metrics from a model fitted to the entire training set with predictions made on the validation set, if present, or test set, if no validation set is present.
name(string asis) - is used to name the collection storing the results; the default is xvval and only applies to users of Stata 17 and above.

Utility commands

classify

classify # [if], PStub(string asis) [ THReshold(real 0.5) PMethod(string asis) 
                                      POpts(string asis) ]

Syntax and options

# - This is the number of classes of the outcome variable being modeled. This value must be integer valued.
pstub(string asis) - Specifies a stub name to use to store the predicted classes from the model.
threshold(real 0.5) - Specifies the threshold to use for classification of predicted probabilities in the case of binary outcome models. The value of the threshold must be in (0, 1).
pmethod(string asis) - the method (statistic) to predict with the out-of-sample/held-out data. Defaults to xb when classes == 0 and pr in all other cases.
popts(string asis) - an option that allows users to pass options to the predict command in addition to the prediction statistic/method.

state

state

Syntax and options

No options

cmdmod

cmdmod anything(name = cmd id = "estimation command"), 
        SPLit(varlist min = 1 max = 1) [ KFold(integer 1) ]

Syntax and options

split(varlist min = 1 max = 1) - specifies the name of the variable used to identify the train/validate/test or KFold splits in the dataset that will be used to fit the model to the training set and that predictions will use the validation set.
kfold(integer 1) - specifies the number of cross-validation folds used in the dataset. The default value indicates that K-Fold cross-validation is not being used and the data should be treated like a train/test or train/validation/test split.

libxv

libxv [, DISplay ]

Syntax and options

display - displays the libxv help file after recompiling the Mata library.

wbuchanan / crossvalidate

readme

crossvalidation

Examples:

TODO

libxv

Metrics/Monitors

Method Signature

Data access

Building the library

Prefix Commands

xv

Syntax and options

xvloo

Syntax and options

Phase Specific Commands

splitit

Syntax and options

fitit

Syntax and options

predictit

Syntax and options

validateit

Syntax and options

Utility commands

classify

Syntax and options

state

Syntax and options

cmdmod

Syntax and options

libxv

Syntax and options