Dropping some reference notes here, quoting:
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
- Take the group as a hold out or test data set
- Take the remaining groups as a training data set
- Fit a model on the training set and evaluate it on the test set
- Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
It is also important that any preparation of the data prior to fitting the model occur on the CV-assigned training dataset within the loop rather than on the broader data set. This also applies to any tuning of hyperparameters.
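As a concrete illustration of that point, here is a minimal plain-Rust sketch (the helper names are made up, not SmartCore API): any preprocessing statistic, such as a centering mean, is fitted on the training fold only and merely applied to the held-out fold.
// Hypothetical helpers: the centering mean is fitted on the training fold only
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn center_fold(train: &[f64], test: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let mu = mean(train); // fit the preprocessing on the training data...
    let apply = |xs: &[f64]| xs.iter().map(|x| x - mu).collect::<Vec<f64>>();
    (apply(train), apply(test)) // ...and only apply it to the hold-out data
}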
Having 6 observations split into 3 folds (k=3 CV):
Three models are trained and evaluated with each fold given a chance to be the held out test set. For example:
- Model1: Trained on Fold1 + Fold2, Tested on Fold3
- Model2: Trained on Fold2 + Fold3, Tested on Fold1
- Model3: Trained on Fold1 + Fold3, Tested on Fold2

The models are then discarded after they are evaluated as they have served their purpose.
In scikit-learn:
# imports needed to run the sample
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare KFold
kfold = KFold(
    n_splits=3,      # number of folds
    shuffle=True,    # perform shuffle
    random_state=1   # seed for pseudo-random shuffling
)
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
Notes for a possible implementation in model_selection:
/// Entities needed by the library:
/// * A type for a `Vec` of K models to be compared (an array that
/// accepts any type that implements `predict()`?; there is no type
/// that encompasses all the models)
/// * A trait (KFold) with a `cross_validate()` method
///
/// This would allow us to define a `Vec` of models to be passed to the
/// KFold cross-validation so as to provide the procedure
///
/// Entities involved in the KFold procedure:
/// * a vector of models
/// * a dataset
/// * a number k of groups to use
///
/// Procedure in `cross_validate()`:
/// 1. Shuffle the dataset randomly.
/// 2. Split the dataset into k groups
/// 3. For each unique group (may use Rayon?):
/// 1. Take the group as a hold out or test data set
/// 2. Take the remaining groups as a training data set
/// 3. Fit a model on the training set and evaluate it on the test set
/// 4. Retain the evaluation score and discard the model
/// 4. Summarize the skill of the model using the sample of model evaluation scores
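To make steps 1-3 concrete, a rough self-contained sketch in plain Rust (the helper name is made up, not SmartCore API; the shuffle is left out and would use the `rand` crate; the first `n_samples % k` groups get one extra sample, which as far as I can tell matches scikit-learn's KFold):
// Hypothetical helper, not SmartCore API: returns the k (train, test) index pairs.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    // Step 1 (shuffle) is omitted here; a real implementation would permute
    // `indices` with the `rand` crate before splitting.
    let indices: Vec<usize> = (0..n_samples).collect();

    // Step 2: split into k groups; the first n_samples % k groups get one
    // extra sample so that the group sizes differ by at most one.
    let (base, extra) = (n_samples / k, n_samples % k);
    let mut folds: Vec<Vec<usize>> = Vec::with_capacity(k);
    let mut start = 0;
    for i in 0..k {
        let size = base + if i < extra { 1 } else { 0 };
        folds.push(indices[start..start + size].to_vec());
        start += size;
    }

    // Steps 3.1/3.2: each group takes a turn as the hold-out test set, while
    // the remaining groups are concatenated into the training set.
    (0..k)
        .map(|i| {
            let test = folds[i].clone();
            let train = folds
                .iter()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .flat_map(|(_, f)| f.iter().copied())
                .collect();
            (train, test)
        })
        .collect()
}
Steps 3.3, 3.4 and 4 (fit, score, discard, summarize) would then happen in the loop that consumes these pairs, whether that loop lives in `cross_validate()` or in user code.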
Thanks for sharing your notes, @Mec-iS!
I agree that we need a Trait that encompasses all the models, something like a Predictor (or, maybe, Classifier/Regressor) with a single method predict. We might start thinking about it and discuss it in this issue.
On the other hand, we do not need to go ahead with this new Trait (or Traits) right now, because one easy way to bring k-fold cross-validation into SmartCore is to implement a class similar to Scikit's k-fold. This class represents an iterator over k splits of the data. Having this iterator, cross-validation becomes easier and anyone can implement it as a simple for loop. We can stop here, or we can later define an independent function cross_validate that takes an instance of the iterator along with an estimator and a metric function to run CV and measure the estimated test error.
I think it is important to keep the function cross_validate separated from KFold because we might have multiple ways to split data into k folds, implemented as separate classes, and the logic in cross_validate can be easily detached from these implementations.
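Roughly, a signature-level sketch (placeholder names, not SmartCore API; the estimator and the metric function are folded into a single closure here just to keep it short): cross_validate only depends on something that yields (train, test) index pairs.
// Sketch only: any splitting strategy that yields (train, test) index pairs
// can be plugged in without touching this function.
fn cross_validate<S, F>(splits: S, mut fit_and_score: F) -> Vec<f64>
where
    S: IntoIterator<Item = (Vec<usize>, Vec<usize>)>,
    F: FnMut(&[usize], &[usize]) -> f64, // fit on the train rows, return the metric on the test rows
{
    splits
        .into_iter()
        .map(|(train, test)| fit_and_score(train.as_slice(), test.as_slice()))
        .collect()
}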
> This class represents an iterator over k splits of the data. Having this iterator, cross-validation becomes easier and anyone can implement it as a simple for loop.
Let's check we are on the same page: something like this, right?
/// src/model_selection/mod.rs
trait BaseKFold {
    /// Return an iterator over (train, test) tuples: the training set indices
    /// for that split and the testing set indices for that split.
    fn split(&self, x: &Matrix) -> Box<dyn Iterator<Item = (Vec<usize>, Vec<usize>)>>;
    /// Return an iterator over the integer indices corresponding to the test sets.
    fn test_indices(&self, x: &Matrix) -> Box<dyn Iterator<Item = Vec<usize>>>;
    /// Return an iterator over the matrices corresponding to the test sets.
    fn test_matrices(&self, x: &Matrix) -> Box<dyn Iterator<Item = Matrix>>;
}
struct KFold {
    n_splits: i32,
    shuffle: bool,
    random_state: i32,
}
impl BaseKFold for KFold {
    ...
}
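And usage would then be the simple for loop you describe (hypothetical, since the impl above is still elided and there is no common model trait yet):
// Hypothetical usage of the trait above; `x` is the dataset and the fit/score
// steps are left as comments because a common model trait does not exist yet.
let kfold = KFold { n_splits: 3, shuffle: true, random_state: 1 };
let mut scores: Vec<f64> = Vec::new();
for (train_idx, test_idx) in kfold.split(&x) {
    // fit a model on the rows selected by `train_idx`,
    // evaluate it on the rows selected by `test_idx`,
    // push the score into `scores` and drop the model
}
// summarize `scores`, e.g. with the mean and standard deviation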
Let me know, then I'll open a PR so we can add the implementation bit by bit.
Looks good to me!
More things to come:
- derive-builder for defaults
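For the defaults, a sketch of what derive_builder could look like on the KFold struct (the default values below are placeholders, not agreed-upon choices):
use derive_builder::Builder;

// Sketch only: the default values are placeholders, not agreed-upon choices.
#[derive(Builder, Debug)]
struct KFold {
    #[builder(default = "3")]
    n_splits: i32,
    #[builder(default = "true")]
    shuffle: bool,
    #[builder(default = "1")]
    random_state: i32,
}

fn main() {
    // unset fields fall back to their #[builder(default = ...)] expressions
    let kfold = KFoldBuilder::default().n_splits(5).build().unwrap();
    println!("{:?}", kfold);
}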
K-fold cross-validation (CV) is a preferred way to evaluate the performance of a statistical model. CV is better than just splitting the dataset into training/test sets because we use as many data samples for validation as we can get from a single dataset, thus improving the estimate of the out-of-sample error.
SmartCore does not have a method for CV, and this is a shame because any good ML framework must have it.
I think we could start with a simple replica of Scikit's sklearn.model_selection.KFold. Later on we can add a replica of StratifiedKFold.
If you are not familiar with CV, I would start by reading about it here and here. Next, I would look at Scikit's implementation and design a function or a class that does the same for SmartCore.
We do not have to reproduce the KFold class exactly; one way to do it is to write an iterator that spits out K pairs of (train, test) sets. Also, it might be helpful to see how the train/test split is implemented in SmartCore.