Dropping some reference notes here, quoting:
The general procedure is as follows:
- Shuffle the dataset randomly.
- Split the dataset into k groups
- For each unique group:
- Take the group as a hold out or test data set
- Take the remaining groups as a training data set
- Fit a model on the training set and evaluate it on the test set
- Retain the evaluation score and discard the model
- Summarize the skill of the model using the sample of model evaluation scores
It is also important that any preparation of the data prior to fitting the model occur on the CV-assigned training dataset within the loop rather than on the broader data set. This also applies to any tuning of hyperparameters.
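As a concrete illustration of that point, here is a minimal plain-Rust sketch (the helper names are made up, not SmartCore API): any preprocessing statistic, such as a centering mean, is fitted on the training fold only and merely applied to the held-out fold.
// Hypothetical helpers: the centering mean is fitted on the training fold only
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

fn center_fold(train: &[f64], test: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let mu = mean(train); // fit the preprocessing on the training data...
    let apply = |xs: &[f64]| xs.iter().map(|x| x - mu).collect::<Vec<f64>>();
    (apply(train), apply(test)) // ...and only apply it to the hold-out data
}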
Having 6 observations split into 3 folds (k=3 CV):
Three models are trained and evaluated with each fold given a chance to be the held out test set. For example:
- Model1: Trained on Fold1 + Fold2, Tested on Fold3
- Model2: Trained on Fold2 + Fold3, Tested on Fold1
- Model3: Trained on Fold1 + Fold3, Tested on Fold2

The models are then discarded after they are evaluated as they have served their purpose.
In scikit-learn:
# imports needed to run the sample
from numpy import array
from sklearn.model_selection import KFold
# data sample
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# prepare KFold
kfold = KFold(
    n_splits=3,      # number of folds
    shuffle=True,    # perform shuffle
    random_state=1   # seed for pseudo-random shuffling
)
# enumerate splits
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
Notes for a possible implementation in model_selection:
/// Entities needed by the library:
/// * A type for a `Vec` of K models to be compared (an array that
/// accepts any type that implements `predict()`?; there is no type
/// that encompasses all the models)
/// * A trait (KFold) with a `cross_validate()` method
///
/// This would allow us to define a `Vec` of models to be passed to the
/// KFold cross-validation so as to provide the procedure
///
/// Entities involved in the KFold procedure:
/// * a vector of models
/// * a dataset
/// * a number k of groups to use
///
/// Procedure in `cross_validate()`:
/// 1. Shuffle the dataset randomly.
/// 2. Split the dataset into k groups
/// 3. For each unique group (may use Rayon?):
/// 1. Take the group as a hold out or test data set
/// 2. Take the remaining groups as a training data set
/// 3. Fit a model on the training set and evaluate it on the test set
/// 4. Retain the evaluation score and discard the model
/// 4. Summarize the skill of the model using the sample of model evaluation scores
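To make steps 1-3 concrete, a rough self-contained sketch in plain Rust (the helper name is made up, not SmartCore API; the shuffle is left out and would use the `rand` crate; the first `n_samples % k` groups get one extra sample, which as far as I can tell matches scikit-learn's KFold):
// Hypothetical helper, not SmartCore API: returns the k (train, test) index pairs.
fn k_fold_indices(n_samples: usize, k: usize) -> Vec<(Vec<usize>, Vec<usize>)> {
    // Step 1 (shuffle) is omitted here; a real implementation would permute
    // `indices` with the `rand` crate before splitting.
    let indices: Vec<usize> = (0..n_samples).collect();

    // Step 2: split into k groups; the first n_samples % k groups get one
    // extra sample so that the group sizes differ by at most one.
    let (base, extra) = (n_samples / k, n_samples % k);
    let mut folds: Vec<Vec<usize>> = Vec::with_capacity(k);
    let mut start = 0;
    for i in 0..k {
        let size = base + if i < extra { 1 } else { 0 };
        folds.push(indices[start..start + size].to_vec());
        start += size;
    }

    // Steps 3.1/3.2: each group takes a turn as the hold-out test set, while
    // the remaining groups are concatenated into the training set.
    (0..k)
        .map(|i| {
            let test = folds[i].clone();
            let train = folds
                .iter()
                .enumerate()
                .filter(|(j, _)| *j != i)
                .flat_map(|(_, f)| f.iter().copied())
                .collect();
            (train, test)
        })
        .collect()
}
Steps 3.3, 3.4 and 4 (fit, score, discard, summarize) would then happen in the loop that consumes these pairs, whether that loop lives in `cross_validate()` or in user code.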
Thanks for sharing your notes, @Mec-iS!
I agree that we need a Trait that encompasses all the models, something like a Predictor (or, maybe, Classifier/Regressor) with a single method predict. We might start thinking about it and discuss it in this issue.
On the other hand, we do not need to go ahead with this new Trait (or Traits) right now, because one easy way to bring k-fold cross-validation into SmartCore is to implement a class similar to Scikit's k-fold. This class represents an iterator over k splits of the data. Having this iterator, cross-validation becomes easier and anyone can implement it as a simple for loop. We can stop here, or we can later define an independent function cross_validate that takes an instance of the iterator along with an estimator and a metric function to run CV and measure the estimated test error.
I think it is important to keep the function cross_validate separated from KFold because we might have multiple ways to split data into k folds, implemented as separate classes, and the logic in cross_validate can be easily detached from these implementations.
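Roughly, a signature-level sketch (placeholder names, not SmartCore API; the estimator and the metric function are folded into a single closure here just to keep it short): cross_validate only depends on something that yields (train, test) index pairs.
// Sketch only: any splitting strategy that yields (train, test) index pairs
// can be plugged in without touching this function.
fn cross_validate<S, F>(splits: S, mut fit_and_score: F) -> Vec<f64>
where
    S: IntoIterator<Item = (Vec<usize>, Vec<usize>)>,
    F: FnMut(&[usize], &[usize]) -> f64, // fit on the train rows, return the metric on the test rows
{
    splits
        .into_iter()
        .map(|(train, test)| fit_and_score(train.as_slice(), test.as_slice()))
        .collect()
}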
> This class represents an iterator over k splits of the data. Having this iterator, cross-validation becomes easier and anyone can implement it as a simple for loop.
Let's check we are on the same page: something like this, right?
/// src/model_selection/mod.rs
trait BaseKFold {
    /// Return an iterator over (train, test) tuples: the training set indices
    /// for that split and the testing set indices for that split.
    fn split(&self, x: &Matrix) -> Box<dyn Iterator<Item = (Vec<usize>, Vec<usize>)>>;
    /// Return an iterator over the integer indices corresponding to the test sets.
    fn test_indices(&self, x: &Matrix) -> Box<dyn Iterator<Item = Vec<usize>>>;
    /// Return an iterator over the matrices corresponding to the test sets.
    fn test_matrices(&self, x: &Matrix) -> Box<dyn Iterator<Item = Matrix>>;
}
struct KFold {
    n_splits: i32,
    shuffle: bool,
    random_state: i32,
}
impl BaseKFold for KFold {
    ...
}
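And usage would then be the simple for loop you describe (hypothetical, since the impl above is still elided and there is no common model trait yet):
// Hypothetical usage of the trait above; `x` is the dataset and the fit/score
// steps are left as comments because a common model trait does not exist yet.
let kfold = KFold { n_splits: 3, shuffle: true, random_state: 1 };
let mut scores: Vec<f64> = Vec::new();
for (train_idx, test_idx) in kfold.split(&x) {
    // fit a model on the rows selected by `train_idx`,
    // evaluate it on the rows selected by `test_idx`,
    // push the score into `scores` and drop the model
}
// summarize `scores`, e.g. with the mean and standard deviation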
Let me know, then I'll open a PR so we can add the implementation bit by bit.
Looks good to me!
More things to come:
- derive-builder for defaults
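For the defaults, a sketch of what derive_builder could look like on the KFold struct (the default values below are placeholders, not agreed-upon choices):
use derive_builder::Builder;

// Sketch only: the default values are placeholders, not agreed-upon choices.
#[derive(Builder, Debug)]
struct KFold {
    #[builder(default = "3")]
    n_splits: i32,
    #[builder(default = "true")]
    shuffle: bool,
    #[builder(default = "1")]
    random_state: i32,
}

fn main() {
    // unset fields fall back to their #[builder(default = ...)] expressions
    let kfold = KFoldBuilder::default().n_splits(5).build().unwrap();
    println!("{:?}", kfold);
}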
K-fold cross-validation (CV) is a preferred way to evaluate the performance of a statistical model. CV is better than just splitting the dataset into training/test sets because we use as many data samples for validation as we can get from a single dataset, thus improving the estimate of the out-of-sample error.
SmartCore does not have a method for CV, and this is a shame because any good ML framework must have it.
I think we could start with a simple replica of Scikit's sklearn.model_selection.KFold. Later on we can add a replica of StratifiedKFold.
If you are not familiar with CV, I would start by reading about it here and here. Next, I would look at Scikit's implementation and design a function or a class that does the same for SmartCore.
We do not have to reproduce the KFold class exactly; one way to do it is to write an iterator that spits out K pairs of (train, test) sets. Also, it might be helpful to see how the train/test split is implemented in SmartCore.