A Python machine learning library built on object-oriented design principles. The goal is to let users quickly explore data and search for the top machine learning algorithm candidates for a given dataset.
Note: the flow assumes that each base model has already been cross-validated to choose its best hyper-parameters, so, at least in v1, the stacker will not search for parameters. Nested cross-validation could be added in the future.
1. Partition the training data into five test folds (note: the number of folds, 5, could be refactored into a parameter)
2. Create a dataset called train_meta with the same row IDs and fold IDs as the training dataset, with empty
columns M1 and M2.
Similarly, create a dataset called test_meta with the same row IDs as the test dataset and empty columns
M1 and M2 (NOTE: this will happen in the `_predict` function)
3. For each test fold
3.1 Combine the other four folds to be used as a training fold
3.2 For each base model (with its chosen hyper-params)
3.2.1 Fit the base model to the training fold and make predictions on the test fold.
Store these predictions in train_meta to be used as features for the stacking model
NOTE: the model-specific transformations will also have to be applied here
4. Fit each base model to the full training dataset and make predictions on the test dataset.
Store these predictions inside test_meta
NOTE: these predictions will be made as part of the `_predict` function
5. Fit a new model S (i.e. the stacking model) to train_meta, using M1 and M2 as features.
Optionally, include other features from the original training dataset or engineered features.
6. Use the stacking model S to make final predictions on test_meta
NOTE: this will happen in `_predict`
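The steps above can be sketched end-to-end with scikit-learn. This is a minimal illustration, not the library's implementation: the toy dataset, the two base models standing in for M1 and M2, and the use of `predict_proba` outputs as meta-features are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Toy stand-ins for the real data (assumption: a test set is withheld)
X, y = make_classification(n_samples=200, random_state=42)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# Two pre-tuned base models, producing meta-features M1 and M2
base_models = [LogisticRegression(max_iter=1000),
               RandomForestClassifier(random_state=42)]

# Steps 1-3: out-of-fold predictions fill train_meta
kf = KFold(n_splits=5, shuffle=True, random_state=42)
train_meta = np.zeros((len(X_train), len(base_models)))
for train_idx, test_idx in kf.split(X_train):
    for m, model in enumerate(base_models):
        model.fit(X_train[train_idx], y_train[train_idx])
        train_meta[test_idx, m] = model.predict_proba(X_train[test_idx])[:, 1]

# Step 4: refit each base model on the full training set; predict on the test set
test_meta = np.column_stack(
    [model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
     for model in base_models]
)

# Steps 5-6: fit the stacking model S on train_meta, then predict on test_meta
stacker = LogisticRegression()
stacker.fit(train_meta, y_train)
final_predictions = stacker.predict(test_meta)
print(train_meta.shape, test_meta.shape, final_predictions.shape)
```

Using out-of-fold predictions for train_meta (rather than predictions from models fit on the full training set) is what keeps the stacking model from simply memorizing leaked labels.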
Other Requirements
[x] show resampling (e.g. cross-validation) results for each model, for one or more scores
[x] show the correlation matrix between the predictor columns of train_meta
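The correlation-matrix requirement can be sketched with pandas on synthetic meta-features (the generated M1/M2 values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical out-of-fold predictions for two base models (M1, M2)
rng = np.random.default_rng(0)
m1 = rng.normal(size=100)
m2 = 0.8 * m1 + 0.2 * rng.normal(size=100)  # deliberately correlated with M1
train_meta = pd.DataFrame({"M1": m1, "M2": m2})

# Pearson correlation between the predictor columns of train_meta
corr = train_meta.corr()
print(corr)
```

Highly correlated base-model predictions suggest the models are making similar mistakes, so stacking them is likely to add less value than stacking diverse models.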
Overview
v1 stacking will be simple and based on http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
(Suppose we use ModelFitter.) The `data_x` passed in is the full training set; we assume a test set is withheld. The steps are adapted from http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/
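One way the flow might map onto a fitter-style class is sketched below. Everything here beyond the `_predict` name is an assumption: the class name `SimpleStacker`, its constructor arguments, and the toy regression data are illustrative only, and base models are assumed pre-tuned (no parameter search), per the note above.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold


class SimpleStacker:
    """Illustrative skeleton only; base models are assumed pre-tuned."""

    def __init__(self, base_models, stacking_model, n_folds=5):
        self.base_models = base_models
        self.stacking_model = stacking_model
        self.n_folds = n_folds

    def fit(self, data_x, data_y):
        # Build train_meta from out-of-fold predictions, then fit S on it.
        kf = KFold(n_splits=self.n_folds, shuffle=True, random_state=42)
        train_meta = np.zeros((len(data_x), len(self.base_models)))
        for train_idx, test_idx in kf.split(data_x):
            for m, model in enumerate(self.base_models):
                fitted = clone(model).fit(data_x[train_idx], data_y[train_idx])
                train_meta[test_idx, m] = fitted.predict(data_x[test_idx])
        # Refit each base model on the full training set for use in _predict.
        self.fitted_base_models_ = [clone(m).fit(data_x, data_y)
                                    for m in self.base_models]
        self.stacking_model.fit(train_meta, data_y)
        return self

    def _predict(self, test_x):
        # Build test_meta from the fully fitted base models, then predict with S.
        test_meta = np.column_stack(
            [m.predict(test_x) for m in self.fitted_base_models_]
        )
        return self.stacking_model.predict(test_meta)


# Hypothetical usage on toy regression data
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=80)

stacker = SimpleStacker(
    base_models=[LinearRegression(), DecisionTreeRegressor(random_state=0)],
    stacking_model=LinearRegression(),
)
stacker.fit(X[:60], y[:60])
preds = stacker._predict(X[60:])
```

Keeping meta-feature construction inside `fit` and deferring test_meta to `_predict` matches the split described in the steps above.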