mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

custom validation set #401

Open ThomasBourgeois opened 3 years ago

ThomasBourgeois commented 3 years ago

Great lib! It would be even greater with the possibility to input a custom validation set. In my case, the training points are correlated, so I can use neither cross-validation nor a validation split, with or without shuffling. I need to provide a dedicated validation set that is "far" in time from the training data points.

pplonski commented 3 years ago

@ThomasBourgeois it is possible with a custom validation strategy. Sorry that there are no proper docs with examples for this yet. It is quite a fresh feature. I will add docs for it.

The example below generates a dataset with 100 samples. The first 75 samples are used for training, and the last 25 samples are used for validation.

import numpy as np
import pandas as pd
from sklearn import datasets
from supervised.automl import AutoML

X, y = datasets.make_regression(
    n_samples=100,
    n_features=10,
    n_informative=5,
    random_state=0,
)

X = pd.DataFrame(X)

train_indices = np.array(X.index[:75])
test_indices = np.array(X.index[75:])

print("train indices:", train_indices)
print("test indices:", test_indices)

cv = [(train_indices, test_indices)]

automl = AutoML(validation_strategy={"validation_type": "custom"})
automl.fit(X, y, cv=cv)

The output of the script:

train indices: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
 72 73 74]
test indices: [75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98
 99]
AutoML directory: AutoML_2
The task is regression with evaluation metric rmse
AutoML will use algorithms: ['Baseline', 'Linear', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 3 models
Custom validation strategy
Split 0.
Train 75 samples.
Validation 25 samples.
1_Baseline rmse 242.853446 trained in 0.2 seconds
2_DecisionTree rmse 191.399131 trained in 3.74 seconds
3_Linear rmse 0.0 trained in 2.98 seconds
* Step default_algorithms will try to check up to 3 models
4_Default_Xgboost rmse 127.245958 trained in 4.54 seconds
5_Default_NeuralNetwork rmse 31.910755 trained in 1.66 seconds
6_Default_RandomForest rmse 192.98627 trained in 1.89 seconds
* Step ensemble will try to check up to 1 model
Ensemble rmse 0.0 trained in 0.16 seconds

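The main idea is to set validation_strategy={"validation_type": "custom"} in the AutoML constructor and to pass cv, a list of (train indices, validation indices) tuples, to fit().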

Please let me know if you need more examples.
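
For the original use case, a validation set that is "far" in time from the training rows, one possible way to build the index arrays is sketched below. This is a minimal sketch, not an mljar-supervised feature: the helper time_gap_split, its n_train and n_gap parameters, and the synthetic timestamps are assumptions made for illustration. It also assumes X has the default 0..n-1 index, so row positions and index labels coincide.

```python
import numpy as np


def time_gap_split(timestamps, n_train, n_gap):
    """Return one (train_indices, validation_indices) pair with a time gap.

    Hypothetical helper, not part of mljar-supervised. The n_train oldest
    rows are used for training, the next n_gap rows are dropped as a
    buffer, and all later rows are used for validation.
    """
    order = np.argsort(timestamps, kind="stable")  # row positions, oldest first
    if n_train <= 0 or n_gap < 0 or n_train + n_gap >= len(order):
        raise ValueError("n_train and n_gap leave no rows for training or validation")
    return order[:n_train], order[n_train + n_gap:]


# Illustrative data: 100 rows whose timestamps 0..99 are stored in shuffled order.
rng = np.random.default_rng(0)
timestamps = rng.permutation(100)

train_indices, validation_indices = time_gap_split(timestamps, n_train=60, n_gap=15)

# Same shape as in the example above: a list with one (train, validation) tuple.
cv = [(train_indices, validation_indices)]

print("train rows:", len(train_indices))
print("validation rows:", len(validation_indices))
```

The resulting cv has the same form as in the script above, so it could be passed to fit(X, y, cv=cv) in the same way. The size of the gap is a modelling choice that depends on how long the correlation between neighbouring points lasts.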

ThomasBourgeois commented 3 years ago

Thanks for your answer! The example is quite straightforward.

Maybe one comment: 'cv' is often used in libraries such as scikit-learn as short for 'cross validation' rather than 'custom validation' (e.g. cross_val_score(clf, X, y, cv=5)). This may be a bit confusing here at first.

And yes, putting it into the docs would be great! Thanks again for the great lib!

pplonski commented 3 years ago

The cv parameter was added to be similar to the one in scikit-learn. Take a look at cv in GridSearchCV: you can pass an integer or an iterable/generator. In MLJAR we don't support an integer, but an iterable/generator should work. I will add this to the docs as well.

ThomasBourgeois commented 3 years ago

OK, got it. I thought it meant custom validation. I understand now that you can pass multiple tuples and do a cross-validation, which is why you call it cv!
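
To illustrate the "multiple tuples" point, here is a hedged sketch of a walk-forward scheme in which every fold validates on a later block of rows than it trains on. The helper walk_forward_splits and its parameters (n_folds, validation_size, gap) are hypothetical names, not part of mljar-supervised, and the sketch assumes the rows are already sorted by time.

```python
import numpy as np


def walk_forward_splits(n_samples, n_folds=3, validation_size=20, gap=5):
    """Build a list of (train_indices, validation_indices) tuples.

    Hypothetical helper. Validation blocks sit back to back at the end of
    the data; each fold trains on every row before its validation block,
    except for the last `gap` rows, which are left out as a buffer.
    """
    splits = []
    for fold in range(n_folds):
        val_start = n_samples - (n_folds - fold) * validation_size
        val_end = val_start + validation_size
        train_end = val_start - gap
        if train_end <= 0:
            raise ValueError("not enough rows before the first validation block")
        splits.append((np.arange(train_end), np.arange(val_start, val_end)))
    return splits


cv = walk_forward_splits(n_samples=100)

for train_indices, validation_indices in cv:
    print(
        "train rows:", len(train_indices),
        "validation rows:", validation_indices[0], "to", validation_indices[-1],
    )
```

A list built this way has one tuple per fold, which is the iterable form described above for the cv argument of fit().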

andrew-zaborenko commented 3 years ago

Hi! Thanks a lot for all the time you've put into this project. It's awesome!

I was wondering if it is possible to pass the weight vector of the validation dataset in this custom validation strategy. I have a train dataset which consists of features_train, labels_train, weights_train and a test dataset with features_test, labels_test, weights_test. I can merge features_train and features_test into one DataFrame and pass their respective indices to cv; however, I'm a bit stuck on what I am supposed to do with the weight vectors. Should I concatenate weights_train and weights_test into a single array and pass it to the sample_weight parameter?

Thanks again for this great package!

pplonski commented 3 years ago

@andrew-zaborenko you are right! Just concatenate training and testing weights and pass them as sample_weight. It will work.

Here is the code that splits sample_weight in the custom validation: https://github.com/mljar/mljar-supervised/blob/a0846e5717c6ecef7b1c61689620f20b3569096e/supervised/validation/validator_custom.py#L97

When passing features and weights, please make sure that they have the same indices. Please let me know if it works for you.
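
A quick sanity check before calling fit() can catch misaligned inputs. The sketch below is only an illustration under stated assumptions: check_custom_cv is a hypothetical helper, not an mljar-supervised function, and the small weight arrays are made-up data. It verifies that there is one weight per row, that all fold indices are in range, and that the train and validation rows of each fold do not overlap.

```python
import numpy as np


def check_custom_cv(n_samples, cv, sample_weight=None):
    """Raise ValueError if the folds or the weights look inconsistent.

    Hypothetical helper, not part of mljar-supervised.
    """
    if sample_weight is not None and len(sample_weight) != n_samples:
        raise ValueError("sample_weight must have one entry per row")
    for fold, (train_indices, validation_indices) in enumerate(cv):
        train_indices = np.asarray(train_indices)
        validation_indices = np.asarray(validation_indices)
        if train_indices.size == 0 or validation_indices.size == 0:
            raise ValueError(f"fold {fold}: empty train or validation set")
        both = np.concatenate((train_indices, validation_indices))
        if both.min() < 0 or both.max() >= n_samples:
            raise ValueError(f"fold {fold}: index out of range")
        if np.intersect1d(train_indices, validation_indices).size > 0:
            raise ValueError(f"fold {fold}: train and validation rows overlap")


# Made-up weights standing in for weights_train and weights_test.
weights_train = np.array([1.0, 2.0, 1.0])
weights_test = np.array([0.5, 0.5])
weights = np.concatenate((weights_train, weights_test))

n_train = len(weights_train)
n_total = len(weights)
cv = [(np.arange(n_train), np.arange(n_train, n_total))]

check_custom_cv(n_total, cv, sample_weight=weights)
print("folds and weights are consistent")
```

If the features are kept in a pandas DataFrame and the weights in a Series, comparing their index objects directly would be an additional check worth making.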

andrew-zaborenko commented 3 years ago

Thank you for your quick reply! I tried what you described and it worked. Here is my code, in case anyone else has their dataset prepared in a similar manner (I work with particle physics data):

import numpy as np
from supervised.automl import AutoML

features = np.concatenate((features_train, features_test), axis=0)
labels = np.concatenate((labels_train, labels_test), axis=0)
weights = np.concatenate((weights_train, weights_test), axis=0)

# Train rows come first in the concatenated arrays, test rows follow.
train_indices = np.arange(len(features_train))
test_indices = np.arange(len(features_train), len(features))

cv = [(train_indices, test_indices)]

path = 'binary_classification_Explain_cv'

automl = AutoML(
    mode='Explain',
    ml_task='binary_classification',
    results_path=path,
    total_time_limit=10*3600,
    explain_level=2,
    validation_strategy={"validation_type": "custom"},
)
automl.fit(X=features, y=labels, sample_weight=weights, cv=cv)

The output:

Linear algorithm was disabled.
AutoML directory: binary_classification_Explain_cv
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'ensemble']
* Step simple_algorithms will try to check up to 2 models
Custom validation strategy
Split 0.
Train 137036 samples.
Validation 137446 samples.
1_Baseline logloss 0.693147 trained in 2.58 seconds
2_DecisionTree logloss 0.603379 trained in 65.04 seconds
* Step default_algorithms will try to check up to 3 models
3_Default_Xgboost logloss 0.505556 trained in 213.25 seconds
4_Default_NeuralNetwork logloss 0.547975 trained in 234.69 seconds
5_Default_RandomForest logloss 0.573356 trained in 205.25 seconds
* Step ensemble will try to check up to 1 model
Ensemble logloss 0.504137 trained in 36.46 seconds
An input array is constant; the correlation coefficent is not defined.
AutoML fit time: 851.69 seconds
AutoML best model: Ensemble

P.S. I think you should mark "Add weights vector" as completed in your roadmap here: https://supervised.mljar.com/roadmap/ :)