mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Grouped Time Series Validation #315

Closed Jhixx24 closed 3 years ago

Jhixx24 commented 3 years ago
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from supervised.automl import AutoML # mljar-supervised

def correlation(predictions, targets):
    ranked_preds = predictions.rank(pct=True, method="first")
    return np.corrcoef(ranked_preds, targets)[0, 1]

def score_2(df):
    return correlation(df[PREDICTION_NAME], df[TARGET_NAME])

TARGET_NAME = "target"
PREDICTION_NAME = "prediction"

training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")
tournament_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz")
training_data = training_data.drop('Unnamed: 0',axis=1)
tournament_data = tournament_data.drop('Unnamed: 0',axis=1)
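# Keep the first 5% of rows as smaller working copies (.copy() so columns can be added later).
training_data2 = training_data[0:int(len(training_data) * 0.05)].copy()
tournament_data2 = tournament_data[0:int(len(tournament_data) * 0.05)].copy()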
training_data2
tournament_data2
#############################################################################################
eras = training_data["era"]
#############################################################################################
features = [f for f in training_data2.columns if f.startswith("feature_intelligence1")]
X = training_data[features]
X2 = tournament_data2[features]  # predict on the 5% subset, since predictions are stored in tournament_data2 below
y = training_data[TARGET_NAME]

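# mode is one of "Explain", "Perform", "Compete"
# NOTE: `key` is not defined anywhere; it marks the place where an era-aware validation strategy is wanted.
automl = AutoML(mode="Explain", algorithms=["Xgboost"], validation_strategy=key)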
automl.fit(X, y)

predictions = automl.predict(X2)

pd.Series(predictions)
tournament_data2[PREDICTION_NAME] = predictions
tournament_data2

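# The prediction column was written to tournament_data2, so the validation rows are taken from it.
validation_data = tournament_data2[tournament_data2.data_type == "validation"]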
validation_correlations = validation_data.groupby("era").apply(score_2)
validation_correlations

#############################################################################################
from sklearn.model_selection._split import _BaseKFold
from sklearn.utils import indexable
from sklearn.utils.validation import _num_samples

class TimeSeriesSplitGroups(_BaseKFold):
    """Time-series split that keeps every group (era) intact."""

    def __init__(self, n_splits=5):
        super().__init__(n_splits, shuffle=False, random_state=None)

    def split(self, X, y=None, groups=None):
        # `groups` must be a pandas Series (e.g. the "era" column), because .isin is used below.
        X, y, groups = indexable(X, y, groups)
        n_samples = _num_samples(X)
        n_splits = self.n_splits
        n_folds = n_splits + 1
        group_list = np.unique(groups)
        n_groups = len(group_list)
        if n_folds > n_groups:
            raise ValueError(
                ("Cannot have number of folds ={0} greater"
                 " than the number of groups: {1}.").format(n_folds, n_groups))
        indices = np.arange(n_samples)
        test_size = n_groups // n_folds
        test_starts = range(test_size + n_groups % n_folds, n_groups, test_size)
        test_starts = list(test_starts)[::-1]
        for test_start in test_starts:
            # Train on every group before test_start, test on the next test_size groups.
            yield (indices[groups.isin(group_list[:test_start])],
                   indices[groups.isin(group_list[test_start:test_start + test_size])])
pplonski commented 3 years ago

Hi @Jhixx24! I know Numerai data quite well.

This type of cross-validation can be added, but it is hard to say when. Would you like to follow an example from Numerai's forum, or do you have your own idea for feature engineering that would work with this type of validation?

Going back to the data itself: please run it with 10-fold CV and it should work quite well (better than the example model provided by Numerai, as far as I remember). You will need to run it on a decent machine for at least 12 hours. Below is example code showing how to use MLJAR AutoML:

import pandas as pd
from supervised.automl import AutoML

train = pd.read_csv("numerai_training_data.csv")
x_cols = [f for f in train.columns if "feature" in f]
y_col = "target"

automl = AutoML(
    ml_task="regression",
    mode="Compete",
    total_time_limit=12 * 60 * 60,  # 12 hours, in seconds
)
automl.fit(train[x_cols], train[y_col])
Jhixx24 commented 3 years ago

Hi. I am trying to run AutoML, but I cannot get it to split the data according to the different "eras" (periods) in the dataset. I have tried 5-fold CV, but it splits and shuffles the rows without respecting the eras. I have added a TimeSeriesSplitGroups class from a different script that selects the appropriate era data, but I do not know how to use it with your code. Any assistance would be appreciated.

Jhixx24 commented 3 years ago

So the current implementation of CV will work? I ask because I suspect it will not cut the data at the right places.

pplonski commented 3 years ago

OK, got it. My point is that you don't need to respect "eras" when doing validation. You can simply shuffle samples from different eras and train AutoML with 5-fold or 10-fold CV. I'm using this approach; MLJAR AutoML is part of my ensemble. My performance metric is below:

[image: screenshot of the performance metric]
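The shuffled k-fold setup described in this comment could be configured roughly as follows. This is a sketch, not the configuration actually used in the thread: the `validation_strategy` keys follow the k-fold options documented for mljar-supervised, while the fold count, the seed, the toy era array, and the helper `shuffled_kfold_indices` are illustrative assumptions. The helper only mimics shuffled k-fold with NumPy to show that each validation fold mixes rows from many eras.

```python
import numpy as np

# Assumed k-fold settings for mljar-supervised; values are illustrative.
validation_strategy = {
    "validation_type": "kfold",
    "k_folds": 10,
    "shuffle": True,
    "random_seed": 123,
}


def shuffled_kfold_indices(n_samples, k_folds, seed):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_samples)
    folds = np.array_split(permuted, k_folds)  # list of index arrays
    for i in range(k_folds):
        val_idx = np.sort(folds[i])
        train_idx = np.sort(np.concatenate(folds[:i] + folds[i + 1:]))
        yield train_idx, val_idx


# Toy stand-in for the "era" column: 12 eras with 50 rows each.
eras = np.repeat(np.arange(12), 50)

for train_idx, val_idx in shuffled_kfold_indices(
    len(eras), validation_strategy["k_folds"], validation_strategy["random_seed"]
):
    # Rows from many different eras end up in the same validation fold.
    print(len(np.unique(eras[val_idx])), "eras in this validation fold")
```

With mljar-supervised installed, the dictionary would be passed as `AutoML(mode="Compete", validation_strategy=validation_strategy)`.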

I'm not using the validation data for training. One last tip from me: I'm using feature neutralization.
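The thread does not say how the neutralization is done. One common formulation, sketched here as an assumption, subtracts from the predictions the part that a linear least-squares fit on the features can explain. The function name `neutralize`, the `proportion` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def neutralize(predictions, features, proportion=1.0):
    """Remove the part of `predictions` that is linearly explained by `features`."""
    predictions = np.asarray(predictions, dtype=float)
    features = np.asarray(features, dtype=float)
    # Least-squares fit of the predictions on the feature matrix.
    coefs, *_ = np.linalg.lstsq(features, predictions, rcond=None)
    exposure = features @ coefs
    # proportion=1.0 removes the whole linear component, 0.0 removes nothing.
    return predictions - proportion * exposure


# Toy data: 200 rows, 5 features, predictions that partly depend on the features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 5))
predictions = features @ np.array([0.5, -0.2, 0.0, 0.1, 0.3]) + rng.normal(size=200)

neutral = neutralize(predictions, features)
# After full neutralization the residual is orthogonal to every feature column.
print(np.abs(features.T @ neutral).max())
```

On Numerai-style data this would probably be applied separately within each era, but that choice is also an assumption.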

Jhixx24 commented 3 years ago

Ah, OK, thanks. One last question: do you have any idea how to get rid of this error while installing AutoML? RuntimeError: Building llvmlite requires LLVM 10.0.x or 9.0.x, got '11.0.1'. Be sure to set LLVM_CONFIG to the right executable path.

pplonski commented 3 years ago

I need more information:

pplonski commented 3 years ago

Python 3.9 is not yet supported. Please try with Python 3.7

Jhixx24 commented 3 years ago

OK, I will try it. Thank you.

Jhixx24 commented 3 years ago

It worked! 👍

pplonski commented 3 years ago

@Jhixx24 Grouped Time Series validation can be applied with custom validation. Closing the issue. Fixed in #380.
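A minimal sketch of what era-based custom validation might look like. It assumes the custom validation API takes `validation_strategy={"validation_type": "custom"}` plus a `cv` argument to `fit` holding `(train_indices, validation_indices)` pairs. The helper `era_time_series_folds` is hypothetical and is a simpler forward-chaining variant, not the `TimeSeriesSplitGroups` class from the first comment. It also assumes eras sort chronologically, so string labels such as "era10" would need converting to integers first.

```python
import numpy as np


def era_time_series_folds(eras, n_splits=3):
    """Build forward-chaining (train_idx, val_idx) folds that never split an era."""
    eras = np.asarray(eras)
    unique_eras = np.unique(eras)  # sorted; assumed to be chronological
    n_folds = n_splits + 1
    if n_folds > len(unique_eras):
        raise ValueError(
            f"Need at least {n_folds} eras for {n_splits} splits, got {len(unique_eras)}."
        )
    chunks = np.array_split(unique_eras, n_folds)
    folds = []
    for i in range(1, n_folds):
        train_eras = np.concatenate(chunks[:i])  # all earlier eras
        val_eras = chunks[i]                     # the next block of eras
        train_idx = np.flatnonzero(np.isin(eras, train_eras))
        val_idx = np.flatnonzero(np.isin(eras, val_eras))
        folds.append((train_idx, val_idx))
    return folds


def fit_with_era_folds(X, y, eras):
    """Hypothetical wiring into mljar-supervised (requires the package to be installed)."""
    from supervised.automl import AutoML  # lazy import so the fold helper runs without mljar

    automl = AutoML(
        mode="Explain",
        algorithms=["Xgboost"],
        validation_strategy={"validation_type": "custom"},
    )
    automl.fit(X, y, cv=era_time_series_folds(eras))
    return automl


# Toy check: 8 eras with 3 rows each.
eras = np.repeat(np.arange(1, 9), 3)
for train_idx, val_idx in era_time_series_folds(eras, n_splits=3):
    print(np.unique(eras[train_idx]), "->", np.unique(eras[val_idx]))
```

Every validation fold contains only eras later than those in its training fold, which is the property the original question asked for.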