perpetual-ml / perpetual

A self-generalizing gradient boosting machine which doesn't need hyperparameter optimization
https://perpetual-ml.com/
GNU Affero General Public License v3.0

What am I doing wrong? Results are poor for toy datasets. #14

Closed: strelzoff-personal closed this issue 5 days ago

strelzoff-personal commented 1 week ago
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
import time
import pandas as pd
from perpetual import PerpetualBooster

def evaluate(model, X_train, y_train, X_test, y_test, budget=None):
    # Time the fit; pass a budget only when one is given.
    start = time.time()
    if budget:
        model.fit(X_train, y_train, budget=budget)
    else:
        model.fit(X_train, y_train)
    return time.time() - start, accuracy_score(y_test, model.predict(X_test)), log_loss(y_test, model.predict_proba(X_test))

# Breast cancer as-is; iris restricted to classes 0 and 1 for a binary task.
iris_X, iris_y = load_iris(return_X_y=True)
datasets = {
    "Breast Cancer": load_breast_cancer(return_X_y=True),
    "Binary Iris": (iris_X[iris_y != 2], iris_y[iris_y != 2]),
}
results = pd.DataFrame(columns=["Dataset", "Model", "Budget", "Time", "Accuracy", "Log Loss"])

for name, (X, y) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pb = PerpetualBooster(objective="LogLoss")
    rf = RandomForestClassifier()
    results = pd.concat([
        results,
        pd.DataFrame([[name, "PB", "Default", *evaluate(pb, X_train, y_train, X_test, y_test)]], columns=results.columns),
        pd.DataFrame([[name, "PB", 2, *evaluate(pb, X_train, y_train, X_test, y_test, budget=2)]], columns=results.columns),
        pd.DataFrame([[name, "RF", "N/A", *evaluate(rf, X_train, y_train, X_test, y_test)]], columns=results.columns),
    ], ignore_index=True)
deadsoul44 commented 1 week ago

These are toy datasets. Try with a real dataset.

The algorithm is stopping early so that it does not overfit.

We are working on the AutoML benchmark: https://github.com/openml/automlbenchmark

This benchmark also contains small datasets. We will also improve the algorithm for small / toy datasets with some minor tweaks. For what "a real dataset" could look like here, see the sketch below.
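A minimal sketch of a larger run, reusing the same PerpetualBooster calls as the snippets in this thread on a synthetic dataset from sklearn's make_classification; the dataset size and budget values are arbitrary illustrations, not recommendations:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from perpetual import PerpetualBooster

# A larger synthetic binary classification problem (sizes chosen arbitrarily).
X, y = make_classification(n_samples=100_000, n_features=50, n_informative=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for budget in (0.1, 1.0, 2.0):
    model = PerpetualBooster(objective="LogLoss")
    model.fit(X_train, y_train, budget=budget)
    acc = accuracy_score(y_test, model.predict(X_test))
    ll = log_loss(y_test, model.predict_proba(X_test))
    print(f"budget={budget}: accuracy={acc:.4f}, log loss={ll:.4f}")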

deadsoul44 commented 5 days ago
import time
import pandas as pd
from sklearn.datasets import load_breast_cancer, load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, log_loss
from perpetual import PerpetualBooster

def evaluate(model, X_train, y_train, X_test, y_test, budget=None):
    start = time.time()
    # Only pass budget when it is set; otherwise fit with defaults.
    if budget:
        model.fit(X_train, y_train, budget=budget)
    else:
        model.fit(X_train, y_train)
    duration = time.time() - start
    return duration, accuracy_score(y_test, model.predict(X_test)), log_loss(y_test, model.predict_proba(X_test))

# Load iris once and keep only classes 0 and 1 to get a binary problem.
iris_X, iris_y = load_iris(return_X_y=True)
datasets = {
    "Breast Cancer": load_breast_cancer(return_X_y=True),
    "Binary Iris": (iris_X[iris_y != 2], iris_y[iris_y != 2]),
}
results = pd.DataFrame(columns=["Dataset", "Model", "Budget", "Time", "Accuracy", "Log Loss"])

for name, (X, y) in datasets.items():
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    pb = PerpetualBooster(objective="LogLoss")
    rf = RandomForestClassifier()
    results = pd.concat([results,
                         pd.DataFrame([[name, "Perpetual", "0.1", *evaluate(pb, X_train, y_train, X_test, y_test, budget=0.1)]], columns=results.columns),
                         pd.DataFrame([[name, "Perpetual", "1.0", *evaluate(pb, X_train, y_train, X_test, y_test, budget=1.0)]], columns=results.columns),
                         pd.DataFrame([[name, "Perpetual", "2.0", *evaluate(pb, X_train, y_train, X_test, y_test, budget=2.0)]], columns=results.columns),
                         pd.DataFrame([[name, "RF", "-", *evaluate(rf, X_train, y_train, X_test, y_test)]], columns=results.columns),
                        ],
                    ignore_index=True)
deadsoul44 commented 5 days ago

v0.5.0 is released with the fix. Results:

Dataset        Model      Budget  Time (s)    Accuracy  Log Loss
Breast Cancer  Perpetual  0.1     149.592     0.973684  0.158678
Breast Cancer  Perpetual  1.0     129.906     0.973684  0.123220
Breast Cancer  Perpetual  2.0     155.879     0.973684  0.099885
Breast Cancer  RF         -         0.522181  0.964912  0.103776
Binary Iris    Perpetual  0.1       0.335295  1.000000  0.000032
Binary Iris    Perpetual  1.0       0.378495  1.000000  0.000273
Binary Iris    Perpetual  2.0       0.334572  1.000000  0.004814
Binary Iris    RF         -         0.305424  1.000000  0.002518
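
Before rerunning the script above, it is worth confirming that the fixed release is actually installed; a small check, assuming the installed distribution name matches the import name perpetual:

from importlib.metadata import version

# Should report 0.5.0 or later once the fix is installed
# (assumes the package is distributed under the name "perpetual").
print(version("perpetual"))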