ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[Bug] Hyperparameter tuning with XGBoost does not checkpoint the correct model. #20267

Closed: pollackscience closed this issue 2 years ago

pollackscience commented 2 years ago


Ray Component

Ray Tune

What happened + What you expected to happen

I'm training a binary classifier for a massively imbalanced dataset. While doing hyperparameter search with ray.tune, I've noticed that the checkpointed 'best model' does not produce the listed score when run on the identical evaluation set. The difference can be very large.

Versions / Dependencies

Python 3.8.10, Ray 1.8.0, RHEL 8.4

Reproduction script

This is a sample script that roughly mimics my use case. It's mainly based on https://docs.ray.io/en/latest/tune/tutorials/tune-xgboost.html

import os

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from ray import tune
from ray.tune.integration.xgboost import TuneReportCheckpointCallback
from ray.tune.schedulers import ASHAScheduler

# Define a massively imbalanced binary dataset (~0.01% positives).
X, y = make_classification(n_samples=1000000, n_features=10,
                           weights=[0.9999], random_state=1, class_sep=1e8)

def train_bdt(config: dict, train_x, train_y, test_x, test_y):
    # Simple training function to be passed into Tune.
    # Build input matrices for XGBoost.
    train_set = xgb.DMatrix(train_x, label=train_y)
    test_set = xgb.DMatrix(test_x, label=test_y)
    # Train the classifier, reporting metrics and checkpointing the model
    # through the Tune callback.
    xgb.train(
        config,
        train_set,
        evals=[(test_set, "eval")],
        verbose_eval=False,
        callbacks=[TuneReportCheckpointCallback(filename="model.xgb")])

def get_best_model_checkpoint(analysis):
    best_bst = xgb.Booster()
    best_bst.load_model(os.path.join(analysis.best_checkpoint, "model.xgb"))
    auc = analysis.best_result["eval-auc"]
    print(f"Best model parameters: {analysis.best_config}")
    print(f"Best model auc: {auc:.4f}")
    return best_bst

def tune_xgboost(train_x, train_y, test_x, test_y):
    search_space = {'learning_rate': tune.loguniform(1e-4, 5e-1),
                    'max_depth': tune.choice([10, 12]),
                    'colsample_bytree': tune.loguniform(1e-2, 1),
                    'subsample': tune.loguniform(1e-2, 1),
                    'min_child_weight': tune.uniform(1e-1, 2),
                    'gamma': tune.uniform(0, 1),
                    'random_state': 979,
                    'eval_metric': 'auc',
                    'tree_method': 'gpu_hist',
                    'scale_pos_weight': 1000,
                    'objective': 'binary:logistic'}
    # This will enable aggressive early stopping of bad trials.
    scheduler = ASHAScheduler(
        max_t=100,  # 100 training iterations
        grace_period=1,
        reduction_factor=2)

    analysis = tune.run(
        tune.with_parameters(train_bdt,
                             train_x=train_x,
                             train_y=train_y,
                             test_x=test_x,
                             test_y=test_y),
        metric="eval-auc",
        mode="max",
        # Fractional GPU allocation lets many trials share each GPU.
        resources_per_trial={"cpu": 2, "gpu": 0.015},
        config=search_space,
        num_samples=100,
        scheduler=scheduler,
        verbose=1)

    return analysis

# Split into train and test sets.
train_x, test_x, train_y, test_y = train_test_split(
    X, y, test_size=0.25)
analysis = tune_xgboost(train_x, train_y, test_x, test_y)

best_bst = get_best_model_checkpoint(analysis)
test_set = xgb.DMatrix(test_x, label=test_y)
print(best_bst.eval(test_set))
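
One way to double-check the number that Booster.eval prints is to recompute the AUC directly with scikit-learn instead of parsing the eval string. A minimal sketch continuing from the script above (best_bst, test_x, and test_y are defined there):

from sklearn.metrics import roc_auc_score

# Recompute AUC from the loaded checkpoint's raw predictions rather than
# relying on Booster.eval's formatted output.
preds = best_bst.predict(xgb.DMatrix(test_x))
print("sklearn AUC:", roc_auc_score(test_y, preds))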

Anything else

When I run this example, ray.tune reports the best AUC as 0.5266, but when I load the checkpoint and run on the test data, I get AUC=0.5179. When I run on my actual data, tune reports an AUC of 0.99+, but loading the "best" checkpoint gives me an AUC of ~0.5. If I then train a new model from scratch with the best hyperparameters, the new model gets an AUC very close to the tune report. I think having a massively imbalanced dataset worsens this issue, but I'm not sure. This could be related to #19173, since it seems that checkpointed models are only saved every 5th iteration.
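
If the every-5th-iteration checkpointing is indeed the cause, a possible workaround, assuming the Ray 1.x callback exposes the frequency argument described in its API docs, would be to checkpoint on every boosting iteration so the saved model matches the iteration that produced the best reported score:

from ray.tune.integration.xgboost import TuneReportCheckpointCallback

# Assumption: TuneReportCheckpointCallback accepts a `frequency` argument
# (checkpoint every `frequency` iterations). frequency=1 checkpoints on
# every iteration, at the cost of extra I/O per trial.
callback = TuneReportCheckpointCallback(filename="model.xgb", frequency=1)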


stale[bot] commented 2 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity within 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

stale[bot] commented 2 years ago

Hi again! This issue will be closed because there has been no activity in the 14 days since the last message.

Please feel free to reopen or open a new issue if you'd still like it to be addressed.

Again, you can always ask for help on our discussion forum or Ray's public slack channel.

Thanks again for opening the issue!