microsoft / FLAML

A fast library for AutoML and tuning. Join our Discord: https://discord.gg/Cppx2vSPVP.
https://microsoft.github.io/FLAML/
MIT License

Cross-validation process isn't always completed across all folds #1349

Closed dannycg1996 closed 1 month ago

dannycg1996 commented 2 months ago

Hi all,

I've found an issue where sometimes, when the time_budget is close to running out, the final model(s) aren't evaluated across all n_splits folds requested by the user; instead, the cross-validation process is ended early.

For example, a user might request 10-fold cross-validation, but the final model is actually only evaluated with 4-fold cross-validation. This means that on rare occasions, FLAML returns a best model (i.e. automl.model) which was evaluated across fewer folds than the user requested. It's possible that, were that model evaluated across all requested folds, it would no longer be the best model.

I can understand why FLAML ends early when the time budget is low, but it would be great if there were either an option to switch this off (i.e. once a model is selected, it is evaluated across all folds) or to have automl.model return the best fully evaluated model, not just the model with the lowest average validation loss across however many folds happened to complete.

This is a hard issue to replicate, for two reasons:
1) Performance of FLAML differs across machines.
2) The default logging doesn't record the number of folds used.

I've done my best to make this issue reproducible by writing a custom aggregation function, cv_score_agg_func, which logs an 'n_folds' attribute showing the number of folds each model was evaluated against.

The code below produces the attached logs on my machine, which showcase the error (note "n_folds": 4 on the final line). Others may need to tweak the time_budget to trigger a similar issue.

from flaml import AutoML
from sklearn import datasets

def cv_score_agg_func(val_loss_folds, log_metrics_folds):
    # Metric to minimise: mean validation loss across the folds actually run.
    metric_to_minimize = sum(val_loss_folds) / len(val_loss_folds)
    # Sum the logged metrics across folds (either a dict of metrics or a single value).
    metrics_to_log = None
    for single_fold in log_metrics_folds:
        if metrics_to_log is None:
            metrics_to_log = single_fold
        elif isinstance(metrics_to_log, dict):
            metrics_to_log = {k: metrics_to_log[k] + v for k, v in single_fold.items()}
        else:
            metrics_to_log += single_fold
    if metrics_to_log:
        n = len(val_loss_folds)
        # Average the summed metrics, then record how many folds were actually evaluated.
        metrics_to_log = (
            {k: v / n for k, v in metrics_to_log.items()}
            if isinstance(metrics_to_log, dict)
            else metrics_to_log / n
        )
        metrics_to_log["n_folds"] = n

    return metric_to_minimize, metrics_to_log

dic_data = datasets.load_iris(as_frame=True)  # Bunch containing a pandas frame
iris_data = dic_data["frame"]  # pandas DataFrame: features + target
automl = AutoML()
automl_settings = {
    "time_budget": 11,  # in seconds
    "metric": "accuracy",
    "task": "classification",
    "log_file_name": "incomplete_error.log",
    "log_type": "all",
    "eval_method": "cv",
    "n_splits": 10,
    "cv_score_agg_func": cv_score_agg_func,
    "early_stop": False,
}
x_train = iris_data[["sepal length (cm)", "sepal width (cm)", "petal length (cm)", "petal width (cm)"]].to_numpy()
y_train = iris_data["target"]
automl.fit(x_train, y_train, **automl_settings)

In the attached logs, it should be clear that the final model was not tested against all 10 folds.

incomplete_error.log

In terms of package versions, I'm using FLAML 2.1.2, catboost 1.2.5, scikit-learn 1.5.0 and Python 3.12.0.
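For anyone trying to confirm the behaviour from their own run, here is a minimal sketch of how the resulting log could be scanned for short cross-validation runs. This is not part of FLAML; it assumes each log line is a JSON record whose logged metrics dict carries the "n_folds" key written by the custom cv_score_agg_func above - adjust the keys to match the actual log format.

```python
# Hypothetical helper (not FLAML's API): scan a log file for records whose
# "n_folds" is below the requested n_splits, i.e. incomplete CV evaluations.
import json

def incomplete_cv_records(log_path, expected_folds=10):
    """Return the parsed records whose n_folds is below expected_folds."""
    incomplete = []
    with open(log_path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip any non-JSON lines in the log
            metrics = record.get("logged_metric") or {}
            n_folds = metrics.get("n_folds")
            if n_folds is not None and n_folds < expected_folds:
                incomplete.append(record)
    return incomplete
```

With the repro above, any non-empty result would confirm that at least one model was evaluated against fewer folds than requested.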

thinkall commented 1 month ago

Thank you @dannycg1996 for reporting the issue. Would you like to raise a PR to fix it?

dannycg1996 commented 1 month ago

Hi @thinkall, I'd be happy to.

Could I ask for your thoughts before I write the code, please? I've identified the source of this issue, which is this line in generic_task.py:

if budget and time.time() - start_time >= budget:
    break

I see two potential solutions here.

1) Simply add another boolean parameter, complete_cv_process, to AutoML.fit() (I'm not sure whether it should default to True or False). We could then change the above if statement to:

if budget and not complete_cv_process and time.time() - start_time >= budget:
    break

This would ensure that the cross-validation process runs to completion. However, there is the obvious issue that this risks overrunning the time budget. For large datasets or slow estimators (especially something like Support Vector Machines), this may be problematic.

2) We stick with the current approach of ending the cross-validation process early when the time budget runs out, but we scrap incomplete models. We don't log them (perhaps), and we don't compare their validation loss against the current best model, so there's no risk of the best_model provided to the user being one that was evaluated through an incomplete cross-validation process. This is probably the better solution, as it respects the allocated time budget. However, I've delved into the FLAML codebase and I'm not sure how to implement it - I'd need advice from someone more experienced with the codebase on how to safely exit the AutoML process.
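To make the second option concrete, here is an illustrative sketch (not FLAML internals) of a fold loop that still stops early on budget exhaustion, but signals incompleteness to the caller so the partial result is discarded rather than compared against the best model:

```python
# Sketch of solution 2 under stated assumptions: each fold is a callable
# returning that fold's validation loss. If the budget expires mid-way,
# return None so the caller drops the incomplete evaluation entirely.
import time

def evaluate_cv(fold_eval_fns, budget=None):
    """Return the mean loss over all folds, or None if the budget expired
    before every fold was evaluated."""
    start = time.monotonic()
    losses = []
    for evaluate_fold in fold_eval_fns:
        if budget is not None and time.monotonic() - start >= budget:
            return None  # incomplete: must not become a best-model candidate
        losses.append(evaluate_fold())
    return sum(losses) / len(losses)
```

The design choice here is that "no result" is strictly safer than "a result averaged over fewer folds than requested", at the cost of wasting the training time already spent.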

thinkall commented 1 month ago

Thank you @dannycg1996. I'd prefer the first solution. We don't shut down training exactly at the given time budget as it is, so this change won't introduce any surprises. As for the second solution, it means we risk wasting a lot of time training a model without using it at all. WDYT?

dannycg1996 commented 1 month ago

Thanks for the quick feedback @thinkall!

I'm happy to implement the first solution. Thinking about it, do we even need to add the complete_cv_process boolean parameter (as outlined above)? I have to assume that users would rather have their AutoML process overrun its budget slightly than receive an incomplete model. It's probably better just to remove the if statement entirely, which would ensure that users never receive incomplete models.

What are your thoughts?

thinkall commented 1 month ago

> Thanks for the quick feedback @thinkall!
>
> I'm happy to implement the first solution. Thinking about it, do we even need to add the complete_cv_process boolean parameter (as outlined above)? I have to assume that users would rather have their AutoML process overrun its budget slightly, than receive an incomplete model.
>
> It is probably the better option just to remove the if statement in its entirety, which would ensure that users never receive incomplete models.
>
> What are your thoughts?

Do you mean exposing the parameter to FLAML users? Could it be an internal flag variable instead? We could detect whether the cross-validation finished and modify the value automatically.

dannycg1996 commented 1 month ago

Sorry, I did originally intend to expose the complete_cv_process parameter to users - it would be a parameter on the automl.fit() method, allowing users to dictate whether the cross-validation process must run to completion (complete_cv_process=True) or can be exited early if the allocated budget is running out (complete_cv_process=False). However, I don't think anyone would want to set complete_cv_process=False - it isn't worth the trade-off.

I'm not sure how useful an internal flag would be. The if statement I highlighted above

if budget and time.time() - start_time >= budget:
    break

acts to exit the cross-validation process early if the time budget has run out. I could add an internal flag somewhere which tracks whether the cross-validation process was completed for a given model, but it isn't much use unless we then use that flag somewhere, and you've stated (very reasonably) above that you'd prefer we just complete the CV process, even if it means overrunning the time budget.

The best solution in my eyes is to completely remove the if statement highlighted above, so that we never exit the cross-validation process early - the cross-validation process will run to completion every time. I've tested it locally, and it works for me. Would you be happy with that solution?
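Schematically, the proposed change means the fold loop no longer checks the budget at all. A minimal sketch (not the actual FLAML code) of the loop after the removal:

```python
# Sketch of the fold loop after the fix: the per-fold budget check
# ("if budget and time.time() - start_time >= budget: break") is gone,
# so every requested fold is always evaluated, at the cost of possibly
# overrunning the time budget slightly.

def cross_validate(fold_eval_fns):
    """fold_eval_fns: one callable per fold returning that fold's loss.
    Always evaluates every fold and returns the mean loss."""
    losses = []
    for evaluate_fold in fold_eval_fns:
        # Early-exit check removed: CV always runs to completion.
        losses.append(evaluate_fold())
    return sum(losses) / len(losses)
```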

thinkall commented 1 month ago

> Sorry, I did originally intend to expose the complete_cv_process parameter to users - it would be a parameter on the automl.fit() method, which would allow users to dictate whether the cross-validation process must run to completion (complete_cv_process=True) or can be exited early if the allocated budget is running out (complete_cv_process=False). However, I don't think anyone will want to set complete_cv_process=False - it isn't worth the trade-off.
>
> I'm not sure how useful an internal flag would be. The if statement I highlighted above
>
>     if budget and time.time() - start_time >= budget:
>         break
>
> acts to exit the cross-validation process early if the time budget has run out. I could have an internal flag somewhere which tracks whether the cross-validation process was completed or not for a given model, but it isn't much use unless we then use that flag somewhere, and you've stated (very reasonably) above that you'd prefer that we just complete the CV process - even if it means overrunning the time budget.
>
> The best solution in my eyes is just to completely remove the if statement highlighted above, so that we never exit the cross-validation process early - the cross-validation process will be run to completion every time. I've tested it locally, and it works for me. Would you be happy with that solution?

You're right, @dannycg1996! I agree that we can simply remove the if statement here. We already account for the time budget with budget_per_train; the overtime will mainly come from the overhead of each train, which should be acceptable. In your experiments, how much extra time did cross-validation take after removing the if statement?

dannycg1996 commented 1 month ago

My original time budget above was only 11 seconds! I'm not sure exactly how much extra time completing the CV process took, but I believe it was under a second. Obviously for slower models or larger datasets this overtime would increase, but I fully agree that it should be acceptable (and is certainly preferable to returning an incomplete model).
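For anyone wanting to quantify the overrun on their own setup, a simple hypothetical helper (not part of FLAML) is to wrap the fit call and compare wall-clock time against the requested time_budget:

```python
# Hypothetical measurement helper: run a fit callable and report how far
# past the requested budget it ran. Uses a monotonic clock so the result
# is unaffected by system clock adjustments.
import time

def measure_overrun(fit_fn, time_budget):
    """Run fit_fn() and return (elapsed_seconds, overrun_seconds)."""
    start = time.monotonic()
    fit_fn()
    elapsed = time.monotonic() - start
    return elapsed, max(0.0, elapsed - time_budget)
```

Used with the earlier repro, something like measure_overrun(lambda: automl.fit(x_train, y_train, **automl_settings), 11) would show how much the completed CV run exceeds the 11-second budget.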