keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

`Tuner.get_best_models` suffers from data leakage #1018

Open wmay opened 4 months ago

wmay commented 4 months ago

Describe the bug

The purpose of KerasTuner is hyperparameter selection. However, the default implementation suffers from a known data leakage problem. For each set of hyperparameter values, the best epoch is chosen based on the validation loss (or validation accuracy), and then that same validation metric is used later to rank the hyperparameters. That means the same validation data is being used to create the model and then to evaluate it. That's data leakage.

In practical terms, this means the rankings generated by KerasTuner are biased toward hyperparameter values that are better at overfitting the validation data. This could be leading many KerasTuner users to select incorrect hyperparameters.

In theory, a user could customize a tuner to avoid data leakage, but there's nothing in the documentation to warn users that this may be required. For example, the getting started guide simply shows the default tuner being run using `val_accuracy` as the objective, with no warning that this method is biased.
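To make the pattern concrete, here's a minimal sketch of the default usage I mean (toy data and an arbitrary model of my own, not the guide's actual example):

```python
import numpy as np
import keras
import keras_tuner

# Toy data: a train/validation split for the search, plus a test split that is
# held out entirely (used in the workaround sketched further down).
x = np.random.rand(300, 20).astype("float32")
y = np.random.randint(0, 10, size=(300,))
x_train, y_train = x[:200], y[:200]
x_val, y_val = x[200:250], y[200:250]
x_test, y_test = x[250:], y[250:]

def build_model(hp):
    model = keras.Sequential([
        keras.layers.Dense(hp.Int("units", 32, 256, step=32), activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = keras_tuner.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=5,
    overwrite=True,
)

# (x_val, y_val) ends up doing double duty here: within each trial the best
# epoch is picked by its val_accuracy, and the trials are then ranked against
# each other by that same best val_accuracy.
tuner.search(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
```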

And IMO this should not require customization; tuning without data leakage should be built into KerasTuner, because it's a core part of the job KerasTuner is supposed to do.

Expected behavior

The expected behavior is that, by default, KerasTuner performs hyperparameter selection without data leakage: the data used to evaluate and rank the hyperparameters should not also be used to select an epoch. That could be done by scoring trials on a separate test dataset, or by choosing the number of epochs in a different way.

Either that, or the documentation should clearly warn users that the default tuner's rankings are biased due to data leakage.
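For what it's worth, a user can already reduce the ranking bias by re-scoring the top trials on the held-out test split from the sketch above. Something like this (a workaround sketch, not an existing KerasTuner feature):

```python
# Re-rank the top candidates on data that was never used during the search.
n_candidates = 3
best_models = tuner.get_best_models(num_models=n_candidates)
best_hps_list = tuner.get_best_hyperparameters(num_trials=n_candidates)

scored = []
for model, hps in zip(best_models, best_hps_list):
    # evaluate() returns [loss, accuracy] given the compile() call above
    loss, acc = model.evaluate(x_test, y_test, verbose=0)
    scored.append((acc, hps))

# Choose hyperparameters by held-out test accuracy instead of val_accuracy.
best_acc, chosen_hps = max(scored, key=lambda t: t[0])
```

This only fixes the cross-trial ranking; the best epoch within each trial is still chosen on the validation split, which, per the quote under Additional context, appears to matter much less.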

Additional context

It's not easy to find this information in the code or documentation. The documentation for `Tuner.get_best_models` mentions that the best epoch is used when loading the model weights, but leaves it unclear whether the best epoch's metric is also used for hyperparameter evaluation. I finally found the implementation in `MetricHistory.get_best_value`.
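For anyone else tracing this, here is a simplified stand-in for the behavior as I understand it (my own paraphrase, not the actual KerasTuner source): the per-epoch history of the objective serves both to pick the best epoch and to score the trial.

```python
# Paraphrase of the behavior, not the real MetricHistory implementation.
def best_value(per_epoch_values, direction="max"):
    """Best value of an objective recorded at the end of each epoch."""
    return max(per_epoch_values) if direction == "max" else min(per_epoch_values)

# Hypothetical per-epoch val_accuracy histories for two trials.
trial_a = [0.71, 0.74, 0.78, 0.77]
trial_b = [0.72, 0.75, 0.76, 0.80]

# The same numbers pick each trial's best epoch (the weights to restore)...
score_a = best_value(trial_a)  # 0.78, epoch 3
score_b = best_value(trial_b)  # 0.80, epoch 4
# ...and then rank the trials against each other, so the validation data both
# shapes the model (via epoch selection) and evaluates the hyperparameters.
ranking = sorted([("trial_a", score_a), ("trial_b", score_b)],
                 key=lambda t: t[1], reverse=True)
print(ranking)  # [('trial_b', 0.8), ('trial_a', 0.78)]
```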

This is an important problem that can lead to large biases. For example, from Data Leakage and Evaluation Issues in Micro-Expression Analysis:

> The most common issue is using test data to determine a hyperparameter, that is, the number of epochs. We find that methods achieving close to 80 F1-Score, but in fact only reach a performance of around 50 F1-Score when the data leak issue is fixed.
>
> We further experiment whether using early stopping properly, i.e., by using validation data (ESV), has an impact. [...] Early stopping is performed on the validation data, and testing data is not touched until the model has been fully trained. The results on NMER ESV show no improvement. The experiments show that using early stopping with test data can create a large positive bias, while using the validation data shows barely no impact.

Would you like to help us fix it?

I don't know how much you want me digging around in the guts of the source code. This may be something keras-team should handle. But sure, I could help.

Edit: To clarify, I don't know how important this issue is in practice. I have some suspicions, but I haven't carefully checked whether this changes the outcome for any problem I'm working on (if I do, I'll be sure to post the example here).