Describe the bug
The purpose of KerasTuner is hyperparameter selection. However, the default implementation suffers from a known data leakage problem. For each set of hyperparameter values, the best epoch is chosen based on the validation loss (or validation accuracy), and then that same validation metric is later used to rank the hyperparameters. In other words, the same validation data is used both to create the model and to evaluate it. That's data leakage.
In practical terms, this means the rankings generated by KerasTuner are biased toward hyperparameter values that are better at overfitting the validation data. This could be leading many KerasTuner users to select incorrect hyperparameters.
In theory, a user could customize a tuner to avoid the leakage, but nothing in the documentation warns users that this may be necessary. For example, the getting started guide simply shows the default tuner being run with val_accuracy as the objective, with no warning that this method is biased (see the sketch below).
And in my opinion this should not require customization: tuning without data leakage should be built into KerasTuner, because it is a core part of the job KerasTuner is supposed to do.
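For concreteness, the default workflow looks roughly like this (a minimal sketch with placeholder data and a placeholder model, not the guide's exact code). The best epoch of each trial is picked from val_accuracy on the validation split, and that same val_accuracy value is what ranks the trials:

```python
import numpy as np
import keras
import keras_tuner

# Placeholder data so the sketch runs; the guide uses a real dataset.
x_train, y_train = np.random.rand(200, 20), np.random.randint(0, 10, 200)
x_val, y_val = np.random.rand(50, 20), np.random.randint(0, 10, 50)

def build_model(hp):
    # Hypothetical search space, just to illustrate the workflow.
    model = keras.Sequential([
        keras.layers.Dense(hp.Int("units", 32, 512, step=32), activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

tuner = keras_tuner.RandomSearch(
    build_model,
    objective="val_accuracy",  # the metric that also selects each trial's best epoch
    max_trials=10,
)
# The same (x_val, y_val) that selects the best epoch within a trial is the data
# whose val_accuracy then ranks the hyperparameter values against each other.
tuner.search(x_train, y_train, epochs=10, validation_data=(x_val, y_val))
best_hp = tuner.get_best_hyperparameters(1)[0]
```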
Expected behavior
The expected behavior is that, by default, KerasTuner performs hyperparameter selection without data leakage: the data used to evaluate a set of hyperparameters should not also be used to select an epoch. That could be done by including a separate testing dataset, or by choosing epochs in a different way.
Either that, or the documentation should clearly warn users that the default tuner has a bias due to data leakage.
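One possible shape for the workaround, sketched only as an illustration (the data split, model, and the HoldoutTuner name are placeholders; it assumes the custom-tuner pattern in recent KerasTuner versions where run_trial may return a float that the tuner minimizes): early stopping on a validation split chooses the epoch, while a separate held-out split produces the score used to rank the trials.

```python
import numpy as np
import keras
import keras_tuner

# Placeholder three-way split: train / val (epoch selection) / test (ranking).
x = np.random.rand(300, 20)
y = np.random.randint(0, 10, 300)
x_train, y_train = x[:200], y[:200]
x_val, y_val = x[200:250], y[200:250]
x_test, y_test = x[250:], y[250:]

def build_model(hp):
    model = keras.Sequential([
        keras.layers.Dense(hp.Int("units", 32, 128, step=32), activation="relu"),
        keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

class HoldoutTuner(keras_tuner.RandomSearch):
    def run_trial(self, trial, *args, **kwargs):
        model = build_model(trial.hyperparameters)
        # The validation split chooses the epoch (via early stopping)...
        model.fit(
            x_train, y_train,
            validation_data=(x_val, y_val),
            epochs=20,
            callbacks=[keras.callbacks.EarlyStopping(
                monitor="val_loss", patience=3, restore_best_weights=True)],
            verbose=0,
        )
        # ...while data never seen during epoch selection produces the score
        # used to rank this hyperparameter set. Returning a float asks the
        # tuner to minimize it.
        test_loss, _ = model.evaluate(x_test, y_test, verbose=0)
        return test_loss

tuner = HoldoutTuner(max_trials=5, overwrite=True, directory="tuner_dir")
tuner.search()
best_hp = tuner.get_best_hyperparameters(1)[0]
```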
Additional context
It's not easy to find this behavior in the code or documentation. The documentation for Tuner.get_best_models mentions that the best epoch is selected for the model weights, but leaves it unclear whether the best epoch is also used for hyperparameter evaluation. I finally found the implementation in MetricHistory.get_best_value.
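To spell out what that means in practice (a conceptual mock only, not KerasTuner's actual code):

```python
# Per-epoch validation metric recorded during one trial (made-up numbers).
val_accuracy_per_epoch = [0.61, 0.70, 0.74, 0.73, 0.69]

# The value recorded for the trial is the best epoch's validation metric...
trial_score = max(val_accuracy_per_epoch)  # 0.74

# ...and that same number is then compared across trials to rank hyperparameter
# sets, so the validation data both selects the epoch and scores the trial.
```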
This is an important problem that can lead to large biases. For example, from Data Leakage and Evaluation Issues in Micro-Expression Analysis:
> The most common issue is using test data to determine a hyperparameter, that is, the number of epochs. We find that methods achieving close to 80 F1-Score, but in fact only reach a performance of around 50 F1-Score when the data leak issue is fixed.
> We further experiment whether using early stopping properly, i.e., by using validation data (ESV), has an impact. [...] Early stopping is performed on the validation data, and testing data is not touched until the model has been fully trained. The results on NMER ESV show no improvement. The experiments show that using early stopping with test data can create a large positive bias, while using the validation data shows barely no impact.
Would you like to help us fix it?
I don't know how much you want me digging around in the guts of the source code. This may be something keras-team should handle. But sure, I could help.
Edit: To clarify, I don't know how important this issue is in practice. I have some suspicions, but I haven't carefully checked that this changes the outcome for any problem I'm working on (if I do I'll be sure to post the example here).