Open fraimondo opened 4 months ago
Hmm, so you think this class should behave the same as sklearn's `GridSearchCV`. As you described, we have a simple workaround to do so, and I'm not sure this class should have exactly the same behaviour as sklearn's, because it depends on the study, and the current behaviour is consistent with optuna's `optimize` method.
I understand that optuna's learning process allows for incremental data input. However, this completely changes the semantics of scikit-learn's `fit` method, to the point that it is not suitable (and even wrong) in the context of scikit-learn's model evaluation procedures.
As an example, think of a call to scikit-learn's `cross_validate` function where the `cv` parameter is a 2-fold CV scheme and the estimator is an `OptunaSearchCV` object. Ideally, we should obtain two performance estimates, each trained on 50% of the data and tested on the other 50%. With the current implementation of `OptunaSearchCV`, the second time the `fit` method is called, it will have learnt from 100% of the data, including the test sample. This is test-to-train data leakage.
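To make the leakage concrete, here is a minimal sketch with a toy `AccumulatingClassifier` (hypothetical, not part of optuna or sklearn) whose `fit` keeps state across calls, the way a reused study does. After fitting once per 50% fold, it has effectively trained on 100% of the samples:

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class AccumulatingClassifier(BaseEstimator, ClassifierMixin):
    """Toy estimator: fit() appends to the previously seen data instead
    of starting fresh, mimicking OptunaSearchCV with a reused study."""

    def fit(self, X, y):
        X = np.asarray(X)
        if hasattr(self, "X_seen_"):  # state survives a re-fit
            self.X_seen_ = np.vstack([self.X_seen_, X])
        else:
            self.X_seen_ = X
        self.classes_ = np.unique(y)
        return self


X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

clf = AccumulatingClassifier()
clf.fit(X[:5], y[:5])  # "fold 1": train on the first 50%
clf.fit(X[5:], y[5:])  # "fold 2": train on the second 50%

# The second fit has now seen all 10 samples, including what should
# have been held out as the fold-1 test set:
print(clf.X_seen_.shape[0])  # 10
```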
The solution is quite simple. On every call to `fit`, a new study should be created, copying the sampler/pruner/config of the study passed to the constructor. As it stands right now, the only proper way to use this class in the context of scikit-learn is to leave the `study` parameter at `None`, which does not allow specifying the sampler/pruner/`n_trials`/etc.
Thank you for the clarification.
Alternatively, passing a new study to `OptunaSearchCV` would be a solution too, since there we can specify the sampler/pruner, etc., even though it is not compatible with sklearn's semantics.
My concern about your suggestion is the storage. I suppose the approach works only with the default (in-memory) storage, because a study instance holds its storage info. So another rule or argument would be necessary to create a new study when calling the `fit` method.
> Thank you for the clarification.
>
> Alternatively, passing a new study to `OptunaSearchCV` would be a solution too, since there we can specify the sampler/pruner, etc., even though it is not compatible with sklearn's semantics.
This has exactly the issue I described before. Passing a `study` to the `OptunaSearchCV` object makes it incorrect within the scikit-learn integration, which I think is exactly the point of having an `OptunaSearchCV` class (i.e. to integrate with scikit-learn).
Everything can be solved easily, including the storage issue you mentioned before. Basically, instead of using the `study` specified in the constructor of `OptunaSearchCV`, use the same parameters but change the `study_name`, adding a suffix that identifies the current `fit` call. This can be done by changing the current code:

To this:
```python
# Note: requires `import re` at the top of the module.
else:
    prefix_name = self.study.study_name
    i_fit = 0
    for t_study in self.study._storage.get_all_studies():
        if re.fullmatch(f"{prefix_name}_fit[0-9]+", t_study.study_name) is not None:
            i_fit += 1
    self.study_ = study_module.create_study(
        direction="maximize",
        sampler=self.study.sampler,
        pruner=self.study.pruner,
        study_name=f"{prefix_name}_fit{i_fit}",
        storage=self.study._storage,
        load_if_exists=False,
    )
```
This creates one entry in the storage each time the `fit` method is called. It also allows inspecting the runs with the optuna dashboard, to check whether the CVs are somehow reaching a plateau (thus optimising well) or whether the study needs to be parametrised better:
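The suffix-counting logic above can also be exercised in isolation. Here is a small self-contained sketch (the helper name `next_fit_study_name` is hypothetical) that, given the study names already present in a storage, returns the name to use for the next `fit` call:

```python
import re


def next_fit_study_name(prefix_name, existing_names):
    """Count studies already named '<prefix>_fit<N>' and return the next
    name in the sequence, mirroring the snippet above."""
    i_fit = sum(
        1
        for name in existing_names
        if re.fullmatch(f"{prefix_name}_fit[0-9]+", name) is not None
    )
    return f"{prefix_name}_fit{i_fit}"


# The base study itself does not match the pattern; only the two
# per-fit studies count, so the next call gets suffix _fit2:
names = ["optuna-concept", "optuna-concept_fit0", "optuna-concept_fit1"]
print(next_fit_study_name("optuna-concept", names))  # optuna-concept_fit2
```

If study names may contain regex metacharacters, using `re.escape(prefix_name)` inside the pattern would make the match robust.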
### Expected behavior
When CV is used to evaluate a model's performance, it requires fitting the same model several times with different training datasets. Like `GridSearchCV`, `OptunaSearchCV` should find the best set of hyperparameters on each `fit` call, independently from previous `fit` calls. In a nutshell, in scikit-learn, calling `fit` should overwrite what has been learned in the previous fit. If we define a `study` and use it in the `OptunaSearchCV` object, each call to `fit` will still consider previously tested hyperparameters.

Running this code:
I can get this output:
We can see that after the first 10 trials, when the `fit` method is called again, we still consider trial 0 as the best. However, this is not the case when the `study` parameter in the `OptunaSearchCV` is left `None`:

### Environment
### Error messages, stack traces, or logs

### Steps to reproduce
```python
# Imports added for completeness; `load_dataset` is assumed to come
# from seaborn (it returns the iris dataset as a DataFrame).
import optuna
from optuna.distributions import FloatDistribution
from optuna.integration import OptunaSearchCV
from seaborn import load_dataset
from sklearn.svm import SVC

df = load_dataset("iris")
X = df.columns[:-1].tolist()
y = "species"

param_grid = {
    "C": FloatDistribution(1e-5, 1e5, log=True),
    "gamma": FloatDistribution(1e-5, 1e5, log=True),
}

study = optuna.create_study(
    direction="maximize",
    study_name="optuna-concept",
    load_if_exists=True,
)

model = OptunaSearchCV(SVC(), param_grid, study=study)

model.fit(df[X], df[y])
model.fit(df[X], df[y])
```
### Additional context (optional)

No response