keras-team / keras-tuner

A Hyperparameter Tuning Library for Keras
https://keras.io/keras_tuner/
Apache License 2.0

Disk usage #288

Open romanovzky opened 4 years ago

romanovzky commented 4 years ago

Hi there, I have been having an issue with searches easily growing to tens of GB and beyond, which on my current setup becomes prohibitive when I have multiple projects, etc. I was wondering if it would be possible for the Tuners to delete trials that have already been completely discarded as the search goes on, in order to save disk space. I reckon an optional parameter on the Tuner class could preserve the current behaviour for those who want to keep everything. Of course, this assumes that trials can be completely discarded as we go, and I am not entirely sure that is how the Tuners are implemented. Cheers

m-zheng commented 4 years ago

@romanovzky I had the same issue. I ended up writing a shell script to delete the checkpoints periodically.
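
For reference, a rough Python sketch of that kind of cleanup. The path and the checkpoint filename pattern are assumptions based on the default <directory>/<project_name>/trial_<id>/ layout, not m-zheng's actual script:

import pathlib

# Delete saved checkpoint weights from every trial folder while keeping the
# small trial.json files, so scores and hyperparameters remain available.
# "tuning/my_project" and "checkpoint*" are placeholders for the default layout.
for ckpt in pathlib.Path("tuning/my_project").glob("trial_*/checkpoint*"):
    ckpt.unlink()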

romanovzky commented 4 years ago

@m-zheng I thought about a similar solution, but I don't know which checkpoints are not being used anymore or won't be used for the next iterations. How did you solve that issue?

m-zheng commented 4 years ago

@romanovzky In my case, none of my checkpoints will be reused. After tuning, I use the code below to fetch the best hyperparameters, which can then be used to build the best model later on. It works for me atm. I may start looking for other solutions if I need more flexibility.

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
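
A minimal sketch of that workflow; the data names below are placeholders, not from the thread:

best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]
model = tuner.hypermodel.build(best_hps)  # rebuild the winning model from its hyperparameters
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=10)  # retrain it yourself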

romanovzky commented 4 years ago

Ok, so this is an a posteriori solution. My problem is that the storage space growing during training is preventing me from finalising a full optimisation round.

alexandrehuat commented 4 years ago

It is an important problem indeed. The optimal solution seems to depend on the tuner you use, but in all cases I think a quick fix would be to do several searches over small hyperparameter spaces instead of one search over a big space (a rough sketch follows below):

  1. Do a small search.
  2. Save the best model/HPs of the search, delete the files of the trials.
  3. Reiterate 1 and 2 with new HPs spaces.
  4. Keep only the best model/HPs of all the searches.

Ultimately, IMHO the best solution would be for the Keras Tuner tuner.search() method to implement a parameter like keep_trials: Optional[int], so that the files of trials not among the best keep_trials are automatically deleted as soon as possible (with the default None meaning no files are deleted).
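
A rough sketch of steps 1-4 above. The per-sub-space build functions, the data names, and the rmtree path (which assumes the default directory/project_name layout) are all placeholders:

import shutil

import keras_tuner as kt

best_score, best_hps = float("inf"), None
for i, build_subspace_model in enumerate(subspace_builders):  # one build function per small HP space (placeholder)
    tuner = kt.RandomSearch(
        build_subspace_model,
        objective="val_loss",
        max_trials=20,
        directory="tuning",
        project_name=f"subsearch_{i}",
    )
    tuner.search(x_train, y_train, validation_data=(x_val, y_val))
    best_trial = tuner.oracle.get_best_trials(1)[0]
    if best_trial.score < best_score:  # val_loss: lower is better
        best_score, best_hps = best_trial.score, best_trial.hyperparameters
    shutil.rmtree(f"tuning/subsearch_{i}")  # free the disk space used by this sub-search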

larsmathuseck commented 2 years ago

I'm running into the same issue with disk usage. Currently I have to split my search space and delete the checkpoint files manually after each sub-search.

Any updates / solutions here?

leleogere commented 2 years ago

Having a flag to disable keeping checkpoints for finished trials would be very helpful. I don't need to keep the trained model, I just need the best HP combination that I'll use later on to train the best model myself.

haifeng-jin commented 1 year ago

Yes, this is an important issue. The disk usage options should range from aggressively saving nothing but the best hyperparameters, all the way to saving everything. In between, the user should also be allowed to save only the best model and the best hyperparameters.

aleon1138 commented 1 year ago

I ran into the same issue, and the work-around is to subclass the Tuner class and re-implement run_trial, so you can bypass the checkpoint mechanism.

import keras_tuner


class BayesianTuner(keras_tuner.BayesianOptimization):
    def __init__(self, hypermodel, **kwargs):
        super().__init__(hypermodel, **kwargs)

    def run_trial(self, trial, *args, **kwargs):
        # Build and fit the model directly, skipping the base implementation
        # that attaches the checkpoint-saving callback.
        hp = trial.hyperparameters
        model = self.hypermodel.build(hp)
        return self.hypermodel.fit(hp, model, *args, **kwargs)

This works exactly as before, except that no checkpoints are saved. The trial information is still stored, so you will not lose any work already done in previous successful trials.
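
A usage sketch, with a placeholder build_model function and placeholder data names:

tuner = BayesianTuner(
    build_model,               # your existing model-building function (placeholder name)
    objective="val_loss",
    max_trials=50,
    directory="tuning",
    project_name="no_checkpoints",
)
tuner.search(x_train, y_train, validation_data=(x_val, y_val), epochs=10)
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]  # trial results are still on disk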

Hope this helps!

FareedFarag commented 1 year ago

In addition to @aleon1138's answer, if you want to properly set up TensorBoard's log directory so you can eventually visualize the tuning process, consider this subclass instead:

import copy
import os

import keras_tuner as kt
from tensorboard.plugins.hparams import api as hparams_api

class BayesianTuner(kt.BayesianOptimization):
    def __init__(self, hypermodel, **kwargs):
        super().__init__(hypermodel, **kwargs)

    def run_trial(self, trial, *args, **kwargs):
        original_callbacks = kwargs.pop("callbacks", [])

        # Run the training process multiple times.
        histories = []
        for execution in range(self.executions_per_trial):
            copied_kwargs = copy.copy(kwargs)
            callbacks = self._deepcopy_callbacks(original_callbacks)
            self._configure_tensorboard_dir(callbacks, trial, execution)
            callbacks.append(kt.engine.tuner_utils.TunerCallback(self, trial))
            # Deliberately skip the model checkpoint callback so no
            # checkpoint files are written for this trial.
            # callbacks.append(model_checkpoint)
            copied_kwargs["callbacks"] = callbacks
            obj_value = self._build_and_fit_model(trial, *args, **copied_kwargs)

            histories.append(obj_value)
        return histories

    def on_batch_begin(self, trial, model, batch, logs):
        pass

    def on_batch_end(self, trial, model, batch, logs=None):
        pass

    def on_epoch_begin(self, trial, model, epoch, logs=None):
        pass

    def on_epoch_end(self, trial, model, epoch, logs=None):
        pass

    def get_best_models(self, num_models=1):
        return super().get_best_models(num_models)

    def _deepcopy_callbacks(self, callbacks):
        try:
            callbacks = copy.deepcopy(callbacks)
        except:
            raise kt.errors.FatalValueError(
                "All callbacks used during a search "
                "should be deep-copyable (since they are "
                "reused across trials). "
                "It is not possible to do `copy.deepcopy(%s)`" % (callbacks,)
            )
        return callbacks

    def _configure_tensorboard_dir(self, callbacks, trial, execution=0):
        for callback in callbacks:
            if callback.__class__.__name__ == "TensorBoard":
                # Patch TensorBoard log_dir and add HParams KerasCallback
                logdir = self._get_tensorboard_dir(
                    callback.log_dir, trial.trial_id, execution
                )
                callback.log_dir = logdir
                hparams = kt.engine.tuner_utils.convert_hyperparams_to_hparams(
                    trial.hyperparameters
                )
                callbacks.append(
                    hparams_api.KerasCallback(
                        writer=logdir, hparams=hparams, trial_id=trial.trial_id
                    )
                )

    def _get_tensorboard_dir(self, logdir, trial_id, execution):
        return os.path.join(str(logdir), str(trial_id), f"execution{str(execution)}")

    def _get_checkpoint_fname(self, trial_id):
        return os.path.join(
            # Each checkpoint is saved in its own directory.
            self.get_trial_dir(trial_id),
            "checkpoint",
        )
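
A usage sketch for the subclass above; pass a TensorBoard callback into search() so _configure_tensorboard_dir can patch its log_dir per trial. build_model, the data names, and the paths are placeholders:

from tensorflow import keras

tuner = BayesianTuner(
    build_model,                     # placeholder model-building function
    objective="val_loss",
    max_trials=30,
    executions_per_trial=2,
    directory="tuning",
    project_name="tb_no_checkpoints",
)
tuner.search(
    x_train,
    y_train,
    validation_data=(x_val, y_val),
    epochs=10,
    callbacks=[keras.callbacks.TensorBoard(log_dir="tb_logs")],
)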

Hope that helps.

jessicapatricoski commented 1 year ago

I'm not sure if this will work for all of the tuners, but the following (combined with the tuner subclassing shown by @aleon1138) does work for GridSearch if you want to avoid the trial folders entirely and just get the hyperparameters and their scores. I (temporarily) commented out line 649 of oracle.py:

def _save_trial(self, trial):
    # Write trial status to trial directory
    trial_id = trial.trial_id
    #trial.save(os.path.join(self._get_trial_dir(trial_id), "trial.json"))

and added the following to the bottom of end_trial() in the same file

@synchronized
def end_trial(self, trial):
    ...

    # Pull the first observed value of each tracked metric, merge it with the
    # trial's hyperparameter values, and print one tab-separated row per trial.
    loss = round(trial.metrics.get_config()["metrics"]["loss"]["observations"][0]["value"][0], 5)
    auc = round(trial.metrics.get_config()["metrics"]["auc"]["observations"][0]["value"][0], 5)
    val_loss = round(trial.metrics.get_config()["metrics"]["val_loss"]["observations"][0]["value"][0], 5)
    val_auc = round(trial.metrics.get_config()["metrics"]["val_auc"]["observations"][0]["value"][0], 5)
    results = trial.hyperparameters.get_config()["values"] | {"auc": auc, "val_auc": val_auc, "loss": loss, "val_loss": val_loss}
    print("\t".join([str(i) for i in results.values()]))

Then I read the search output into a dataframe, sorted it, grabbed the best hyperparameters, and put the files back to normal.
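
A sketch of that last post-processing step, assuming the printed rows were redirected to a file; the filename and column names below are hypothetical:

import pandas as pd

# Column order matches the printed results dict: hyperparameter values first,
# then auc, val_auc, loss, val_loss. The names here are illustrative only.
cols = ["learning_rate", "units", "auc", "val_auc", "loss", "val_loss"]
df = pd.read_csv("search_log.tsv", sep="\t", names=cols)
best = df.sort_values("val_auc", ascending=False).iloc[0]
print(best)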