bogedy opened this issue 1 year ago
Hello!
Just intuitively, I wouldn't expect e1a5375aa89c9f676319b82a44d16d7afc45e6e7 to have fixed this issue. I'm aware that others have experienced OOM issues with the hyperparameter search, but I don't think anyone has successfully debugged it so far. In other words, I suspect the issue still persists.
Hi @bogedy, can I ask how exactly you applied the suggestion in https://github.com/huggingface/transformers/issues/13019?
I'm running into the same OutOfMemoryError when doing hyperparameter search with Optuna, but I'm not sure how to apply the suggestion in the issue you reference, since there is no checkpointing in SetFit's trainer. Please let me know.
Want to share which versions of SetFit, Optuna and PyTorch, and which base model you're using, so I can try to reproduce?
I had to edit the SetFit source code. It's in the second code block under "Updates to remedy the issue". Basically it's a hacky workaround: the _objective
function gets called at the end of each trial to evaluate the trial. You add some code to it that deletes the model and runs the garbage collector, which is fine as long as that code comes after you run your evaluation.
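Roughly, the patch looks like this. This is only a sketch, not SetFit's actual source: the surrounding _objective body is simplified, and the trainer variable and the "accuracy" metric key are assumptions you'd adapt to your own setup. The important part is the ordering: evaluate first, then free the model.

    import gc
    import torch

    def _objective(trial):
        # ... train the SetFit model for this trial (omitted) ...
        metrics = trainer.evaluate()   # compute the trial's metric FIRST
        del trainer.model              # then drop this trial's model
        gc.collect()                   # reclaim Python-side references
        torch.cuda.empty_cache()       # release cached GPU memory
        return metrics["accuracy"]     # or whatever objective you optimize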
This error is common enough with Optuna that they have documentation on it, plus an argument to run gc automatically: https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-avoid-running-out-of-memory-oom-when-optimizing-studies
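For reference, that built-in knob is gc_after_trial on Study.optimize. A minimal sketch with a placeholder objective:

    import optuna

    def objective(trial):
        # ... build, train and evaluate a model for this trial (omitted) ...
        return 0.0  # placeholder metric

    study = optuna.create_study(direction="maximize")
    # gc_after_trial=True runs the Python garbage collector after every trial,
    # which is Optuna's own mitigation for memory building up across trials.
    study.optimize(objective, n_trials=20, gc_after_trial=True)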
Still an active issue with Optuna. I'm not using Hugging Face, but just running an Optuna hyperparameter optimization with big Keras models is impossible because the GPU memory allocator bugs out before a trial even begins, unless trivially small batch sizes are used.
Hi! I had the same problem and got a working version by rewriting the hyperparameter_search function following this issue: https://github.com/huggingface/transformers/issues/13019
Just updated it according to the current state of the module:
    import gc
    import os
    from typing import List, Union

    import optuna
    import torch
    from transformers.trainer_utils import (
        PREFIX_CHECKPOINT_DIR,
        BestRun,
        HPSearchBackend,
        default_compute_objective,
    )


    def run_hp_search_optuna(trainer, n_trials, direction, **kwargs):
        def _objective(trial, checkpoint_dir=None):
            checkpoint = None
            if checkpoint_dir:
                for subdir in os.listdir(checkpoint_dir):
                    if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                        checkpoint = os.path.join(checkpoint_dir, subdir)
            if not checkpoint:
                # Free GPU memory left over from the previous trial's model.
                del trainer.model
                gc.collect()
                torch.cuda.empty_cache()
            trainer.objective = None
            trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
            return trainer.objective

        timeout = kwargs.pop("timeout", None)
        n_jobs = kwargs.pop("n_jobs", 1)
        study = optuna.create_study(direction=direction, **kwargs)
        study.optimize(_objective, n_trials=n_trials,
                       timeout=timeout, n_jobs=n_jobs)
        best_trial = study.best_trial
        return BestRun(str(best_trial.number), best_trial.value, best_trial.params)


    def hyperparameter_search(
        self,
        hp_space,
        n_trials,
        direction,
    ) -> Union[BestRun, List[BestRun]]:
        self.hp_search_backend = HPSearchBackend.OPTUNA
        self.hp_space = hp_space
        self.hp_name = None
        self.compute_objective = default_compute_objective
        best_run = run_hp_search_optuna(self, n_trials, direction)
        self.hp_search_backend = None
        return best_run
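Calling it would look roughly like this. This is hypothetical usage, not part of the rewrite above: trainer is assumed to be a SetFit Trainer built with a model_init, and hp_space is your own search-space function. Since hyperparameter_search takes self as its first parameter, it can be called as a free function or patched onto the trainer as a method.

    def hp_space(trial):
        # Hypothetical search space; adjust to your own hyperparameters.
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
            "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        }

    best_run = hyperparameter_search(trainer, hp_space, n_trials=10, direction="maximize")
    print(best_run.hyperparameters)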
Even though it is probably better if one just runs Optuna with gc_after_trial=True.
I had this problem and I see that in the repo's hyperparameter notebook someone else had this problem too! https://github.com/huggingface/setfit/blob/main/notebooks/text-classification_hyperparameter-search.ipynb
I fixed it by following this advice here https://github.com/huggingface/transformers/issues/13019
I wanted to make a pull request, but when I tried to reproduce the issue later (after pulling new changes) I couldn't. The memory use stayed constant over all the trials. Did e1a5375aa89c9f676319b82a44d16d7afc45e6e7 fix this? I'm curious. Would love to supply a PR if it's helpful, but maybe it's fixed already.