bogedy opened this issue 1 year ago
Hello!
Just intuitively, I wouldn't expect e1a5375aa89c9f676319b82a44d16d7afc45e6e7 to have fixed this issue. I'm aware that others have experienced OOM issues with the hyperparameter search, but I don't think anyone has successfully debugged it so far. In other words, I suspect the issue still persists.
Hi @bogedy, can I ask how exactly you applied the suggestion in https://github.com/huggingface/transformers/issues/13019?
I'm running into the same OutOfMemoryError when doing hyperparameter search with Optuna, but I'm not sure how to apply the suggestion in the issue you reference, since there is no checkpointing in SetFit's trainer. Please let me know.
Want to share which versions of SetFit, Optuna and PyTorch, and which base model you're using, so I can try to reproduce?
I had to edit the SetFit source code. It's in the second code block under "Updates to remedy the issue". Basically it's a hacky workaround: the _objective
function gets called at the end of each trial to evaluate the trial. You add some code to it that deletes the model and runs the garbage collector, which is fine as long as that code comes after you run your evaluation.
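Roughly, the patch looks like this. This is only a sketch, not SetFit's actual source: the surrounding _objective body is simplified, and the trainer variable and the "accuracy" metric key are assumptions you'd adapt to your own setup. The important part is the ordering: evaluate first, then free the model.

    import gc
    import torch

    def _objective(trial):
        # ... train the SetFit model for this trial (omitted) ...
        metrics = trainer.evaluate()   # compute the trial's metric FIRST
        del trainer.model              # then drop this trial's model
        gc.collect()                   # reclaim Python-side references
        torch.cuda.empty_cache()       # release cached GPU memory
        return metrics["accuracy"]     # or whatever objective you optimize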
This error is common enough with Optuna that they have documentation on it, plus an argument to run gc automatically: https://optuna.readthedocs.io/en/stable/faq.html#how-do-i-avoid-running-out-of-memory-oom-when-optimizing-studies
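For reference, that built-in knob is gc_after_trial on Study.optimize. A minimal sketch with a placeholder objective:

    import optuna

    def objective(trial):
        # ... build, train and evaluate a model for this trial (omitted) ...
        return 0.0  # placeholder metric

    study = optuna.create_study(direction="maximize")
    # gc_after_trial=True runs the Python garbage collector after every trial,
    # which is Optuna's own mitigation for memory building up across trials.
    study.optimize(objective, n_trials=20, gc_after_trial=True)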
Still an active issue with Optuna. I'm not using Hugging Face, but just running an Optuna hyperparameter optimization with big Keras models is impossible because the GPU memory allocator bugs out before a trial even begins, unless trivially small batch sizes are used.
Hi! I had the same problem and got a working version by rewriting the hyperparameter_search function following this issue: https://github.com/huggingface/transformers/issues/13019
Just updated it according to the current state of the module:
    import gc
    import os
    from typing import List, Union

    import optuna
    import torch
    from transformers.trainer_utils import (
        PREFIX_CHECKPOINT_DIR,
        BestRun,
        HPSearchBackend,
        default_compute_objective,
    )


    def run_hp_search_optuna(trainer, n_trials, direction, **kwargs):
        def _objective(trial, checkpoint_dir=None):
            checkpoint = None
            if checkpoint_dir:
                for subdir in os.listdir(checkpoint_dir):
                    if subdir.startswith(PREFIX_CHECKPOINT_DIR):
                        checkpoint = os.path.join(checkpoint_dir, subdir)
            if not checkpoint:
                # Free GPU memory left over from the previous trial's model.
                del trainer.model
                gc.collect()
                torch.cuda.empty_cache()
            trainer.objective = None
            trainer.train(resume_from_checkpoint=checkpoint, trial=trial)
            # If there hasn't been any evaluation during the training loop.
            if getattr(trainer, "objective", None) is None:
                metrics = trainer.evaluate()
                trainer.objective = trainer.compute_objective(metrics)
            return trainer.objective

        timeout = kwargs.pop("timeout", None)
        n_jobs = kwargs.pop("n_jobs", 1)
        study = optuna.create_study(direction=direction, **kwargs)
        study.optimize(_objective, n_trials=n_trials,
                       timeout=timeout, n_jobs=n_jobs)
        best_trial = study.best_trial
        return BestRun(str(best_trial.number), best_trial.value, best_trial.params)


    def hyperparameter_search(
        self,
        hp_space,
        n_trials,
        direction,
    ) -> Union[BestRun, List[BestRun]]:
        self.hp_search_backend = HPSearchBackend.OPTUNA
        self.hp_space = hp_space
        self.hp_name = None
        self.compute_objective = default_compute_objective
        best_run = run_hp_search_optuna(self, n_trials, direction)
        self.hp_search_backend = None
        return best_run
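Calling it would look roughly like this. This is hypothetical usage, not part of the rewrite above: trainer is assumed to be a SetFit Trainer built with a model_init, and hp_space is your own search-space function. Since hyperparameter_search takes self as its first parameter, it can be called as a free function or patched onto the trainer as a method.

    def hp_space(trial):
        # Hypothetical search space; adjust to your own hyperparameters.
        return {
            "learning_rate": trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True),
            "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        }

    best_run = hyperparameter_search(trainer, hp_space, n_trials=10, direction="maximize")
    print(best_run.hyperparameters)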
Even though it is probably better if one just runs Optuna with gc_after_trial=True.
I had this problem and I see that in the repo's hyperparameter notebook someone else had this problem too! https://github.com/huggingface/setfit/blob/main/notebooks/text-classification_hyperparameter-search.ipynb
I fixed it by following this advice here https://github.com/huggingface/transformers/issues/13019
I wanted to make a pull request, but when I tried to reproduce the issue later (after pulling new changes) I couldn't. The memory use stayed constant over all the trials. Did e1a5375aa89c9f676319b82a44d16d7afc45e6e7 fix this? I'm curious. Would love to supply a PR if it's helpful, but maybe it's fixed already.