Closed chschroeder closed 1 year ago
Hello @chschroeder,
Thank you for the heads up!
You mention that repeated uses of trainer.train() cause endless GPU memory growth. I'm trying to reproduce this in SetFit with a minimal example, but am unable to get these results. For example in the following script:
This constantly prints 892729856 (i.e. 0.9GB) as the allocated memory.
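A check of this kind can be sketched as follows. This is only a minimal sketch and not the script from the comment: track_allocated, bytes_to_gb and is_strictly_growing are hypothetical helper names, and the memory reader is passed in as a function so that the loop also runs on a machine without a GPU. With PyTorch, torch.cuda.memory_allocated would presumably serve as the reader and trainer.train as the step.

```python
from typing import Callable, List


def bytes_to_gb(num_bytes: int) -> float:
    """Convert a byte count to decimal gigabytes."""
    return num_bytes / 1e9


def track_allocated(step: Callable[[], object],
                    read_allocated: Callable[[], int],
                    repeats: int = 5) -> List[int]:
    """Run `step` repeatedly and record the allocated bytes after each run."""
    readings = []
    for _ in range(repeats):
        step()
        readings.append(read_allocated())
    return readings


def is_strictly_growing(readings: List[int]) -> bool:
    """True if every reading is larger than the one before it."""
    return all(b > a for a, b in zip(readings, readings[1:]))


# Assumed usage with a real trainer (not executed here):
#   readings = track_allocated(trainer.train, torch.cuda.memory_allocated)
#   print([round(bytes_to_gb(r), 2) for r in readings])
```

A flat series of readings corresponds to the constant value reported above; a strictly growing series would indicate the leak described in the issue.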
Similarly, if I run the hyperparameter tuning from the README, except on the sst2 dataset, with a fixed batch size of 8, just 1 epoch, and the sentence-transformers/paraphrase-mpnet-base-v2 model as the body, then I report values of around 0.9GB allocated throughout the entire hyperparameter tuning process of 20 trials.
Would it be possible to provide a script using either SetFit or a SentenceTransformer where the endless growth is reported? That way it can eventually be used to verify that a potential fix works or modified into a test over on the sentence-transformer repository. Feel free to post the response on the aforementioned issue on the sentence-transformers repository, too.
> Would it be possible to provide a script using either SetFit or a SentenceTransformer where the endless growth is reported?
Sure! I am on it. My first attempt where I used a setfit example from scratch did not show the same behaviour, which is why I have to take a step back and take a closer look at my own code. I guess it might be related to the garbage collection because my own setup brings its own abstractions so the references might be cleaned up later. "Normal" transformer models as a cross check work flawlessly using this setup which is why I still think there is some problem elsewhere. I will investigate further and provide a script or notebook.
Update: I could not completely solve it yet, this is really strange.
I continued investigating my own example (active learning instead of the setfit-only example):
print('Before query:', torch.cuda.memory_stats()['inactive_split_bytes.all.current'] / 1024**2)
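The counter printed above is one of several entries in the dictionary returned by torch.cuda.memory_stats(). As a rough sketch, the three counters that seem most relevant to this discussion can be reported together in MiB. The helper name summarize_stats is hypothetical; the key names are the ones PyTorch uses, and the function works on any mapping so it can be tried without a GPU.

```python
from typing import Dict, Mapping

# Counters from torch.cuda.memory_stats(), all in bytes.
KEYS = (
    "allocated_bytes.all.current",       # memory occupied by live tensors
    "reserved_bytes.all.current",        # memory held by the caching allocator
    "inactive_split_bytes.all.current",  # cached blocks that are split and unused
)


def summarize_stats(stats: Mapping[str, int]) -> Dict[str, float]:
    """Return the selected counters in MiB (missing keys count as 0)."""
    return {key: stats.get(key, 0) / 1024**2 for key in KEYS}


# Assumed usage inside the loop (not executed here):
#   print("Before query:", summarize_stats(torch.cuda.memory_stats()))
```

Comparing allocated with reserved memory helps to tell a real leak (allocated keeps growing) from fragmentation (reserved grows while allocated stays flat).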
My assessment so far:
I might defer further investigation but I will report back in this issue once I find some time to continue investigating here.
I just ran into this in a SageMaker Studio instance (4 vCPU + 16 GiB + 1 GPU). I trained a model before, using a default dataset from Datasets. My next attempt is using 8 samples each for 6 classes, with my custom text. I am curious to know whether there is a max size limit for text, and could that be an issue? Some of my texts are large (like 1 page of a PDF form). I killed my previous instance and started up a fresh instance but am still running into this.
OutOfMemoryError: CUDA out of memory. Tried to allocate 72.00 MiB (GPU 0; 14.76 GiB total capacity; 13.93 GiB already allocated; 15.75 MiB free; 13.98 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@anjanvb If I understand you correctly, your error likely does not occur over time but instantly. This is unrelated to the specific issue described above.
(You have, however, guessed correctly: a) there is a token limit, and b) the GPU memory usage scales with the text length. So your problem is likely caused by the length of your input data. Search for the arguments max_length in SetFit and max_seq_length in sentence transformers.)
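To illustrate why the text length matters so much: in a transformer, each attention layer builds a score matrix of shape (batch, heads, seq_len, seq_len), so this part of the memory grows quadratically with the sequence length. The sketch below only counts that one matrix for a single layer and assumes 12 attention heads and 4-byte floats; both are assumptions for illustration, and real usage is higher because of all other activations, weights and optimizer state. The function name is hypothetical.

```python
def attention_scores_bytes(batch_size: int, seq_len: int,
                           num_heads: int = 12, bytes_per_value: int = 4) -> int:
    """Bytes needed for one layer's attention score matrix.

    Shape is (batch_size, num_heads, seq_len, seq_len); num_heads=12 and
    float32 values are assumptions, not values read from a model config.
    """
    return batch_size * num_heads * seq_len * seq_len * bytes_per_value


for seq_len in (128, 256, 512):
    mib = attention_scores_bytes(batch_size=8, seq_len=seq_len) / 1024**2
    print(f"seq_len={seq_len}: {mib:.0f} MiB per layer")
```

Under these assumptions, going from 128 to 512 tokens multiplies this term by 16, which is why capping the sequence length is usually the first thing to try for page-long inputs.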
Regarding the original issue: Sorry, I did not have time to investigate this further. Feel free to close the issue if you prefer; as long as everything is documented here, it might still be useful in the future.
I'll close this for now, as the issue will still be accessible here. Should you find anything new on the topic that is worth our attention, then do not hesitate to reopen this issue or make a new one altogether. I'm personally interested in any memory leak issues that may exist, and I'd like to thank you for the time that you spent looking into this.
Thank you, Tom! Unfortunately I could not manage to produce a failing self-contained example (given my limited current time budget at least). I think I know how to fix the problem (and that is by adding explicit delete statements) but such a code change is hard to justify if we cannot reliably verify it. Also, the problem is located in sentence-transformers and not in setfit, but nevertheless, it is good that this is documented here as well.
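The general pattern behind "explicit delete statements" can be sketched as below. This is not the fix for sentence-transformers (which was never pinned down in this thread), only an illustration of dropping a reference, forcing garbage collection, and then releasing cached GPU blocks. FakeModel, free_gpu_memory and run_rounds are hypothetical names; FakeModel contains a reference cycle so that the effect of gc.collect() is visible, and the PyTorch calls are skipped when PyTorch or a GPU is not available.

```python
import gc
import weakref

try:
    import torch
except ImportError:  # the sketch still runs without PyTorch
    torch = None


class FakeModel:
    """Stand-in for a model; the self-reference forms a reference cycle."""

    def __init__(self):
        self.me = self


def free_gpu_memory():
    """Collect unreachable objects, then release cached CUDA blocks if possible."""
    gc.collect()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_rounds(make_model, n_rounds):
    """Create a model per round, delete it explicitly, and return weak references."""
    refs = []
    for _ in range(n_rounds):
        model = make_model()
        refs.append(weakref.ref(model))
        # ... train / predict with `model` would happen here ...
        del model          # drop our reference explicitly
        free_gpu_memory()  # the cycle is only reclaimed by the collector
    return refs


refs = run_rounds(FakeModel, 3)
print(all(ref() is None for ref in refs))
```

The weak references make it possible to verify that the objects are really gone, which is the kind of check a regression test for a fix would need.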
For future reference: just two days ago I encountered the problem again, so it is still there. This notebook failed in 3 out of 5 tries and succeeded otherwise. The notebook was run on a 2080 with 11GB.
I ran into this OOM error myself, doing HPO with SetFit.
Tried a lot of things etc.
TL;DR :
My case:
Poetry env, torch 2.1.1, SetFit 1.0.1, sentence-transformers 2.2.2, Jupyter notebook
(old!) Hardware: 1080 Ti (11 GB VRAM), 32 GB RAM.
Model used: paraphrase-multilingual-mpnet-base-v2.
OOM error after 2 or 3 trials; could extend to 6-7 trials by playing with smaller models (mini-LM) and the batch size.
OOM error using:
from optuna import Trial
from setfit import SetFitModel, Trainer, sample_dataset

# `dataset` and `eval_dataset` are assumed to be loaded beforehand (not shown in this post).
# Optional, but for test purposes 8 ex. per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8, seed=40)

def model_init(params):
    params = params or {}
    max_iter = params.get("max_iter", 100)
    solver = params.get("solver", "liblinear")
    params = {
        "head_params": {
            "max_iter": max_iter,
            "solver": solver,
        }
    }
    return SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2", **params)

def hp_space(trial):
    """ Define hyperparams search space (Optuna) """
    return {
        # Embeddings fine-tuning phase params :
        "body_learning_rate": trial.suggest_float("body_learning_rate", 1e-6, 1e-3, log=True),
        "num_epochs": trial.suggest_int("num_epochs", 1, 3),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32]),
        "seed": trial.suggest_int("seed", 1, 40),
        # LogisticRegression head params :
        "max_iter": trial.suggest_int("max_iter", 50, 300),
        "solver": trial.suggest_categorical("solver", ["newton-cg", "liblinear", "lbfgs"]),
    }

trainer = Trainer(
    model_init=model_init,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"comment": "text", "label": "label"},
)
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=4)
I could extend the number of trials (to at least 10) by doing some memory management.
Before that I tested a lot of things: checking my dataset (French) and sentence structure, max_seq, smaller models and batch sizes, etc.
It will eventually still run OOM, and IMO a better solution would probably be tweaking the optimizer/scheduler, as said in other threads.
(Partial) fix to extend the time before running OOM:
Overwrite the run_hp_search_optuna function, more specifically _objective, adding memory management.
Memory management is also added in model_init (helps even more).
import gc
import torch
from optuna import Trial
from setfit import Trainer, SetFitModel, sample_dataset
from transformers.trainer_utils import BestRun  # needed for the return value below

# Model initialization function
def model_init(params):
    params = params or {}
    max_iter = params.get("max_iter", 100)
    solver = params.get("solver", "liblinear")
    params = {
        "head_params": {
            "max_iter": max_iter,
            "solver": solver,
        }
    }
    # memory management
    gc.collect()
    torch.cuda.empty_cache()
    return SetFitModel.from_pretrained("sentence-transformers/paraphrase-multilingual-mpnet-base-v2", **params)

# Hyperparameter space definition
def hp_space(trial):
    """ Define hyperparams search space (Optuna) """
    return {
        # Embeddings fine-tuning phase params :
        "body_learning_rate": trial.suggest_float("body_learning_rate", 1e-7, 1e-5, log=True),  # 1e-6, 1e-3
        # "num_epochs": trial.suggest_int("num_epochs", 1, 2),
        "max_steps": trial.suggest_int("max_steps", 650, 800),  # 200, 900
        "batch_size": trial.suggest_categorical("batch_size", [16]),
        "seed": trial.suggest_int("seed", 1, 40),
        # LogisticRegression head params :
        "max_iter": trial.suggest_int("max_iter", 120, 126),  # 100, 200
        "solver": trial.suggest_categorical("solver", ["liblinear"]),  # "newton-cg", "lbfgs"
    }

# Customized run_hp_search_optuna function
def run_hp_search_optuna_modified(trainer, n_trials, direction, **kwargs):
    import optuna

    def _objective(trial):
        trainer.objective = None
        trainer.train(trial=trial)
        # Evaluate if needed
        if getattr(trainer, "objective", None) is None:
            metrics = trainer.evaluate()
            trainer.objective = trainer.compute_objective(metrics)
        # memory management: free the trial's model only after it has been
        # evaluated, since evaluate() still needs trainer.model
        del trainer.model
        gc.collect()
        torch.cuda.empty_cache()
        return trainer.objective

    timeout = kwargs.pop("timeout", None)
    n_jobs = kwargs.pop("n_jobs", 1)
    study = optuna.create_study(direction=direction, **kwargs)
    # memory management : overkill, but also adding gc_after_trial=True in study.optimize()
    study.optimize(_objective, n_trials=n_trials, timeout=timeout, n_jobs=n_jobs, gc_after_trial=True)
    best_trial = study.best_trial
    return BestRun(str(best_trial.number), best_trial.value, best_trial.params, study)

# Initialize Trainer
trainer = Trainer(
    model_init=model_init,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    column_mapping={"comment": "text", "label": "label"},
)

# Replace the run_hp_search_optuna method with the modified one
trainer.run_hp_search_optuna = run_hp_search_optuna_modified

# Run hyperparameter search
best_run = trainer.hyperparameter_search(direction="maximize", hp_space=hp_space, n_trials=3)
Thanks @matthieuvion, but it didn't seem to make much difference for me - every implementation seems to have memory issues.
Running 3 trials of HPO on 64 short paragraphs of text with intfloat/multilingual-e5-small on a 4090:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 768.00 MiB. GPU 0 has a total capacity of 23.64 GiB of which 592.62 MiB is free. Including non-PyTorch memory, this process has 21.52 GiB memory in use. Of the allocated memory 20.92 GiB is allocated by PyTorch, and 156.86 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Maybe I'm missing something more fundamental?
Had the exact same error. Honestly, I spent too much time trying to understand where it could come from (a week of trials), and your issue could be different. That's why I still posted my half-baked "solution" here. It does not prevent saturating the VRAM eventually. I just did a quick search for "e5 OOM" and there are plenty of possibilities.
SetFit is a wonderful model/library with impressive few-shot performance, but as soon as I tried to go further (HPO, differentiable head, ONNX conversion etc.), some workarounds could not be applied easily (at least by me), as it is a wrapper around transformers and some features are not yet implemented.
In my case I ended up doing HPO with the same model on multiple runs, and then with a smaller model, without OOM errors. I believe you can still retain some hyperparameters, particularly the one related to the logistic head (if that's the one you're using).
There are some other tricks you can try:
- watch nvidia-smi to see which other programs are using your GPU, then close unnecessary ones.
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"
- os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
- use_amp=True.
- batch_size=(4, 2)
- Check the model's config.json file for num_hidden_layers, num_attention_heads, max_position_embeddings, intermediate_size, hidden_size, etc. I'm experimenting with multilingual-e5-large-instruct and it is very complex and demanding on memory. I usually start with small -> base -> large.

I don't really have time to dive into setfit source code, so can only say:
But you are right, if you're able to work within these limits, setfit is an awesome package. We use it a lot for training production models.
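Regarding the two PYTORCH_CUDA_ALLOC_CONF tips above: both assignments write to the same environment variable, so the second one replaces the first. Several allocator options can instead be given in one value, separated by commas. The sketch below uses a hypothetical helper, build_alloc_conf, to assemble such a value; it assumes the variable is set before PyTorch initialises CUDA, which is safest to guarantee by setting it before importing torch.

```python
import os


def build_alloc_conf(**options) -> str:
    """Join allocator options into the 'key:value,key:value' format."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Set this before `import torch` so the caching allocator sees it at start-up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf(expandable_segments=True)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])

# An alternative value for the same variable (use one or the other):
print(build_alloc_conf(max_split_size_mb=128))
```

These settings target fragmentation, that is, memory that is reserved but not allocated. In the error message above, nearly all memory is reported as allocated, so they may not help much in that particular case.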
Hi,
When repeatedly using SetFit's train()/predict() inside a loop (for active learning) the GPU memory usage steadily grows (despite that all results have been correctly transferred to the CPU). Eventually, OutOfMemoryError: CUDA out of memory. is raised.
This is caused by sentence-transformers and I have reported it here in detail: UKPLab/sentence-transformers/issues/1793
Just wanted to mention it here as well for the purposes of documentation.