huggingface / setfit

Efficient few-shot learning with Sentence Transformers
https://hf.co/docs/setfit
Apache License 2.0

Trainer does not release all CUDA memory #567

Open lopozz opened 1 month ago

lopozz commented 1 month ago

I am currently trying to run a k-fold training loop. At the end of each iteration I free memory using gc.collect() and torch.cuda.empty_cache(), but that does not seem to do the job completely. Here is the code:

import gc
import os

import numpy as np
import setfit
import torch
from datasets import DatasetDict, load_from_disk
from sklearn.model_selection import StratifiedKFold

# `cfg`, `model_init_fn` and `custom_metrics_fn` come from my Hydra config and
# helper functions (omitted here).
dataset = load_from_disk(os.path.join("data", cfg.kfold_kwargs.kfold_dataset_name))

folds = StratifiedKFold(n_splits=cfg.kfold_kwargs.n_splits)
splits = list(folds.split(np.zeros(dataset.num_rows), dataset[cfg.label_column]))

args = setfit.TrainingArguments(**cfg.train_kwargs)

all_metrics = []


def memory_stats():
    print(
        f"Memory allocated: {torch.cuda.memory_allocated() / 1024**2}\n"
        f"Memory reserved: {torch.cuda.memory_reserved() / 1024**2}"
    )


for train_idxs, test_idxs in splits:
    fold_dataset = DatasetDict(
        {
            "train": dataset.select(train_idxs),
            "test": dataset.select(test_idxs),
        }
    )

    trainer = setfit.Trainer(
        model_init=model_init_fn(cfg.model_kwargs),
        args=args,
        train_dataset=fold_dataset["train"],
        eval_dataset=fold_dataset["test"],
        metric=custom_metrics_fn(fold_dataset, cfg.label_column),
        column_mapping={"text": "text", cfg.label_column: "label"},
    )

    trainer.train()

    # metrics = trainer.evaluate(fold_dataset["test"])
    # all_metrics.append(metrics)
    # print(metrics)

    memory_stats()  # before cleanup

    # Drop every reference to the model and trainer, then free cached blocks.
    del trainer.model.model_head, trainer.model.model_body
    del fold_dataset, trainer
    # torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()

    memory_stats()  # after cleanup
    print('\n')
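For reference, a more aggressive variant of the cleanup block at the end of the loop body (a sketch only: it assumes the SentenceTransformer body in trainer.model.model_body holds most of the GPU tensors, and neither the CPU move nor torch.cuda.ipc_collect() is verified to fix the growth shown below):

    # Sketch: move the transformer body back to CPU before dropping the
    # references, then release cached and IPC memory. Whether this actually
    # frees the per-fold growth is unverified.
    trainer.model.model_body.to("cpu")
    del trainer.model.model_head, trainer.model.model_body
    del fold_dataset, trainer
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()  # extra step; releases IPC-cached blocks if any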

and my setup:

accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
alembic==1.13.3
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.4.0
colorlog==6.8.2
datasets==3.0.1
dill==0.3.8
evaluate==0.4.3
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
greenlet==3.1.1
huggingface-hub==0.26.1
hydra-core==1.3.2
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
Mako==1.3.6
MarkupSafe==3.0.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.1.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
omegaconf==2.3.0
optuna==4.0.0
packaging==24.1
pandas==2.2.3
pillow==11.0.0
propcache==0.2.0
psutil==6.1.0
pyarrow==17.0.0
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.2.1
setfit==1.1.0
six==1.16.0
SQLAlchemy==2.0.36
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.20.1
torch==2.5.0
tqdm==4.66.5
transformers==4.45.2
triton==3.1.0
typing_extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
xxhash==3.5.0
yarl==1.16.0

Here is the memory usage (MB) printed at each iteration, before and after the cleanup:

Memory allocated: 279.685546875
Memory reserved: 596.0
Memory allocated: 279.685546875
Memory reserved: 342.0

Memory allocated: 411.4501953125
Memory reserved: 738.0
Memory allocated: 411.4501953125
Memory reserved: 484.0

Memory allocated: 542.93359375
Memory reserved: 876.0
Memory allocated: 542.93359375
Memory reserved: 626.0

Memory allocated: 674.4638671875
Memory reserved: 1052.0
Memory allocated: 674.4638671875
Memory reserved: 780.0

Allocated memory grows by roughly 131 MB per fold and is never released, even after deleting the trainer and the model. Could anyone suggest the reason?
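One way to narrow this down would be to list which CUDA tensors are still alive after the cleanup. A small diagnostic sketch (it just walks the objects tracked by the garbage collector, so any lingering reference to the old model body or head should show up):

import gc
import torch

def live_cuda_tensors():
    """Print every CUDA tensor the garbage collector still tracks."""
    total = 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                total += obj.numel() * obj.element_size()
                print(type(obj).__name__, tuple(obj.shape), obj.dtype)
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            pass
    print(f"Total bytes held by live CUDA tensors: {total / 1024**2:.1f} MB")

# Call right after gc.collect() / torch.cuda.empty_cache() in the loop.
live_cuda_tensors()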

muhammadravi251001 commented 3 days ago

I am currently getting the same issue.

muhammadravi251001 commented 2 days ago

Have you found a solution in the meantime? @lopozz