Open lopozz opened 1 month ago
I am currently trying to run a k-fold training loop. At the end of each iteration I free memory using `gc.collect()` and `torch.cuda.empty_cache()`, but it does not seem to do the job completely. I leave the code here:

```python
import gc
import os

import numpy as np
import setfit
import torch
from datasets import DatasetDict, load_from_disk
from sklearn.model_selection import StratifiedKFold

dataset = load_from_disk(os.path.join("data", cfg.kfold_kwargs.kfold_dataset_name))
folds = StratifiedKFold(n_splits=cfg.kfold_kwargs.n_splits)
splits = list(folds.split(np.zeros(dataset.num_rows), dataset[cfg.label_column]))
args = setfit.TrainingArguments(**cfg.train_kwargs)

all_metrics = []
for train_idxs, test_idxs in splits:
    fold_dataset = DatasetDict(
        {
            "train": dataset.select(train_idxs),
            "test": dataset.select(test_idxs),
        }
    )
    trainer = setfit.Trainer(
        model_init=model_init_fn(cfg.model_kwargs),
        args=args,
        train_dataset=fold_dataset["train"],
        eval_dataset=fold_dataset["test"],
        metric=custom_metrics_fn(fold_dataset, cfg.label_column),
        column_mapping={"text": "text", cfg.label_column: "label"},
    )
    trainer.train()
    # metrics = trainer.evaluate(fold_dataset["test"])
    # all_metrics.append(metrics)
    # print(metrics)

    def memory_stats():
        print(f"Memory allocated: {torch.cuda.memory_allocated()/1024**2}\nMemory reserved: {torch.cuda.memory_reserved()/1024**2}")

    memory_stats()

    del trainer.model.model_head, trainer.model.model_body
    del fold_dataset, trainer
    # torch.cuda.synchronize()
    gc.collect()
    torch.cuda.empty_cache()
    memory_stats()
    print('\n')
```
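For reference, this is a variant I was considering but have not benchmarked yet (just a sketch, reusing the same `cfg`, `model_init_fn` and `custom_metrics_fn` helpers as above): run each fold inside a function so every reference to the trainer and model goes out of scope before `gc.collect()` / `torch.cuda.empty_cache()`, and move the model body back to the CPU before dropping the last reference.

```python
# Sketch only: not the code that produced the numbers below.
def run_fold(train_idxs, test_idxs):
    fold_dataset = DatasetDict(
        {
            "train": dataset.select(train_idxs),
            "test": dataset.select(test_idxs),
        }
    )
    trainer = setfit.Trainer(
        model_init=model_init_fn(cfg.model_kwargs),
        args=args,
        train_dataset=fold_dataset["train"],
        eval_dataset=fold_dataset["test"],
        metric=custom_metrics_fn(fold_dataset, cfg.label_column),
        column_mapping={"text": "text", cfg.label_column: "label"},
    )
    trainer.train()
    metrics = trainer.evaluate(fold_dataset["test"])
    # Move the fine-tuned sentence-transformer body off the GPU so its weights
    # can actually be released once the local references die.
    trainer.model.model_body.to("cpu")
    return metrics


for train_idxs, test_idxs in splits:
    all_metrics.append(run_fold(train_idxs, test_idxs))
    # trainer / model / fold_dataset only existed inside run_fold, so the
    # collector should now be able to reclaim them.
    gc.collect()
    torch.cuda.empty_cache()
    print(f"allocated after fold: {torch.cuda.memory_allocated() / 1024**2:.1f} MiB")
```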
And here is my setup:
```
accelerate==1.0.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
alembic==1.13.3
antlr4-python3-runtime==4.9.3
async-timeout==4.0.3
attrs==24.2.0
certifi==2024.8.30
charset-normalizer==3.4.0
colorlog==6.8.2
datasets==3.0.1
dill==0.3.8
evaluate==0.4.3
filelock==3.16.1
frozenlist==1.4.1
fsspec==2024.6.1
greenlet==3.1.1
huggingface-hub==0.26.1
hydra-core==1.3.2
idna==3.10
Jinja2==3.1.4
joblib==1.4.2
Mako==1.3.6
MarkupSafe==3.0.2
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.16
networkx==3.4.2
numpy==2.1.2
nvidia-cublas-cu12==12.4.5.8
nvidia-cuda-cupti-cu12==12.4.127
nvidia-cuda-nvrtc-cu12==12.4.127
nvidia-cuda-runtime-cu12==12.4.127
nvidia-cudnn-cu12==9.1.0.70
nvidia-cufft-cu12==11.2.1.3
nvidia-curand-cu12==10.3.5.147
nvidia-cusolver-cu12==11.6.1.9
nvidia-cusparse-cu12==12.3.1.170
nvidia-nccl-cu12==2.21.5
nvidia-nvjitlink-cu12==12.4.127
nvidia-nvtx-cu12==12.4.127
omegaconf==2.3.0
optuna==4.0.0
packaging==24.1
pandas==2.2.3
pillow==11.0.0
propcache==0.2.0
psutil==6.1.0
pyarrow==17.0.0
python-dateutil==2.9.0.post0
pytz==2024.2
PyYAML==6.0.2
regex==2024.9.11
requests==2.32.3
safetensors==0.4.5
scikit-learn==1.5.2
scipy==1.14.1
sentence-transformers==3.2.1
setfit==1.1.0
six==1.16.0
SQLAlchemy==2.0.36
sympy==1.13.1
threadpoolctl==3.5.0
tokenizers==0.20.1
torch==2.5.0
tqdm==4.66.5
transformers==4.45.2
triton==3.1.0
typing_extensions==4.12.2
tzdata==2024.2
urllib3==2.2.3
xxhash==3.5.0
yarl==1.16.0
```
I also include the memory stats printed at each iteration (values in MiB; the first pair per fold is before the cleanup, the second after):
```
Memory allocated: 279.685546875
Memory reserved: 596.0
Memory allocated: 279.685546875
Memory reserved: 342.0

Memory allocated: 411.4501953125
Memory reserved: 738.0
Memory allocated: 411.4501953125
Memory reserved: 484.0

Memory allocated: 542.93359375
Memory reserved: 876.0
Memory allocated: 542.93359375
Memory reserved: 626.0

Memory allocated: 674.4638671875
Memory reserved: 1052.0
Memory allocated: 674.4638671875
Memory reserved: 780.0
```
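To figure out what is still holding that allocated memory after the `del` / `gc.collect()` calls, I was planning to dump the live CUDA tensors at the end of each fold, roughly like this (untested debugging sketch; `torch.cuda.memory_summary()` also gives a per-pool breakdown):

```python
import gc

import torch


def live_cuda_tensors():
    """Debug helper: print every tensor that is still resident on the GPU."""
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                print(type(obj).__name__, tuple(obj.shape), obj.dtype, obj.device)
        except Exception:
            # Some tracked objects raise on attribute access while being inspected.
            pass
```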
Could anyone suggest the reason?
Currently getting the same issue..
Did you ever find a solution, @lopozz?