Tasks

- [ ] One of the scripts in the `examples/` folder of Accelerate, or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [x] My own task or dataset (give details below)
Reproduction
```python
import accelerate
from accelerate import DistributedDataParallelKwargs
from peft import LoraConfig, get_peft_model
from transformers import GPT2Model

# Let DDP tolerate the frozen base parameters introduced by LoRA
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = accelerate.Accelerator(kwargs_handlers=[ddp_kwargs])

model = GPT2Model.from_pretrained(args.model_dir, output_hidden_states=True)
if args.pretrain == 1 and args.freeze == 1:
    peft_config = LoraConfig(
        r=128,
        lora_alpha=256,
        lora_dropout=0.1,
    )
    model = get_peft_model(model, peft_config)
model = accelerator.prepare(model)
```
Expected behavior
Here is the full traceback:
```
Traceback (most recent call last):
  File "/workspace/Graph-Network/main.py", line 174, in <module>
    model = accelerator.prepare(model)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1350, in prepare
    result = tuple(
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1351, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1226, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1460, in prepare_model
    model = model.to(self.device)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1173, in to
    return self._apply(convert)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 779, in _apply
    module._apply(fn)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
It's confusing that CUDA raises an out-of-memory error, yet unlike a typical OOM it did not even attempt to allocate any GPU memory: according to `nvidia-smi`, my GPUs are completely empty.
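For reference, here is a small diagnostic sketch (not part of the failing run) that prints what PyTorch itself can see, which can help distinguish a genuine allocation failure from a device-visibility problem (e.g. a misconfigured `CUDA_VISIBLE_DEVICES`):

```python
import torch

# Report which CUDA devices the current process can see and how much
# memory each one says is free, before any tensor is allocated.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")
```

If this prints a device count of 0 (or far less free memory than `nvidia-smi` shows), the OOM is coming from the process's view of the devices rather than from the model actually filling a GPU.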