philschmid / deep-learning-pytorch-huggingface


CUDA OOM error while saving the model #16

Closed aasthavar closed 1 year ago

aasthavar commented 1 year ago

Hi @philschmid! Thanks a lot for your blog on fine-tuning FLAN-T5-XXL with LoRA. I was trying the same on a custom dataset. Some more details -

dataset_size = 6k records
instance_type = AWS ml.g5.16xlarge
batch_size = 2
gradient_accumulation_steps = 2
learning_rate = 1e-3
num_train_epochs = 1

Training completes with this output - {'train_runtime': 1364.2004, 'train_samples_per_second': 0.733, 'train_steps_per_second': 0.183, 'train_loss': 1.278140380859375, 'epoch': 1.0}

But I get a CUDA OOM error at the point of saving the model. Error -

ErrorMessage “OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 MiB (GPU 0; 22.19
GiB total capacity; 20.34 GiB already allocated; 32.50 MiB free; 20.96 GiB
reserved in total by PyTorch) If reserved memory is >> allocated memory try
setting max_split_size_mb to avoid fragmentation. See documentation for Memory
Management and PYTORCH_CUDA_ALLOC_CONF”

Code -

# Save the fine-tuned model and tokenizer to the SageMaker model directory
trainer.save_model(os.environ["SM_MODEL_DIR"])
tokenizer.save_pretrained(os.environ["SM_MODEL_DIR"])

Do you have any suggestions on how to solve this error?
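
The error message also suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF; a minimal sketch of trying that (the 128 MiB value below is only illustrative, not a verified fix):

import os

# Must be set before the first CUDA allocation (ideally before importing torch);
# 128 is an illustrative value, not a setting confirmed anywhere in this thread.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"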

philschmid commented 1 year ago

Can you try downsampling your dataset to see if the script works and whether the issue is the size of your dataset?

aasthavar commented 1 year ago

I just tried with a training dataset of 1000 samples, as well as batch_size=1 and gradient_accumulation_steps=4, but got the same error.

philschmid commented 1 year ago

Can you try with gradient_accumulation_steps=1? gradient_accumulation_steps increases memory usage quite a bit. And which peft version are you using?
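
For reference, a minimal sketch of what that change looks like, assuming the Seq2SeqTrainingArguments setup from the blog (output_dir is a placeholder; the other values mirror the settings reported above):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="output",            # placeholder path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,  # suggested setting to test
    learning_rate=1e-3,
    num_train_epochs=1,
)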

aasthavar commented 1 year ago

Sure, I just started the training job and will post updates here as soon as it's completed. Currently using peft version 0.3.0.

aasthavar commented 1 year ago

Got the CUDA OOM error again. Details -

"OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 22.19
 GiB total capacity; 20.52 GiB already allocated; 8.50 MiB free; 20.99 GiB
 reserved in total by PyTorch) If reserved memory is >> allocated memory try
 setting max_split_size_mb to avoid fragmentation.  See documentation for Memory
 Management and PYTORCH_CUDA_ALLOC_CONF"

philschmid commented 1 year ago

That's really weird. Can you try peft 0.2.0? I will try to rerun it myself.

aasthavar commented 1 year ago

Okay, thank you!

aasthavar commented 1 year ago

Hi @philschmid. It works after following your suggestion! The combination of peft==0.2.0 and accelerate==0.17.1 worked. I was previously using the latest versions of peft (0.3.0) and accelerate (0.19.0).

Final requirements.txt -

transformers==4.27.2
datasets==2.9.0
accelerate==0.17.1
evaluate==0.4.0
bitsandbytes==0.37.1
loralib
peft==0.2.0
pynvml

Thanks a lot for your suggestions!

What I didn't quite understand is how a CUDA OOM error relates to the versions of a bunch of libraries. How could someone backtrack from the error to this solution?

philschmid commented 1 year ago

Awesome. Pinned peft to 0.2.0.

vineetsharma14 commented 9 months ago

Hello there,

I used the suggested versions of the libraries, but it did not resolve the issue.

Training Args:

per_device_train_batch_size=1,
gradient_accumulation_steps=4

I am using bitsandbytes with the configuration below.

import torch
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization and bfloat16 compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
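
For context, a minimal sketch of how this config is typically passed when loading the model, assuming a recent transformers version and FLAN-T5-XXL as the base model (the model id is illustrative):

from transformers import AutoModelForSeq2SeqLM

# Load the base model in 4-bit using the bnb_config defined above;
# device_map="auto" requires accelerate to be installed.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "google/flan-t5-xxl",   # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)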

This is on an RTX 3090 with 24 GB of VRAM, but it did not resolve the issue either.

The GPU VRAM utilisation gradually increases during training and then I get a CUDA OOM error.
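
A minimal sketch of logging the allocator state between steps to see where the growth happens (the helper below is hypothetical, not part of the training script):

import torch

# Hypothetical helper: print allocated vs. reserved CUDA memory at a given step.
def log_cuda_memory(step):
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")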

Any suggestions on how I can resolve this?

Thanks for the help.