openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Unable to save checkpoints #50

Status: Open. Opened by canamika27 1 year ago

canamika27 commented 1 year ago

Hi Team,

I was trying to fine-tune open_llama 7B on a 20 GB A100 with LoRA, using batch_size = 1 and max_seq_length = 256, but when the checkpoints are saved through the Hugging Face transformers.Trainer I get a CUDA out-of-memory error.
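
For reference, here is a minimal sketch of my setup (the checkpoint id, LoRA target modules, hyperparameters, and the dummy dataset below are placeholders, not my exact values):

```python
# Minimal sketch of the LoRA fine-tuning setup described above.
# Assumed placeholders: model id, target_modules, LoRA ranks, dataset.
import torch
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from peft import LoraConfig, get_peft_model

model_name = "openlm-research/open_llama_7b"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)  # slow tokenizer, to be safe
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Tiny placeholder dataset, tokenized to max_seq_length = 256 as in the report.
train_dataset = Dataset.from_dict({"text": ["example text"] * 16}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=256),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="open_llama_lora",
    per_device_train_batch_size=1,  # batch size 1, as described above
    save_steps=50,                  # the OOM happens when a checkpoint is saved
    fp16=True,
    logging_steps=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```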

From what I observed, the model and batch together used around 10 GB of VRAM, and this stayed constant throughout training, but when the Trainer tries to save a checkpoint at a given step it fails with a CUDA OOM. When I ran the same fine-tuning code against Meta's LLaMA 7B it worked fine, and the checkpoints were saved without any memory overhead.
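
In the meantime, one workaround I was considering (untested, not a confirmed fix) is to disable the Trainer's periodic full-model checkpoints and save only the small LoRA adapter weights myself:

```python
# Workaround sketch (assumption, not verified): skip the Trainer's own
# checkpointing and save only the LoRA adapter, which is a few MB instead of
# the full 7B state dict. Reuses `model`, `train_dataset`, and `tokenizer`
# from the snippet above.
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

args = TrainingArguments(
    output_dir="open_llama_lora",
    per_device_train_batch_size=1,
    fp16=True,
    save_strategy="no",  # no periodic full-model checkpoints
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# PEFT's save_pretrained writes only the adapter weights, not the full model.
model.save_pretrained("open_llama_lora/adapter")
```

Even if that sidesteps the error, it doesn't explain why the two models behave differently.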

As per https://github.com/openlm-research/open_llama/issues/1#issuecomment-1532311414, OpenLLaMA 7B has the same model size and architecture as Meta's LLaMA 7B, so why am I facing a CUDA OOM? Ideally both should behave the same.

It would be great if someone could look into this and help me out.