Hi Team,
I am trying to fine-tune open-llama 7b on a 20 GB A100 with LoRA (batch size = 1, max_seq_length = 256), but when the Hugging Face `transformers.Trainer` saves a checkpoint I get a CUDA out-of-memory error.
From my observation, the model and batch together take around 10 GB of VRAM, and usage stays constant throughout training; but when the Trainer tries to save a checkpoint at the specified step, it fails with CUDA OOM. When I ran the same fine-tuning code on Meta's llama-7b, it worked fine and the checkpoints were saved without any extra memory overhead.
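For reference, here is a minimal sketch of the kind of setup I mean (not my exact script; model choices, LoRA hyperparameters, and `save_steps` here are illustrative, and `train_ds` stands in for a dataset tokenized to max length 256):

```python
# Sketch of a LoRA fine-tune of open_llama_7b with transformers.Trainer.
# Hyperparameters and dataset are placeholders, not the actual failing script.
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model

model_name = "openlm-research/open_llama_7b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Wrap the base model with LoRA adapters (illustrative config).
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=1,  # batch size 1, as described above
    save_steps=200,                 # the OOM happens at save time
    fp16=True,
)

# train_ds: hypothetical dataset pre-tokenized to max_seq_length = 256
trainer = Trainer(model=model, args=args, train_dataset=train_ds)
trainer.train()  # training runs at ~10 GB VRAM; OOM occurs at checkpointing
```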
As per https://github.com/openlm-research/open_llama/issues/1#issuecomment-1532311414, open-llama 7b has the same model size and architecture as Meta's llama-7b, so ideally both should behave the same. Why am I facing CUDA OOM only with open-llama?
If anyone can look into this and help me out, it would be much appreciated.