Open flyman3046 opened 1 year ago
Running into the same issue. Getting OOM after 7-10% while running on 4x A100-40GB.
Started at --micro_batch_size=24 and have been reducing it down to 8, and it still OOMs at around 10%.
Running with
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py \
--base_model="yahma/llama-13b-hf" \
--num_epochs=5 \
--cutoff_len=512 \
--data_path="dataset1.json" \
--output_dir='./alpaca-lora-saved-model-13b' \
--lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
--lora_r=16 \
--micro_batch_size=8
Any ideas?
Error is
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.44 GiB total capacity; 36.06 GiB already allocated; 19.88 MiB free; 37.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management
Should I play around with max_split_size_mb or should I look in another direction?
Dataset is around 60% larger than the latest alpaca_cleaned.
Another thing I noticed is that the ETA for micro_batch_size=24 is almost the same as for micro_batch_size=8.
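For reference, the hint about max_split_size_mb in the error message only applies when reserved memory is much larger than allocated memory. A minimal sketch that pulls the two figures out of the message quoted above and computes the gap; the function name `reserved_gap_gib` and the regular expressions are illustrative and not part of PyTorch:

```python
import re

def reserved_gap_gib(message: str) -> float:
    """Return reserved minus allocated memory (GiB) parsed from a CUDA OOM message."""
    allocated = float(re.search(r"([\d.]+) GiB already allocated", message).group(1))
    reserved = float(re.search(r"([\d.]+) GiB reserved", message).group(1))
    return reserved - allocated

# The message text is copied from the error reported in this thread.
msg = ("CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.44 GiB total "
       "capacity; 36.06 GiB already allocated; 19.88 MiB free; 37.97 GiB reserved "
       "in total by PyTorch)")
gap = reserved_gap_gib(msg)
print(f"reserved - allocated = {gap:.2f} GiB")
```

Here the gap is about 1.91 GiB out of 39.44 GiB, so almost all of the reserved memory is genuinely in use. That suggests fragmentation is probably not the main cause in this case, although it does not rule it out.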
I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:
1. Use a smaller `cutoff_len = 256`
2. Use a smaller `batch_size = 64`
Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.
The job used only 45% GPU memory before OOM.
Tried setting max_split_size_mb to 128 MB and 64 MB. Still didn't help; it errors out at 10%, when I think it is checkpointing or something.
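For anyone who wants to try the same setting: PyTorch reads allocator options from the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch, assuming the variable is set at the top of the training script rather than in the shell; the helper name `set_max_split_size` is hypothetical, and the call overwrites any allocator options that were already set:

```python
import os

def set_max_split_size(mb: int) -> str:
    """Set PyTorch's caching-allocator split size (in MB) for this process."""
    value = f"max_split_size_mb:{mb}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value

# Should run before torch initializes CUDA; the safest place is before `import torch`.
set_max_split_size(128)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

The same effect can be had by exporting the variable in the shell before launching torchrun.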
Yes
It errored out at 10% after doing a checkpoint? (I think)
Usually errors out when it reaches 200 iterations. @tloen What do you think?
I rented 8x RTX 3090 and am getting the same issue there. At 10%, or 200 iterations, it errors out.
Always on 200 iterations...
{'loss': 1.5662, 'learning_rate': 2.3999999999999997e-05, 'epoch': 0.02}
{'loss': 1.521, 'learning_rate': 5.399999999999999e-05, 'epoch': 0.04}
{'loss': 1.3948, 'learning_rate': 8.4e-05, 'epoch': 0.06}
{'loss': 1.1799, 'learning_rate': 0.00011399999999999999, 'epoch': 0.08}
{'loss': 1.079, 'learning_rate': 0.00014399999999999998, 'epoch': 0.1}
{'loss': 1.0344, 'learning_rate': 0.00017399999999999997, 'epoch': 0.12}
{'loss': 1.0017, 'learning_rate': 0.000204, 'epoch': 0.14}
{'loss': 0.9883, 'learning_rate': 0.000234, 'epoch': 0.16}
{'loss': 0.9856, 'learning_rate': 0.00026399999999999997, 'epoch': 0.18}
{'loss': 0.968, 'learning_rate': 0.000294, 'epoch': 0.2}
{'loss': 0.9682, 'learning_rate': 0.00029830747531734835, 'epoch': 0.22}
{'loss': 0.965, 'learning_rate': 0.0002961918194640338, 'epoch': 0.24}
{'loss': 0.9425, 'learning_rate': 0.0002940761636107193, 'epoch': 0.26}
{'loss': 0.9679, 'learning_rate': 0.00029196050775740477, 'epoch': 0.28}
{'loss': 0.9681, 'learning_rate': 0.0002898448519040903, 'epoch': 0.3}
{'loss': 0.9561, 'learning_rate': 0.0002877291960507757, 'epoch': 0.32}
{'loss': 0.95, 'learning_rate': 0.0002856135401974612, 'epoch': 0.34}
{'loss': 0.9364, 'learning_rate': 0.00028349788434414665, 'epoch': 0.36}
{'loss': 0.9579, 'learning_rate': 0.00028138222849083215, 'epoch': 0.38}
{'loss': 0.9366, 'learning_rate': 0.0002792665726375176, 'epoch': 0.39}
{'eval_loss': 0.9465365409851074, 'eval_runtime': 43.0107, 'eval_samples_per_second': 46.5, 'eval_steps_per_second': 0.744, 'epoch': 0.39}
13%|___________ | 200/1518 [34:55<3:48:14, 10.39s/it
I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to commits dated around April 5-6, when my previous finetunes were successful. I didn't change any parameters and everything worked.
It's definitely an issue with one of these dependencies; I need to pinpoint which one.
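One way to narrow this down is to record the installed versions of the four packages in a working environment and in a failing one, then compare the two lists. A minimal sketch using only the Python standard library; the helper name `installed_versions` and the `SUSPECTS` list are illustrative:

```python
from importlib.metadata import PackageNotFoundError, version

# The four dependencies named in the comment above.
SUSPECTS = ["accelerate", "peft", "bitsandbytes", "transformers"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

for name, ver in installed_versions(SUSPECTS).items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

Packages installed from a git commit may report the same version string for different commits, so this only distinguishes released versions.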
Thank you, fixing the version of bitsandbytes to 0.37.2 resolved the issue for me (https://github.com/TimDettmers/bitsandbytes/issues/324):
bitsandbytes==0.37.2
Thanks @SerCeMan. Setting bitsandbytes==0.37.2 works for me, so I closed the issue.
Hey, @flyman3046! It might be worth keeping the issue open so that others who are likely to face the OOM issues can see it.
@SerCeMan SG, re-opened it until the issue from bitsandbytes is fixed.
Yes, I hit an OOM when fine-tuning 13B on 2x RTX 3090 24GB. It seems to happen while saving model.state_dict. I solved it with pip install bitsandbytes==0.37.2 (my bitsandbytes version was 0.38.2 before).
I'm on 0.37.2 and it still occurs.
I can confirm, changing bitsandbytes to 0.37.2 does NOT solve the problem.
The problem still happened when I changed bitsandbytes to v0.37.2.
I agree; same problem here even with v0.37.2
I solved my problem by adding these variables to .bashrc:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Remember to run source .bashrc afterwards.
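To confirm that the two variables are actually visible to the training process, a quick check can be run in the same shell. This is a minimal sketch; the helper name `path_contains` is hypothetical, and the directories are the ones from the exports above:

```python
import os

def path_contains(var_name: str, directory: str) -> bool:
    """Check whether `directory` is an entry of an os.pathsep-separated env variable."""
    entries = os.environ.get(var_name, "").split(os.pathsep)
    # Ignore trailing slashes so "/x/lib64/" and "/x/lib64" compare equal.
    return directory.rstrip("/") in (entry.rstrip("/") for entry in entries)

for var, directory in [("PATH", "/usr/local/cuda/bin"),
                       ("LD_LIBRARY_PATH", "/usr/local/cuda/lib64")]:
    status = "ok" if path_contains(var, directory) else "missing"
    print(f"{var}: {directory} -> {status}")
```

If either line prints "missing", the exports were probably not picked up by the current shell session.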
Changing bitsandbytes==0.37.2 fixed the problem for me. I had bitsandbytes==0.39.0 earlier.
I was able to fine-tune the 7B model with one A100-40G GPU but ran into OOM when fine-tuning 13B.
Here is the error message:
My command is:
I tried a couple of times, e.g., with a smaller `cutoff_len`, but still got the same OOM error. One thing I noticed is that the issue happened after training ~10% of the steps. Any thoughts or help would be greatly appreciated.