Cuda OOM when fine-tuning 13B

flyman3046 commented 1 year ago

I was able to fine-tune 7B model with one A100-40G GPU but ran into OOM when fine-tuning 13B.

Here is the error message:

│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/nn/modules.py:268 in _save_to_state_dict     │
│                                                                                                  │
│   265 │   │                                                                                      │
│   266 │   │   try:                                                                               │
│   267 │   │   │   if reorder_layout:                                                             │
│ ❱ 268 │   │   │   │   self.weight.data = undo_layout(self.state.CxB, self.state.tile_indices)    │
│   269 │   │   │                                                                                  │
│   270 │   │   │   super()._save_to_state_dict(destination, prefix, keep_vars)                    │
│   271                                                                                            │
│                                                                                                  │
│ /opt/conda/lib/python3.7/site-packages/bitsandbytes/autograd/_functions.py:100 in undo_layout    │
│                                                                                                  │
│    97 │   outputs[tile_indices.flatten()] = tensor                                               │
│    98 │   outputs = outputs.reshape(tile_rows, tile_cols, cols // tile_cols, rows // tile_rows   │
│    99 │   outputs = outputs.permute(3, 0, 2, 1)  # (rows // tile_rows, tile_rows), (cols // ti   │
│ ❱ 100 │   return outputs.reshape(rows, cols).contiguous()                                        │
│   101                                                                                            │
│   102                                                                                            │
│   103 class MatMul8bit(torch.autograd.Function):                                                 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.41 GiB total capacity; 35.83 GiB already allocated; 34.50 MiB free; 38.17 GiB reserved in
total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and 
PYTORCH_CUDA_ALLOC_CONF

My command is:

python finetune.py \
    --base_model='decapoda-research/llama-13b-hf' \
    --num_epochs=5 \
    --cutoff_len=512 \
    --group_by_length \
    --output_dir='./alpaca-lora-saved-model-13b' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8

I tried a couple of times, e.g., with a smaller cutoff_len but still got the same OOM error. One thing I noticed is that the issue happened after training ~10% steps. Any thoughts or help is greatly appreciated.

lksysML commented 1 year ago

Running into the same issue. Getting OOM after 7-10% while running on 4x A100-40GB.

Started at --micro_batch_size=24 and have been reducing it till 8 and it still OOMs at around 10%

Running with

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 torchrun finetune.py \
    --base_model="yahma/llama-13b-hf" \
    --num_epochs=5 \
    --cutoff_len=512 \ 
    --data_path="dataset1.json"
    --output_dir='./alpaca-lora-saved-model-13b' \
    --lora_target_modules='[q_proj,k_proj,v_proj,o_proj]' \
    --lora_r=16 \
    --micro_batch_size=8

Any ideas?

Error is

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 68.00 MiB (GPU 0; 39.44 GiB total capacity; 36.06 GiB already allocated; 19.88 MiB free; 37.97 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management

Should I play around with max_split_size_mb or should I look in another direction?

Dataset is around 60% larger than the latest alpaca_cleaned.

Another thing I noticed is the ETA for micro_batch_size=24 is is almost the same as for micro_batch_size=8.

flyman3046 commented 1 year ago

I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:

Use a smaller cutoff_len = 256
Use a smaller batch_size = 64

Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.

The job used only 45% GPU memory before OOM.

lksysML commented 1 year ago

Tried setting max_split_size_mb to 128mb and 64mb. Still didn't help, errors out at 10% when I think it is checkpointing or something

lksysML commented 1 year ago

Yes

I retrained 7b without any issues. For 13B, I tried a couple of things but to no avail:
1. Use a smaller `cutoff_len = 256`

2. Use a smaller `batch_size = 64`
Here is the wandb link to a job: https://wandb.ai/zhengfei-hit/huggingface/runs/6mt3ilc0/overview?workspace=user-zhengfei-hit.

The job used only 45% GPU memory before OOM.

It errored out at 10% after doing a checkpoint? (I think)

lksysML commented 1 year ago

Usually errors out when it reaches 200 iterations. @tloen What do you think?

I rented 8x RTX 3090 and getting same issue there. At 10% or 200 iterations it errors out

Always on 200 iterations...

{'loss': 1.5662, 'learning_rate': 2.3999999999999997e-05, 'epoch': 0.02}
{'loss': 1.521, 'learning_rate': 5.399999999999999e-05, 'epoch': 0.04}
{'loss': 1.3948, 'learning_rate': 8.4e-05, 'epoch': 0.06}
{'loss': 1.1799, 'learning_rate': 0.00011399999999999999, 'epoch': 0.08}
{'loss': 1.079, 'learning_rate': 0.00014399999999999998, 'epoch': 0.1}
{'loss': 1.0344, 'learning_rate': 0.00017399999999999997, 'epoch': 0.12}
{'loss': 1.0017, 'learning_rate': 0.000204, 'epoch': 0.14}
{'loss': 0.9883, 'learning_rate': 0.000234, 'epoch': 0.16}
{'loss': 0.9856, 'learning_rate': 0.00026399999999999997, 'epoch': 0.18}
{'loss': 0.968, 'learning_rate': 0.000294, 'epoch': 0.2}
{'loss': 0.9682, 'learning_rate': 0.00029830747531734835, 'epoch': 0.22}
{'loss': 0.965, 'learning_rate': 0.0002961918194640338, 'epoch': 0.24}
{'loss': 0.9425, 'learning_rate': 0.0002940761636107193, 'epoch': 0.26}
{'loss': 0.9679, 'learning_rate': 0.00029196050775740477, 'epoch': 0.28}
{'loss': 0.9681, 'learning_rate': 0.0002898448519040903, 'epoch': 0.3}
{'loss': 0.9561, 'learning_rate': 0.0002877291960507757, 'epoch': 0.32}
{'loss': 0.95, 'learning_rate': 0.0002856135401974612, 'epoch': 0.34}
{'loss': 0.9364, 'learning_rate': 0.00028349788434414665, 'epoch': 0.36}
{'loss': 0.9579, 'learning_rate': 0.00028138222849083215, 'epoch': 0.38}
{'loss': 0.9366, 'learning_rate': 0.0002792665726375176, 'epoch': 0.39}
{'eval_loss': 0.9465365409851074, 'eval_runtime': 43.0107, 'eval_samples_per_second': 46.5, 'eval_steps_per_second': 0.744, 'epoch': 0.39}
 13%|___________                                                                   | 200/1518 [34:55<3:48:14, 10.39s/it

lksysML commented 1 year ago

I was able to fix this issue by rolling back accelerate, peft, bitsandbytes and transformers to a commit dated around 5-6 april when my previous finetunes were successful. Didn't change any parameters and everything worked.

It's definitely an issue with one of these dependencies, need to pin point which one.

SerCeMan commented 1 year ago

Thank you, fixing the version of bitsandbytes to 0.37.2 resolved the issue for me. (https://github.com/TimDettmers/bitsandbytes/issues/324)

bitsandbytes==0.37.2

flyman3046 commented 1 year ago

Thanks @SerCeMan. Setting bitsandbytes==0.37.2 works for me. So closed it.

SerCeMan commented 1 year ago

Hey, @flyman3046! It might be worth keeping the issue open so that others who are likely to face the OOM issues can see it.

flyman3046 commented 1 year ago

@SerCeMan SG, re-opened it until the issue from bitsandbytes is fixed.

guihonghao commented 1 year ago

Thank you, fixing the version of bitsandbytes to 0.37.2 resolved the issue for me. (TimDettmers/bitsandbytes#324)
bitsandbytes==0.37.2

Yes, I meet an OOM when fine-tuning 13B on 2 * 3090 24GB. It seems happening while saving model.state_dict. And i solved it by pip bitsandbytes==0.37.2(My bitsandbytes version is 0.38.2, before)