haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] Out of memory for single-A100 LLaVA 1.5 w/ QLoRA and cpu-offload #697

Open HireTheHero opened 10 months ago

HireTheHero commented 10 months ago

Describe the issue

Issue:

After a few dozen batches, an out-of-memory error occurs when I try to fine-tune LLaVA 1.5 on a single A100 with QLoRA and CPU offloading.

Command:

MODEL_DIR="<path-to-model-dir>"
wget -P $MODEL_DIR \
    https://huggingface.co/liuhaotian/llava-v1.5-mlp2x-336px-pretrain-vicuna-13b-v1.5/resolve/main/mm_projector.bin

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --lora_enable True \
    --bits 4 \
    --model_name_or_path lmsys/vicuna-13b-v1.5 \
    --version v1 \
    --data_path ./playground/data/llava_v1_5_mix665k.json \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter $MODEL_DIR/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-13b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb

Log:

{'loss': 0.891, 'learning_rate': 1.807714194192979e-05, 'epoch': 0.02}

  2%|▏         | 96/5197 [5:05:34<276:37:06, 195.22s/it]Traceback (most recent call last):
  File "/<LLaVA-path>/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/<LLaVA-path>/llava/train/train.py", line 1049, in train
    trainer.train()
  File "/<conda-path>/lib/python3.11/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/transformers/trainer.py", line 2665, in training_step
    self.accelerator.backward(loss)
  File "/<conda-path>/lib/python3.11/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/<conda-path>/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/<conda-path>/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 1861, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/<conda-path>/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 1993, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/<conda-path>/lib/python3.11/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/<conda-path>/lib/python3.11/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/<conda-path>/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/<conda-path>/lib/python3.11/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/torch/utils/checkpoint.py", line 157, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/<conda-path>/lib/python3.11/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/<conda-path>/lib/python3.11/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
           ^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/torch/cuda/amp/autocast_mode.py", line 123, in decorate_bwd
    return bwd(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^
  File "/<conda-path>/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py", line 84, in backward
    grad_input = grad_output.matmul(weight)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.08 GiB (GPU 0; 44.40 GiB total capacity; 33.31 GiB already allocated; 1.00 GiB free; 42.75 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
wandb: Waiting for W&B process to finish... (failed 1). Press Control-C to abort syncing.
wandb: - 0.008 MB of 0.008 MB uploaded (0.000 MB deduped)
wandb: \ 0.008 MB of 0.027 MB uploaded (0.000 MB deduped)
wandb: | 0.008 MB of 0.039 MB uploaded (0.000 MB deduped)
wandb: / 0.039 MB of 0.039 MB uploaded (0.000 MB deduped)
wandb: 
wandb: Run history:
wandb:         train/epoch ▁▁▁▁▁▁▁▁▁▁▁▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅▅████████
wandb:   train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb: train/learning_rate ▁▃▃▄▅▅▅▅▆▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇████████████
wandb:          train/loss ▆▆█▆▇▅▆▆█▇▆▅▅▄▅▅▆▆▄▅▅▃▄▄▄▃▅▅▃▂▄▄▁▃▃▄▄▃▂▂
wandb: 
wandb: Run summary:
wandb:         train/epoch 0.02
wandb:   train/global_step 96
wandb: train/learning_rate 2e-05
wandb:          train/loss 0.891
wandb: 
wandb: 🚀 View run avid-terrain-7 at: https://wandb.ai/hire-the-hero/huggingface/runs/m2s53en7
wandb: ️⚡ View job at https://wandb.ai/hire-the-hero/huggingface/jobs/QXJ0aWZhY3RDb2xsZWN0aW9uOjEwOTg4MTM2NA==/version_details/v1
wandb: Synced 6 W&B file(s), 0 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Find logs at: ./wandb/run-20231028_231613-m2s53en7/logs
[2023-10-29 04:24:06,289] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 1508190
[2023-10-29 04:24:06,290] [ERROR] [launch.py:321:sigkill_handler] ['/<conda-path>/bin/python', '-u', 'llava/train/train_mem.py', '--local_rank=0', '--deepspeed', './scripts/zero3_offload.json', '--lora_enable', 'True', '--bits', '4', '--model_name_or_path', 'lmsys/vicuna-13b-v1.5', '--version', 'v1', '--data_path', './playground/data/llava_v1_5_mix665k.json', '--image_folder', './playground/data', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--pretrain_mm_mlp_adapter', '<path-to-model-dir>/mm_projector.bin', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', './checkpoints/llava-v1.5-13b', '--num_train_epochs', '1', '--per_device_train_batch_size', '16', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-5', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = 1

Screenshots:

N/A
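The allocator hint at the end of the OOM message above refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. A minimal sketch of launching with that setting before retrying; the 128 MiB split size is an illustrative value and not taken from this report:

# Reduce fragmentation in the CUDA caching allocator (value is an assumption, tune as needed)
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Then relaunch the same training command as above
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    ...

Note that this only mitigates fragmentation; if the run is genuinely over capacity, the batch-size suggestion below is the more direct fix.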

Williamsunsir commented 10 months ago

Have you solved this problem?

HireTheHero commented 10 months ago

Are you asking whether I've solved this problem? No, I haven't. I also tried 4x V100, but that doesn't work either.

simon-lund commented 6 months ago

It looks like you are training on an A100 40GB? If that's the case, you need to reduce per_device_train_batch_size:

--per_device_train_batch_size 8 \
--gradient_accumulation_steps 16 \

To keep the global batch size at 128, you will have to update gradient_accumulation_steps as well:

GLOBAL_BATCH_SIZE = NUM_GPUS * PER_DEVICE_TRAIN_BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS
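A quick check of the arithmetic with the values from the original command and from the suggestion above, assuming a single GPU:

# original command:   1 GPU * 16 per-device batch * 8  accumulation steps = 128
# suggested change:   1 GPU * 8  per-device batch * 16 accumulation steps = 128

So the effective batch size stays at 128 while each individual forward/backward pass uses half as much activation memory.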