haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

DeepSpeed AssertionError after training completes, while saving checkpoints #1583

Open subiaansari opened 5 months ago

subiaansari commented 5 months ago

Describe the issue

Issue: This error occurs only when the number of training epochs is >= 4 and the fine-tuning dataset contains >= 41 image samples.
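
For context, the frames in the traceback below show the crash coming from LLaVA's LoRA checkpoint-saving path (get_peft_state_non_lora_maybe_zero_3 -> maybe_zero_3): each ZeRO-3-partitioned parameter is gathered, copied to CPU, and re-partitioned when the GatheredParameters context exits, and the assertion fires during that re-partition because the parameter still has an active sub-module registered. A rough sketch of that path, reconstructed from the traceback rather than copied verbatim from the repository:

from deepspeed import zero

def maybe_zero_3(param, ignore_status=False):
    # Under ZeRO-3 each parameter lives as a partitioned shard on every rank.
    # Gather the full tensor, copy it to CPU, then let DeepSpeed re-partition it
    # when the context manager exits (that __exit__ is where the assert fires).
    if hasattr(param, "ds_id"):
        with zero.GatheredParameters([param]):
            param = param.data.detach().cpu().clone()
    else:
        param = param.detach().cpu().clone()
    return param

def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
    # Collect everything that is not a LoRA weight (e.g. the mm_projector)
    # and materialize it on CPU so it can be saved alongside the adapter.
    to_return = {k: t for k, t in named_params if "lora_" not in k}
    if require_grad_only:
        to_return = {k: t for k, t in to_return.items() if t.requires_grad}
    return {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}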

Command:

deepspeed llava/train/train_mem.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path liuhaotian/llava-v1.5-7b \
    --version v1 \
    --data_path ./data/train_size_chart_ocr_40.json \
    --image_folder ./data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir llava-lora-outputs \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to wandb
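
As a quick sanity check on the batch configuration above (assuming a world size of 4, matching the four launcher subprocesses killed at the end of the log), one optimizer step consumes more samples than the whole dataset, so each epoch contributes only a handful of optimizer steps, which lines up with train/global_step = 5 in the wandb summary below:

# Back-of-the-envelope check of the batch configuration above.
# world_size = 4 is an assumption, taken from the four launcher subprocesses in the log.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
world_size = 4
num_train_samples = 41   # size of the fine-tuning dataset mentioned in the issue

effective_batch = per_device_train_batch_size * gradient_accumulation_steps * world_size
print(effective_batch)   # 128: a single optimizer step spans more samples than the dataset holds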

Log:

{'loss': 0.2064, 'learning_rate': 0.0002, 'epoch': 1.0}                                                                                                               
{'loss': 0.2073, 'learning_rate': 0.00017071067811865476, 'epoch': 2.0}                                                                                               
{'loss': 0.1385, 'learning_rate': 0.0001, 'epoch': 2.67}                                                                                                              
{'loss': 0.0457, 'learning_rate': 2.9289321881345254e-05, 'epoch': 3.0}                                                                                               
{'loss': 0.1384, 'learning_rate': 0.0, 'epoch': 4.0}                                                                                                                  
{'train_runtime': 240.1224, 'train_samples_per_second': 0.833, 'train_steps_per_second': 0.021, 'train_loss': 0.1472586028277874, 'epoch': 4.0}                       
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [01:39<00:00, 19.82s/it]
(identical traceback raised on each of the four ranks)
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/size-chart-fine-tune-llava-1.5-7b/LLaVA/llava/train/train_mem.py", line 4, in <module>
    train(attn_implementation="flash_attention_2")
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/size-chart-fine-tune-llava-1.5-7b/LLaVA/llava/train/train.py", line 978, in train
    non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/size-chart-fine-tune-llava-1.5-7b/LLaVA/llava/train/train.py", line 159, in get_peft_state_non_lora_maybe_zero_3
    to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/size-chart-fine-tune-llava-1.5-7b/LLaVA/llava/train/train.py", line 159, in <dictcomp>
    to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/size-chart-fine-tune-llava-1.5-7b/LLaVA/llava/train/train.py", line 122, in maybe_zero_3
    with zero.GatheredParameters([param]):
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2129, in __exit__
    self.params[0].partition(param_list=self.params, has_been_updated=False)
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1275, in partition
    self._partition(param_list, has_been_updated=has_been_updated)
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1424, in _partition
    self._partition_param(param, has_been_updated=has_been_updated)
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1457, in _partition_param
    free_param(param)
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 285, in free_param
    assert not param.ds_active_sub_modules, param.ds_summary()
AssertionError: {'id': 290, 'status': 'AVAILABLE', 'numel': 4194304, 'ds_numel': 4194304, 'shape': (4096, 1024), 'ds_shape': (4096, 1024), 'requires_grad': True, 'grad_shape': None, 'persist': False, 'active_sub_modules': {2770}, 'ds_tensor.shape': torch.Size([1048576])}
wandb: 
wandb: Run history:
wandb:                    train/epoch ▁▃▅▆██
wandb:              train/global_step ▁▃▅▆██
wandb:            train/learning_rate █▇▅▂▁
wandb:                     train/loss ██▅▁▅
wandb:               train/total_flos ▁
wandb:               train/train_loss ▁
wandb:            train/train_runtime ▁
wandb: train/train_samples_per_second ▁
wandb:   train/train_steps_per_second ▁
wandb: 
wandb: Run summary:
wandb:                    train/epoch 4.0
wandb:              train/global_step 5
wandb:            train/learning_rate 0.0
wandb:                     train/loss 0.1384
wandb:               train/total_flos 3148619579392.0
wandb:               train/train_loss 0.14726
wandb:            train/train_runtime 240.1224
wandb: train/train_samples_per_second 0.833
wandb:   train/train_steps_per_second 0.021
wandb: 
wandb: You can sync this run to the cloud by running:
wandb: wandb sync /home/ec2-user/SageMaker/de_expansion/llava_dry_run/size-chart-fine-tune-llava-1.5-7b/LLaVA/wandb/offline-run-20240630_205842-d744f7k2
wandb: Find logs at: ./wandb/offline-run-20240630_205842-d744f7k2/logs
wandb: WARNING The new W&B backend becomes opt-out in version 0.18.0; try it out with `wandb.require("core")`! See https://wandb.me/wandb-core for more information.
[2024-06-30 21:00:32,528] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 26200
[2024-06-30 21:00:32,937] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 26201
[2024-06-30 21:00:32,945] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 26202
[2024-06-30 21:00:32,945] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 26203
[2024-06-30 21:00:32,953] [ERROR] [launch.py:321:sigkill_handler] ['/home/ec2-user/SageMaker/de_expansion/llava_dry_run/amazon-sagemaker-finetune-/llava_env/bin/python3', '-u', 'llava/train/train_mem.py', '--local_rank=3', '--lora_enable', 'True', '--lora_r', '128', '--lora_alpha', '256', '--mm_projector_lr', '2e-5', '--deepspeed', './scripts/zero3.json', '--model_name_or_path', 'liuhaotian/llava-v1.5-7b', '--version', 'v1', '--data_path', './data/train_size_chart_ocr_40.json', '--image_folder', './data', '--vision_tower', 'openai/clip-vit-large-patch14-336', '--mm_projector_type', 'mlp2x_gelu', '--mm_vision_select_layer', '-2', '--mm_use_im_start_end', 'False', '--mm_use_im_patch_token', 'False', '--image_aspect_ratio', 'pad', '--group_by_modality_length', 'True', '--bf16', 'True', '--output_dir', 'llava-lora-outputs', '--num_train_epochs', '5', '--per_device_train_batch_size', '4', '--per_device_eval_batch_size', '4', '--gradient_accumulation_steps', '8', '--evaluation_strategy', 'no', '--save_strategy', 'steps', '--save_steps', '50000', '--save_total_limit', '1', '--learning_rate', '2e-4', '--weight_decay', '0.', '--warmup_ratio', '0.03', '--lr_scheduler_type', 'cosine', '--logging_steps', '1', '--tf32', 'True', '--model_max_length', '2048', '--gradient_checkpointing', 'True', '--dataloader_num_workers', '4', '--lazy_preprocess', 'True', '--report_to', 'wandb'] exits with return code = 1
takezoe929 commented 2 weeks ago

Have you solved it yet? I am running into the same problem.