haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0
19.29k stars 2.12k forks

[Usage] Using DeepSpeed pretraining errors #194

Open aprilehannibal opened 1 year ago

aprilehannibal commented 1 year ago

When did you clone our code?

I cloned the code base after 5/1/23

Describe the issue

Issue: When I use DeepSpeed ZeRO-3 to pretrain LLaVA-13B on 4 A100 (40G) GPUs, I get the error shown below. It seems that when the model is partitioned, the CLIP parameters are changed. When I use ZeRO-2, the pretraining stage runs successfully. Because I need to pretrain the 13B model on smaller GPUs, e.g. 16 A10 (24G), I have to use ZeRO-3. @haotian-liu

(screenshot of the error)

Command:

torchrun --nnodes=1 --nproc_per_node=4 --master_port=25001 \
    llava/train/train_mem.py \
    --model_name_or_path path/to/llava_13b \
    --data_path /path/to/LLaVA/LLaVA-CC3M-Pretrain-595K/chat.json \
    --image_folder /path/to/LLaVA-CC3M-Pretrain-595K/cc3m_595k_images \
    --vision_tower ./openai/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --output_dir ./checkpoints/llava-13b-pretrain-deepspeed3 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --bf16 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --deepspeed ds_config_stage3.json

Screenshots:

(error screenshot)
Tomato1101 commented 1 year ago

I encountered the same error as you did when using DeepSpeed ZeRO-3. I have traced the underlying issue to torch.nn.Conv2d returning an empty tensor when initializing the vision model. Interestingly, torch.nn.Conv2d works fine when creating a standalone convolutional layer and returns a tensor with the correct shape.

You can observe self.patch_embedding weights in your_path/transformers/models/clip/modeling_clip.py to potentially identify the error.
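
For example, a minimal sketch of such a check, assuming the vision tower is created under DeepSpeed ZeRO-3 and using the Hugging Face CLIPVisionModel attribute names (under ZeRO-3, partitioned parameters show an empty local tensor unless they are gathered):

import deepspeed
from transformers import CLIPVisionModel

# Build the vision tower; under ZeRO-3 (e.g. inside the training run) its
# parameters are partitioned across ranks.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
patch_embedding = vision_tower.vision_model.embeddings.patch_embedding

# Outside a gathering context, a ZeRO-3 partitioned weight can show shape [0].
print("local shape:", patch_embedding.weight.shape)

# Gather the full parameter for reading to confirm the real weights exist.
with deepspeed.zero.GatheredParameters(patch_embedding.weight, modifier_rank=None):
    print("gathered shape:", patch_embedding.weight.shape)  # e.g. [1024, 3, 14, 14] for ViT-L/14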

python==3.10 torch==1.13.1 transformers==4.28.0.dev0

(screenshot)
haotian-liu commented 1 year ago

@Tomato1101 Can you share your deepspeed.json and your command as well? I'll try investigating this issue this week so would like to gather some sample scripts. Thanks.

Tomato1101 commented 1 year ago

I ran the code on RTX 3090 GPUs. This is my DeepSpeed configuration, which follows the official tutorial.

{
    "train_micro_batch_size_per_gpu": "auto",
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

and my launch command:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --num_nodes 1 --num_gpus 8 \
    llava/train/train_mem.py \
    --deepspeed deepspeed_config_stage3.json \
    --model_name_or_path path_to/models/vicuna/vicuna-7b \
    --version v0 \
    --data_path path_to/train_data/CC_3M_Concept_balanced_595K/chat.json \
    --image_folder path_to/train_data/CC_3M_Concept_balanced_595K/images \
    --vision_tower path_to/models/clip-vit-large-patch14 \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end \
    --bf16 True \
    --output_dir path_to/checkpoints/output/LLaVA-7B-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2400 \
    --save_total_limit 1 \
    --learning_rate 2e-3 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
aprilehannibal commented 1 year ago

You can observe self.patch_embedding weights in your_path/transformers/models/clip/modeling_clip.py to potentially identify the error.

@Tomato1101 Have you solved this error now?

Tomato1101 commented 1 year ago

You can observe self.patch_embedding weights in your_path/transformers/models/clip/modeling_clip.py to potentially identify the error.

@Tomato1101 Have you solved this error now?

Sorry, not yet. I have successfully run LLaVA on A100 without using DeepSpeed.

aprilehannibal commented 1 year ago

{ "train_micro_batch_size_per_gpu": "auto", "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } }

@Tomato1101 @haotian-liu When I enable bf16 in the ds_config, LLaVA runs successfully on 4 A100 (40G) or 16 A10 (24G) GPUs with DeepSpeed ZeRO-3. My ds_config is shown below.

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5,
  "wall_clock_breakdown": false
}
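
Since this fix relies on bf16, here is a minimal sketch for checking whether the current GPUs support it natively; bf16 needs Ampere-class hardware such as A100 or A10 (compute capability 8.0 or higher):

import torch

# A100/A10 report (8, 0) or higher; V100 reports (7, 0) and lacks native bf16.
print(torch.cuda.get_device_capability())
print(torch.cuda.is_bf16_supported())
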
Tomato1101 commented 1 year ago

{ "train_micro_batch_size_per_gpu": "auto", "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } }

@Tomato1101 @haotian-liu When I add bf16 enable in de ds_config,llava run successfully on 4 A100(40G)or 16 A10(24G)with Deepspeed Zero3. My ds_config shows below.

{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5,
  "wall_clock_breakdown": false
}

Great, it works for me too.

XipengY commented 1 year ago

@Tomato1101 @aprilehannibal What's the difference between starting training using "torchrun" and "deepspeed"?

aprilehannibal commented 1 year ago

@XipengY I think there is no difference between them.

XipengY commented 1 year ago

@XipengY I think there is no difference between them.

@aprilehannibal Thanks, I got it.

@haotian-liu @Tomato1101 @aprilehannibal @abdul Also, have you solved the above error (torch.nn.Conv2d returning an empty tensor) without bf16? For example, the V100 cannot support bf16.

aprilehannibal commented 1 year ago

@XipengY Maybe use fp16, or train on more V100 GPUs (like more than 16) with fp32.

XipengY commented 1 year ago

@XipengY Maybe use fp16, or train on more V100 GPUs (like more than 16) with fp32.

@aprilehannibal Thanks for your guidance, I'll try later.

uniquehou commented 6 months ago

You can observe self.patch_embedding weights in your_path/transformers/models/clip/modeling_clip.py to potentially identify the error.

@Tomato1101 Have you solved this error now?

Sorry, not yet. I have successfully run LLaVA on A100 without using DeepSpeed.

How do you train without DeepSpeed? Can you share your training scripts and configuration? Thanks.

SharlotAway commented 5 months ago

@XipengY Did you solve the above error of torch.nn.Conv2d returning an empty tensor without bf16? I ran ZeRO-3 on 4×V100 with fp16, and the vision model's tensors are empty. ZeRO-2 seems to work, but I got an OOM problem.