aprilehannibal opened 1 year ago
I encountered the same error when using DeepSpeed ZeRO-3. I traced the underlying issue to torch.nn.Conv2d returning an empty tensor when initializing the vision model. Interestingly, torch.nn.Conv2d works fine when creating a standalone convolutional layer and returns a tensor with the right shape. You can inspect the self.patch_embedding weights in your_path/transformers/models/clip/modeling_clip.py to help identify the error.
python==3.10, torch==1.13.1, transformers==4.28.0.dev0
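For context, a minimal sketch of what seems to be happening (assuming deepspeed is installed and the script runs under a distributed launcher so a process group exists): under ZeRO-3, parameters are partitioned across ranks at construction time via deepspeed.zero.Init, so a freshly built Conv2d reports an empty weight tensor even though nothing is actually lost. The shapes below assume CLIP ViT-L/14's patch embedding:

import torch
import deepspeed

# Built normally: the weight has the expected shape.
conv = torch.nn.Conv2d(3, 1024, kernel_size=14, stride=14, bias=False)
print(conv.weight.shape)  # torch.Size([1024, 3, 14, 14])

# Built under ZeRO-3's init context: the parameter is partitioned
# immediately, so it appears "empty" until it is gathered.
with deepspeed.zero.Init():
    conv = torch.nn.Conv2d(3, 1024, kernel_size=14, stride=14, bias=False)
print(conv.weight.shape)  # torch.Size([0]) -- partitioned, not missing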
@Tomato1101 Can you share your deepspeed.json and your command as well? I'll try investigating this issue this week, so I'd like to gather some sample scripts. Thanks.
I run the code on RTX 3090 GPUs. This is my DeepSpeed configuration, which follows the official tutorial.
{
  "train_micro_batch_size_per_gpu": "auto",
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
And my launch command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 deepspeed --num_nodes 1 --num_gpus 8 \
llava/train/train_mem.py \
--deepspeed deepspeed_config_stage3.json \
--model_name_or_path path_to/models/vicuna/vicuna-7b \
--version v0 \
--data_path path_to/train_data/CC_3M_Concept_balanced_595K/chat.json \
--image_folder path_to/train_data/CC_3M_Concept_balanced_595K/images \
--vision_tower path_to/models/clip-vit-large-patch14 \
--tune_mm_mlp_adapter True \
--mm_vision_select_layer -2 \
--mm_use_im_start_end \
--bf16 True \
--output_dir path_to/checkpoints/output/LLaVA-7B-pretrain \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 2400 \
--save_total_limit 1 \
--learning_rate 2e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--lazy_preprocess True
@Tomato1101 Have you solved this error yet?
Sorry, not yet. I have successfully run LLaVA on A100 without using DeepSpeed.
{ "train_micro_batch_size_per_gpu": "auto", "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } }
@Tomato1101 @haotian-liu When I enable bf16 in the ds_config, LLaVA runs successfully on 4x A100 (40G) or 16x A10 (24G) with DeepSpeed ZeRO-3. My ds_config is shown below.
{
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "steps_per_print": 1e5,
  "wall_clock_breakdown": false
}
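One caveat before copying this config: bf16 needs hardware support (Ampere or newer, e.g. A100, A10, RTX 30xx). A quick check, assuming PyTorch 1.12+:

import torch

# False on pre-Ampere GPUs such as the V100; in that case use the
# "fp16" section of the config instead of "bf16": {"enabled": true}.
print(torch.cuda.is_bf16_supported())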
{ "train_micro_batch_size_per_gpu": "auto", "zero_optimization": { "stage": 3, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "stage3_gather_16bit_weights_on_model_save": true } }
@Tomato1101 @haotian-liu When I add bf16 enable in de ds_config,llava run successfully on 4 A100(40G)or 16 A10(24G)with Deepspeed Zero3. My ds_config shows below.
{ "fp16": { "enabled": "auto", "loss_scale": 0, "loss_scale_window": 1000, "initial_scale_power": 16, "hysteresis": 2, "min_loss_scale": 1 }, "bf16": { "enabled": true }, "optimizer": { "type": "AdamW", "params": { "lr": "auto", "betas": "auto", "eps": "auto", "weight_decay": "auto" } }, "zero_optimization": { "stage": 3, "offload_optimizer": { "device": "cpu", "pin_memory": true }, "offload_param": { "device": "cpu", "pin_memory": true }, "overlap_comm": true, "contiguous_gradients": true, "sub_group_size": 1e9, "reduce_bucket_size": "auto", "stage3_prefetch_bucket_size": "auto", "stage3_param_persistence_threshold": "auto", "stage3_max_live_parameters": 1e9, "stage3_max_reuse_distance": 1e9, "gather_16bit_weights_on_model_save": true }, "gradient_accumulation_steps": "auto", "gradient_clipping": "auto", "train_batch_size": "auto", "train_micro_batch_size_per_gpu": "auto", "steps_per_print": 1e5, "wall_clock_breakdown": false }
Great, it works for me too.
@Tomato1101 @aprilehannibal What's the difference between starting training using "torchrun" and "deepspeed"?
@XipengY I think there is no difference between them; both launchers start one worker process per GPU and the training script picks up the same DeepSpeed config either way (see the sketch below).
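For what it's worth, a sketch of why the launchers are interchangeable here: both torchrun and the deepspeed launcher spawn one worker process per GPU and populate the same torch.distributed environment, which is all train_mem.py consumes. Printing it inside a worker launched either way should show equivalent values:

import os

# Both launchers set these for every worker process; the training code
# reads the distributed setup from here, not from the launcher itself.
for key in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(key, "=", os.environ.get(key))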
@aprilehannibal Thanks, I got it.
@haotian-liu @Tomato1101 @aprilehannibal @abdul Did you manage to solve the above error (torch.nn.Conv2d returning an empty tensor) without bf16? For example, the V100 does not support bf16.
@XipengY Maybe use fp16 (set "fp16": {"enabled": true} and "bf16": {"enabled": false} in the ds_config), or train on more V100 GPUs (more than 16) with fp32.
@aprilehannibal Thanks for your guidance, I'll try later.
@Tomato1101 How do you train without DeepSpeed? Can you share your training scripts and configuration? Thanks.
@XipengY Did you solve the above error (torch.nn.Conv2d returning an empty tensor) without bf16? I ran ZeRO-3 on 4x V100 with fp16, and the vision model's tensors are empty. ZeRO-2 seems to work, but I run into OOM problems.
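If it helps with debugging the V100 setup: under ZeRO-3 the "empty" tensors are partitioned, not lost, and you can gather them temporarily to confirm the weights are intact. A hedged sketch, assuming `model` is the transformers CLIP vision model (attribute path as in modeling_clip.py):

import deepspeed

patch_embedding = model.vision_model.embeddings.patch_embedding
with deepspeed.zero.GatheredParameters(patch_embedding.weight):
    # The full shape is only visible inside the gather context.
    print(patch_embedding.weight.shape)  # e.g. torch.Size([1024, 3, 14, 14])
print(patch_embedding.weight.shape)  # torch.Size([0]) again once released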
When did you clone our code?
I cloned the codebase after 5/1/23.
Describe the issue
Issue: When I use DeepSpeed ZeRO-3 to pretrain LLaVA-13B on 4x A100 (40G), I get an error: it seems that under model parallelism, the CLIP parameters change. When I use ZeRO-2, the pretraining stage runs successfully. Because I need to pretrain the 13B model on smaller GPUs like 16x A10 (24G), I have to use ZeRO-3. @haotian-liu