Open DietDietDiet opened 1 year ago
Issue: I am fine-tuning LLaVA-1.5-7B on 8 × A100 40G and adjusted the per-device batch size and gradient accumulation steps accordingly. The estimated training time is approx. 24h, which seems too long. What could be wrong?
Env: CUDA 11.7, torch 2.0.1, flash-attn 2.3.2
Command:
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --model_name_or_path lmsys/vicuna-7b-v1.5 \
    --version v1 \
    --data_path data/llava_v1_5_mix665k.json.bak \
    --image_folder ./playground/data \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --pretrain_mm_mlp_adapter projectors/llava-v1.5-mlp2x-336px-pretrain-vicuna-7b-v1.5/mm_projector.bin \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --image_aspect_ratio pad \
    --group_by_modality_length True \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-7b \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 4 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 50000 \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 8 \
    --lazy_preprocess True \
    --report_to wandb
For the 7B model, you probably do not need to set the per-device batch size that small; 8x2 (per-device batch size 8 with 2 gradient accumulation steps) is probably sufficient.
Also, are your GPUs connected via NVLink?
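To make the batch-size suggestion concrete, here is a rough sketch of the arithmetic. It assumes the usual relationship that the global batch size is per-device batch size × gradient accumulation steps × number of GPUs, and it assumes about 665,000 training samples (a guess based on the `mix665k` file name, not a verified count). The helper names are hypothetical and not part of LLaVA or DeepSpeed.

```python
import math

NUM_GPUS = 8
# Assumption: approximate sample count inferred from the "mix665k" file name.
NUM_SAMPLES = 665_000


def global_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step across all GPUs."""
    return per_device * grad_accum * num_gpus


def micro_batches_per_gpu(num_samples: int, per_device: int, num_gpus: int) -> int:
    """Forward/backward passes each GPU runs in one epoch (rough estimate)."""
    return math.ceil(num_samples / (per_device * num_gpus))


def optimizer_steps(num_samples: int, global_bs: int) -> int:
    """Optimizer updates in one epoch (rough estimate)."""
    return math.ceil(num_samples / global_bs)


# (per_device_train_batch_size, gradient_accumulation_steps)
configs = {"reported 4x4": (4, 4), "suggested 8x2": (8, 2)}

for name, (per_device, grad_accum) in configs.items():
    gbs = global_batch_size(per_device, grad_accum, NUM_GPUS)
    print(
        f"{name}: global batch size = {gbs}, "
        f"optimizer steps/epoch = {optimizer_steps(NUM_SAMPLES, gbs)}, "
        f"micro-batches per GPU = {micro_batches_per_gpu(NUM_SAMPLES, per_device, NUM_GPUS)}"
    )
```

Both settings give the same global batch size, so the optimization schedule is unchanged, but the 8x2 setting runs about half as many forward/backward passes per GPU. With ZeRO-3, each pass involves gathering parameter shards across GPUs, so fewer, larger micro-batches may reduce communication overhead, particularly if the GPUs are not linked by NVLink. Whether batch size 8 fits in 40 GB depends on the sequence lengths in the data.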