hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

8×A800 DeepSpeed stage-3 full-parameter SFT of Qwen2-VL-7B hangs; stage 2 trains normally #5944

Closed · gujiacheng closed this issue 3 weeks ago

gujiacheng commented 3 weeks ago

Reminder

System Info

llamafactory 0.8.4.dev0
transformers 4.45.0
deepspeed 0.14.4

Reproduction

Launch command:

torchrun --nproc_per_node 8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /workspace/qwen_vl/qw2_vl_7b_model \
    --dataset aiot_sft_data \
    --template qwen2_vl \
    --finetuning_type full \
    --output_dir saves/qwen2_vl-7b/full/sft/0911 \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --ddp_timeout 180 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --bf16

DeepSpeed config:

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
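A debugging sketch that might help localize the hang. Assumptions: the environment variables below are standard NCCL / torch.distributed switches rather than LLaMA-Factory options, and the larger `--ddp_timeout` value is only illustrative:

```bash
export NCCL_DEBUG=INFO                 # log which collective / rank NCCL is working on
export NCCL_DEBUG_SUBSYS=COLL          # restrict NCCL logs to collective operations
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra checks for mismatched collectives across ranks
torchrun --nproc_per_node 8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    ... \                              # remaining arguments unchanged from the command above
    --ddp_timeout 7200                 # 180 s may be short for ZeRO-3 init of a 7B VLM
```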

Expected behavior

GPU utilization is at 100%: (screenshot)

It hangs here: (screenshot)

After a while it times out: (screenshot)
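One way to see exactly where the stuck ranks are waiting is to dump their Python stacks. A minimal sketch, assuming py-spy is installed (`pip install py-spy`) and `<PID>` is the process id of one hung rank:

```bash
py-spy dump --pid <PID>   # prints the current Python stack of that process;
                          # a frame inside an all-gather or DeepSpeed parameter
                          # partitioning call suggests the ranks disagree on a collective
```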

Others

There is an issue with a similar symptom, but the solutions in it do not fix this case: https://github.com/hiyouga/LLaMA-Factory/issues/3147

1316829544 commented 2 weeks ago

Same problem here. ZeRO-3 hangs; with ZeRO-2, the machine's 2015 GB of CPU memory is exhausted during model initialization and the program is killed, whereas with ZeRO-3 model initialization only uses about 50 GB.

1316829544 commented 2 weeks ago

(screenshot: PixPin_2024-11-11_11-36-25)

1316829544 commented 2 weeks ago

(screenshot: PixPin_2024-11-11_11-35-23)
1316829544 commented 2 weeks ago

I am running on 4 machines with 32 GPUs in total; the configuration is as follows:

model

model_name_or_path: /data/nfs/m01096/project/Qwen2-VL-72B-Instruct

method

stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

dataset

dataset: mllm_demo,identity
template: qwen2_vl
cutoff_len: 4
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

output

output_dir: saves/qwen2_vl-7b/full/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
include_tokens_per_second: true
include_num_input_tokens_seen: true

train

per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

eval

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

train_from_scratch: true

1316829544 commented 2 weeks ago

With ZeRO-2, the model originally has 80 layers; if I cut it down to 10 layers, training runs through. What is going on?
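A rough estimate (my assumptions: Qwen2-VL-72B weights in bf16, 8 ranks per node) may explain this: ZeRO-2 keeps a full copy of the parameters on every rank, so loading alone needs about 72B × 2 bytes ≈ 144 GB per rank, i.e. on the order of 1.1 TB per node before gradients and fp32 optimizer/master copies are allocated, which plausibly exhausts the 2015 GB of host RAM. ZeRO-3 shards the parameters across all 32 ranks already at load time (transformers builds the model under deepspeed.zero.Init when stage 3 is configured), so each rank holds only a few GB of weights, consistent with the ~50 GB observed. Cutting the model from 80 layers to 10 shrinks the per-rank copy by roughly 8×, which is why ZeRO-2 then fits.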

Amb10tion commented 3 days ago

Has this problem been solved?