Same problem here: ZeRO-3 hangs. With ZeRO-2, the 2015 GB of CPU memory gets exhausted during model initialization and the program is killed, while with ZeRO-3 model initialization only uses about 50 GB.
I am running on 4 nodes with 32 GPUs; my configuration is as follows:
```yaml
model_name_or_path: /data/nfs/m01096/project/Qwen2-VL-72B-Instruct

stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z3_config.json

dataset: mllm_demo,identity
template: qwen2_vl
cutoff_len: 4
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16

output_dir: saves/qwen2_vl-7b/full/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
include_tokens_per_second: true
include_num_input_tokens_seen: true

per_device_train_batch_size: 1
gradient_accumulation_steps: 1
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000

val_size: 0.1
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 500

train_from_scratch: true
```
With ZeRO-2, the model originally has 80 layers; if I cut it down to 10 layers, training runs fine. What is causing this?
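For context, here is a minimal sketch (my illustration, not part of the original report) of why the two stages behave so differently at initialization: under ZeRO-3, DeepSpeed partitions every parameter across ranks as soon as the module is constructed, so no single rank ever holds the full 72B model, whereas under ZeRO-2 each rank first materializes the complete model in CPU RAM before sharding only optimizer states and gradients. transformers enables this partitioned construction automatically via `deepspeed.zero.Init` when a stage-3 config is detected; the toy module sizes below are assumptions.

```python
# Sketch of ZeRO-3 partitioned initialization (illustrative only).
# Run under a distributed launcher such as torchrun so WORLD_SIZE/RANK are set.
import deepspeed
import torch.nn as nn

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
}

# Inside zero.Init, each rank allocates only its own shard of the weights,
# which is why stage-3 initialization stays far below the full model size.
with deepspeed.zero.Init(config_dict_or_path=ds_config):
    model = nn.Sequential(nn.Linear(8192, 8192), nn.Linear(8192, 8192))
```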
Has this issue been resolved?
Reminder
System Info
llamafactory 0.8.4.dev0
transformers 4.45.0
deepspeed 0.14.4
Reproduction
Launch command:

```bash
torchrun --nproc_per_node 8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft \
    --do_train \
    --model_name_or_path /workspace/qwen_vl/qw2_vl_7b_model \
    --dataset aiot_sft_data \
    --template qwen2_vl \
    --finetuning_type full \
    --output_dir saves/qwen2_vl-7b/full/sft/0911 \
    --overwrite_cache \
    --overwrite_output_dir \
    --warmup_steps 100 \
    --weight_decay 0.1 \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --ddp_timeout 180 \
    --learning_rate 5e-6 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --cutoff_len 4096 \
    --save_steps 1000 \
    --plot_loss \
    --num_train_epochs 3 \
    --bf16
```
DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  }
}
```
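One way to see which rank and which collective the job is stuck in before the timeout fires is to enable the distributed debug logs and dump the Python stacks of the hung processes. The environment variables and the `py-spy` call below are suggestions, not part of the original command:

```bash
# Suggested diagnostics (not from the original report).
export NCCL_DEBUG=INFO                 # log NCCL communicator setup and errors
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # log blocking / mismatched collectives

torchrun --nproc_per_node 8 src/train.py \
    --deepspeed examples/deepspeed/ds_z3_config.json \
    --stage sft ...                    # remaining arguments as above

# While the job hangs, inspect each rank's Python stack (requires py-spy):
# py-spy dump --pid <rank_pid>
```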
Expected behavior
GPU utilization stays at 100%, but the job hangs at this point, and after a while it times out.
Others
There is an issue describing a similar symptom, but the solutions suggested there did not resolve this: https://github.com/hiyouga/LLaMA-Factory/issues/3147