Closed: huyiwen closed this issue 1 month ago.
@huyiwen Thank you for reporting the issue. You can load universal checkpoints while using the Hugging Face Trainer with DeepSpeed as the backend. Please note that you need to use the latest version of DeepSpeed.
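For context, here is a minimal sketch of what resuming looks like on the Hugging Face side (the model name, dataset, and paths are placeholders, not taken from this issue):

from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

# Placeholder model; any model supported by the Trainer works the same way.
model = AutoModelForCausalLM.from_pretrained("my-base-model")

args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # DeepSpeed config with "universal_checkpoint": true
    bf16=True,
    save_strategy="steps",
    save_steps=4000,
)

# my_train_dataset is assumed to be defined elsewhere.
trainer = Trainer(model=model, args=args, train_dataset=my_train_dataset)

# Point resume_from_checkpoint at the converted (universal) checkpoint directory.
trainer.train(resume_from_checkpoint="out/checkpoint-4000")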
Also, could you share the stack trace of the error?
Thank you for helping me answer my question. Yes, I'm using the latest versions of DeepSpeed (0.15.0) and Transformers (4.44.0).
Unfortunately, I didn't get any backtrace.
I didn't run into the same problem when resuming a universal checkpoint with the HF Trainer, but there is an awkward manual step: I have to change load_universal_checkpoint() to return "true" by hand. Even when I edit ds_config.json before resuming, self._config.load_universal_checkpoint is still "False".
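For what it's worth, here is a rough sketch of that manual override without editing the installed source (the method name comes from this thread; please verify it against your DeepSpeed version):

from deepspeed.runtime.engine import DeepSpeedEngine

# Workaround, not an official API: make the engine always report that a
# universal checkpoint should be loaded, mirroring the hand edit above.
DeepSpeedEngine.load_universal_checkpoint = lambda self: True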
Thanks for sharing your results.
I did the same change but the issue still exists.
Here's my launch script:
torchrun --nproc_per_node 2 \
--nnodes 1 \
--node_rank 0 \
--master_addr "183.174.228.167" \
--master_port=${MASTER_PORT} \
train.py \
--model_name_or_path ${MODEL_PATH} \
--data_path ${DATA_PATH} \
--output_dir ${OUTPUT_DIR} \
--bf16 True \
--num_train_epochs $STAGE \
--model_max_length $MODEL_MAX_LENGTH \
--per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
--eval_strategy "no" \
--save_strategy "steps" \
--save_steps 4000 \
--save_total_limit 3 \
--learning_rate $LEARNING_RATE \
--warmup_ratio $WARMUP_RATIO \
--weight_decay $WEIGHT_DECAY \
--logging_steps 2 \
--deepspeed ${DEEPSPEED_CONFIG_PATH} \
--gradient_checkpointing True \
--deepspeed_gradient_checkpointing False \
--report_to none \
--tf32 True \
--lr_scheduler_type "linear" \
--flash_attention \
--use_wsd \
--log_dir $LOG_DIR \
--profile False \
--torch_compile \
--torch_empty_cache_steps 1000 \
--max_grad_norm 1 \
--hyper_param_decay_rate 0 \
--logging_dir ${LOG_DIR} \
--ddp_timeout 3600 \
--start_lambda $START_LAMBDA \
--end_lambda $END_LAMBDA \
--start_global_step $START_GLOBAL_STEP \
--end_global_step $END_GLOBAL_STEP \
--resume_from_checkpoint $MODEL_PATH
And here is my DeepSpeed config:
{
    "bf16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 1e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 16,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false,
    "activation_checkpointing": {
        "partition_activations": false,
        "cpu_checkpointing": true,
        "contiguous_memory_optimization": false,
        "number_checkpoints": null,
        "synchronize_checkpoint_boundary": false,
        "profile": false
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "no_pipeline_parallel": true,
    "universal_checkpoint": true
}
@xylian86 I've fixed the problem by deleting the rng_state files! Thanks for helping! Maybe this solution could be added to the documentation, since it took a lot of exploration to find.
Here's my solution:
Step 1: Get the universal checkpoint by following the tutorial.
Step 2: Modify the DeepSpeed source so that load_universal_checkpoint forces loading of the universal checkpoint.
Step 3: Delete the rng_state.pth files in the HF Trainer checkpoint (see the sketch below).
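For Step 3, here is a small sketch of what deleting the RNG state files looks like (the checkpoint path is a placeholder; the Trainer writes rng_state.pth, or rng_state_<rank>.pth when training with multiple processes):

from pathlib import Path

checkpoint_dir = Path("output/checkpoint-4000")  # placeholder: your HF Trainer checkpoint
# Remove the saved RNG states so resuming does not try to restore them.
for rng_file in checkpoint_dir.glob("rng_state*.pth"):
    print(f"Deleting {rng_file}")
    rng_file.unlink()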
Describe the bug I'm currently using the HF Trainer for training, with the HF learning rate scheduler and DeepSpeed optimizer. I've encountered an issue with loading universal checkpoints. The HF Trainer does not natively support loading universal checkpoints. Is there a way to load universal checkpoints while using the HF Trainer? If not, is it necessary to switch to DeepSpeed for training?
I managed to load the universal checkpoint by forcing load_universal_checkpoint to return True. However, the training loop exits silently after the first iteration. Related issue: https://github.com/microsoft/DeepSpeed/issues/5430
@xylian86
Expected behavior I want to load universal checkpoints with the HF Trainer.
ds_report output
System info (please complete the following information):
Launcher context Launch experiment with torchrun
Docker context Not using docker