microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Universal checkpoint incompatibility with HF Trainer #6470

Closed: huyiwen closed this issue 1 month ago

huyiwen commented 2 months ago

Describe the bug

I'm currently using the HF Trainer for training, with the HF learning rate scheduler and the DeepSpeed optimizer. I've encountered an issue with loading universal checkpoints: the HF Trainer does not natively support loading them. Is there a way to load universal checkpoints while using the HF Trainer? If not, is it necessary to switch to DeepSpeed for training?

I managed to load the universal checkpoint by forcing load_universal_checkpoint to return True. However, the training loop exits silently after the first iteration.
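
For reference, a minimal sketch of this kind of workaround (not the reporter's actual code): DeepSpeedEngine.load_universal_checkpoint normally just returns the universal_checkpoint flag from the engine config, so patching it to always return True before Trainer.train(resume_from_checkpoint=...) is called makes the engine take the universal-checkpoint loading path.

# Hedged sketch only: monkey-patch DeepSpeed so the engine always reports that
# universal-checkpoint loading is enabled, regardless of what ds_config.json says.
from deepspeed.runtime.engine import DeepSpeedEngine

DeepSpeedEngine.load_universal_checkpoint = lambda self: True  # force the universal loading path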

Related issue: https://github.com/microsoft/DeepSpeed/issues/5430

@xylian86

Expected behavior

I want to load a universal checkpoint with the HF Trainer.

ds_report output

[2024-09-01 22:57:13,255] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
 [WARNING]  FP Quantizer is using an untested triton version (3.0.0), only 2.3.0 and 2.3.1 are known to be compatible with these kernels
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
 [WARNING]  gds requires the dev libaio .so object and headers but these were not found.
 [WARNING]  gds: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
gds .................... [NO] ....... [NO]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/textbox/miniconda3/envs/hyw/lib/python3.8/site-packages/torch']
torch version .................... 2.4.0+cu121
deepspeed install path ........... ['/home/textbox/miniconda3/envs/hyw/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.15.0, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.4
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 251.64 GB

System info (please complete the following information):

Launcher context: launched with torchrun

Docker context: not using Docker

xylian86 commented 2 months ago

@huyiwen Thank you for reporting the issue. You can load universal checkpoints while using the Hugging Face Trainer with DeepSpeed as the backend. Please note that you need to use the latest version of DeepSpeed.

Also, could you share the stack trace of the error?

huyiwen commented 2 months ago

Thank you for your help. Yes, I'm using the latest versions of DeepSpeed (0.15.0) and Transformers (4.44.0).

Unfortunately, I didn't get any backtrace.

[screenshot]

cageyoko commented 2 months ago

I didn't run into the same problem when resuming a universal checkpoint with the HF Trainer, but there is an awkward manual step: I had to change load_universal_checkpoint() to return "true" by hand. [screenshot] When I resume my checkpoint, even though I change ds_config.json, self._config.load_universal_checkpoint is still "False". [screenshot]

> @huyiwen Thank you for reporting the issue. You can load universal checkpoints while using the Hugging Face Trainer with DeepSpeed as the backend. Please note that you need to use the latest version of DeepSpeed.
>
> And is it possible to share the stack trace of the error?

huyiwen commented 2 months ago

> I didn't run into the same problem when resuming a universal checkpoint with the HF Trainer, but there is an awkward manual step: I had to change load_universal_checkpoint() to return "true" by hand. When I resume my checkpoint, even though I change ds_config.json, self._config.load_universal_checkpoint is still "False".

Thanks for sharing your results.

I made the same change, but the issue still exists.

huyiwen commented 2 months ago

Here's my launch script:

torchrun --nproc_per_node 2 \
    --nnodes 1 \
    --node_rank 0 \
    --master_addr "183.174.228.167" \
    --master_port=${MASTER_PORT} \
    train.py \
    --model_name_or_path ${MODEL_PATH} \
    --data_path ${DATA_PATH} \
    --output_dir ${OUTPUT_DIR} \
    --bf16 True \
    --num_train_epochs $STAGE \
    --model_max_length $MODEL_MAX_LENGTH \
    --per_device_train_batch_size $PER_DEVICE_TRAIN_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --eval_strategy "no" \
    --save_strategy "steps" \
    --save_steps 4000 \
    --save_total_limit 3 \
    --learning_rate $LEARNING_RATE \
    --warmup_ratio $WARMUP_RATIO \
    --weight_decay $WEIGHT_DECAY \
    --logging_steps 2 \
    --deepspeed ${DEEPSPEED_CONFIG_PATH} \
    --gradient_checkpointing True \
    --deepspeed_gradient_checkpointing False \
    --report_to none \
    --tf32 True \
    --lr_scheduler_type "linear" \
    --flash_attention \
    --use_wsd \
    --log_dir $LOG_DIR \
    --profile False \
    --torch_compile \
    --torch_empty_cache_steps 1000 \
    --max_grad_norm 1 \
    --hyper_param_decay_rate 0 \
    --logging_dir ${LOG_DIR} \
    --ddp_timeout 3600 \
    --start_lambda $START_LAMBDA \
    --end_lambda $END_LAMBDA \
    --start_global_step $START_GLOBAL_STEP \
    --end_global_step $END_GLOBAL_STEP \
    --resume_from_checkpoint $MODEL_PATH

And here's the DeepSpeed config passed via ${DEEPSPEED_CONFIG_PATH}:

{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "no_pipeline_parallel": true,
  "universal_checkpoint": true
}

huyiwen commented 1 month ago

@xylian86 I've fixed the problem by deleting the rng_state files! Thanks for the help! Maybe you could add this solution to the documentation, since it took a lot of exploration to figure out.

huyiwen commented 1 month ago

Here's my solution:

Step 1: Get the universal checkpoint by following the tutorial.
Step 2: Modify the DeepSpeed source so that load_universal_checkpoint forces loading of the universal checkpoint.
Step 3: Delete the rng_state.pth files in the HF Trainer checkpoint.
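
A sketch of what these steps can look like in practice. The checkpoint paths below are illustrative, not taken from this thread, and the converter invocation may differ between DeepSpeed versions (see the universal-checkpoint tutorial for the exact flags).

# Illustrative sketch of the three steps above; paths are hypothetical.
import glob
import os

from deepspeed.runtime.engine import DeepSpeedEngine

# Step 1: convert the ZeRO checkpoint to a universal checkpoint with DeepSpeed's
# ds_to_universal.py converter (run separately, roughly):
#   python ds_to_universal.py --input_folder checkpoint-4000 --output_folder checkpoint-4000-universal

# Step 2: force the engine to take the universal-checkpoint loading path
# (same idea as patching load_universal_checkpoint in the bug description).
DeepSpeedEngine.load_universal_checkpoint = lambda self: True

# Step 3: delete the HF Trainer RNG-state files from the checkpoint directory
# passed to --resume_from_checkpoint, so the Trainer skips restoring them.
for rng_file in glob.glob("checkpoint-4000/rng_state*.pth"):  # hypothetical path
    os.remove(rng_file)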