huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Failed to load universal_checkpoint with deepspeed integration #33157

Open · huyiwen opened this issue 2 months ago

huyiwen commented 2 months ago

System Info

Who can help?

@muellerzr

Information

Tasks

Reproduction

DeepSpeed's Universal Checkpointing feature allows a checkpoint to be resumed with a different world size than the one it was saved with. However, when the converted universal checkpoint is loaded through the Hugging Face Trainer, loading fails.

The failure seems to be due to HfTrainerDeepSpeedConfig not correctly handling the "load_universal_checkpoint": true (or "universal_checkpoint": true) key in the DeepSpeed configuration. As a consequence, the engine's load_universal_checkpoint function returns False and the regular loading path is taken instead.
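
For context, a minimal reproduction sketch of the flow I mean; the model, toy dataset, and paths are placeholders rather than my real setup, and the script assumes the checkpoint has already been converted with DeepSpeed's ds_to_universal.py:

```python
# Minimal reproduction sketch. Model, dataset, and paths are placeholders;
# launch under the DeepSpeed launcher, e.g. `deepspeed --num_gpus=2 repro.py`.
import torch
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        ids = torch.arange(16)
        return {"input_ids": ids, "labels": ids.clone()}

args = TrainingArguments(
    output_dir="out",
    deepspeed="ds_config.json",  # the config posted below
    per_device_train_batch_size=1,
    max_steps=10,
    bf16=True,
)
trainer = Trainer(
    model=AutoModelForCausalLM.from_pretrained("gpt2"),
    args=args,
    train_dataset=ToyDataset(),
)
# After converting out/checkpoint-10 with DeepSpeed's ds_to_universal.py and
# relaunching with a different world size, this resume should take the
# universal-checkpoint path, but it does not:
trainer.train(resume_from_checkpoint="out/checkpoint-10")
```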

Related Issues:

Expected behavior

Universal checkpoint should be loaded correctly.

huyiwen commented 2 months ago

Here's my DeepSpeed config JSON:

{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "no_pipeline_parallel": true,
  "load_universal_checkpoint": true
}
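
To narrow down where the flag gets lost, here is a quick diagnostic sketch against the HF-side wrapper. It assumes HfTrainerDeepSpeedConfig and its get_value helper still look the way I read them in the current source:

```python
# Diagnostic sketch: check whether the flag is visible to the HF-side wrapper.
# get_value() reads a key from the stored config dict (per my reading of
# transformers.integrations.deepspeed; the API may have drifted).
from transformers.integrations.deepspeed import HfTrainerDeepSpeedConfig

hf_ds_config = HfTrainerDeepSpeedConfig("ds_config.json")
print(hf_ds_config.get_value("load_universal_checkpoint"))  # prints True here
# ...yet the engine the Trainer builds still reports False from
# engine.load_universal_checkpoint(), which is the failure described above.
```
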
huyiwen commented 2 months ago

Another related issue: https://github.com/microsoft/DeepSpeed/issues/5405

huyiwen commented 2 months ago

Hello @ArthurZucker and @muellerzr. I can open a pull request to address this issue. I resolved it on my side by deleting all the rng_state files, since they were saved with a different world size.

Before starting on the PR, I would like to make sure that NOT loading these rng_state files has no side effects.
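
For reference, roughly the manual workaround I used; the checkpoint path is a placeholder:

```python
# Workaround sketch: drop the per-rank RNG files from the converted checkpoint
# before resuming. "out/checkpoint-10" is a placeholder path.
import glob
import os

for path in glob.glob(os.path.join("out/checkpoint-10", "rng_state*.pth")):
    os.remove(path)
    print("removed", path)
```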

huyiwen commented 2 months ago

We can skip these rng_state files and emit a warning instead.
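
Something along these lines; maybe_load_rng_state is a hypothetical helper sketching the behavior, simplified from what Trainer._load_rng_state would need to do:

```python
# Sketch of the proposed behavior, using a hypothetical helper. The Trainer
# saves one rng_state_{rank}.pth per process, so after a world-size change
# some ranks legitimately have no file to restore.
import logging
import os

import torch

logger = logging.getLogger(__name__)

def maybe_load_rng_state(checkpoint_dir: str, process_index: int):
    path = os.path.join(checkpoint_dir, f"rng_state_{process_index}.pth")
    if not os.path.isfile(path):
        logger.warning(
            "%s not found; skipping RNG state restoration. Data ordering and "
            "dropout will not be bit-for-bit reproducible after resume.", path
        )
        return None
    return torch.load(path)
```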

ArthurZucker commented 1 month ago

Sure, feel free to open a PR!