huyiwen opened 2 months ago
Here's my DeepSpeed config JSON:
```json
{
  "bf16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 1e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 1e8,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 16,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false,
  "activation_checkpointing": {
    "partition_activations": false,
    "cpu_checkpointing": true,
    "contiguous_memory_optimization": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "no_pipeline_parallel": true,
  "load_universal_checkpoint": true
}
```
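For reference, this is roughly how a config like the one above is wired into the `Trainer`. Everything below (the `gpt2` model, the dummy dataset, `ds_config.json`, the `outputs/checkpoint-1000` resume path) is a placeholder; it is a minimal sketch of the setup, not my actual training script, and it would be launched with a distributed launcher such as `deepspeed`.

```python
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Placeholder model and data: the point is only how the DeepSpeed config file
# and the resume path are passed to the Trainer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")
dataset = Dataset.from_dict({"text": ["hello world"] * 32}).map(
    lambda batch: tokenizer(batch["text"]), batched=True
)

args = TrainingArguments(
    output_dir="outputs",
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    deepspeed="ds_config.json",  # the config shown above
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

# Resuming from a converted universal checkpoint with a different world size
# is where loading fails.
trainer.train(resume_from_checkpoint="outputs/checkpoint-1000")
```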
Another related issue: https://github.com/microsoft/DeepSpeed/issues/5405
Hello @ArthurZucker and @muellerzr. I can create a pull request to address this issue. I resolved it locally by deleting all of the `rng_state` files, since they were saved with a different world size.
Before I start on the PR, I would like to confirm that NOT loading these `rng_state` files has no side effects.
We can skip loading these `rng_state` files and add a warning instead.
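A minimal sketch of what that could look like; the helper name and the warning text are illustrative, not the actual `Trainer` code:

```python
import glob
import logging
import os

logger = logging.getLogger(__name__)


def maybe_skip_rng_state(checkpoint_dir: str, world_size: int) -> bool:
    """Illustrative helper: return True if RNG state restoration should be
    skipped because the checkpoint was saved with a different world size."""
    rng_files = glob.glob(os.path.join(checkpoint_dir, "rng_state_*.pth"))
    if rng_files and len(rng_files) != world_size:
        logger.warning(
            "Checkpoint contains %d rng_state files but the current world size "
            "is %d; skipping RNG state restoration. The resumed run will not be "
            "bit-for-bit reproducible.",
            len(rng_files),
            world_size,
        )
        return True
    return False
```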
Sure, feel free to open a PR!
System Info

transformers version: 4.44.2

Who can help?

@muellerzr

Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction

The Universal Checkpointing feature allows loading with different world sizes. However, when using the Hugging Face `Trainer`, loading the converted universal checkpoint fails. The failure seems to be due to `HfTrainerDeepSpeedConfig` not correctly handling the `"load_universal_checkpoint": true` or `"universal_checkpoint": true` arguments in the DeepSpeed configuration. Consequently, the `load_universal_checkpoint` function returns `False`.

Related Issues:

Expected behavior

Universal checkpoint should be loaded correctly.
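One way to narrow down where the flag is dropped is to compare the raw JSON with what the Trainer-side config wrapper reports. This is a diagnostic sketch, not a fix, and it assumes `HfTrainerDeepSpeedConfig` (with its inherited `get_value` accessor) is importable from `transformers.integrations.deepspeed` in this version:

```python
import json

from transformers.integrations.deepspeed import HfTrainerDeepSpeedConfig

with open("ds_config.json") as f:  # the config shown above
    raw = json.load(f)
print("raw JSON value:", raw.get("load_universal_checkpoint"))  # expect True

# What the Trainer-side wrapper reports before the "auto" values are filled in
# (trainer_config_process / trainer_config_finalize run later, inside the Trainer).
hf_ds_config = HfTrainerDeepSpeedConfig(raw)
print("wrapper value:", hf_ds_config.get_value("load_universal_checkpoint"))
```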