huggingface / accelerate


How could I convert ZeRO-0 deepspeed weights into fp32 model checkpoint? #3210

Open · liming-ai opened this issue 2 weeks ago

liming-ai commented 2 weeks ago

This GitHub issue is also open in the DeepSpeed repo.

I use DeepSpeed ZeRO-0 to train a diffusion model on multiple nodes (multiple GPUs per node) with the Hugging Face diffusers training scripts. The accelerate config is set to:

deepspeed_config:
  deepspeed_hostfile: /opt/tiger/hostfile
  deepspeed_multinode_launcher: pdsh
  gradient_clipping: auto
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 0
distributed_type: DEEPSPEED
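
For context, this is roughly what that config corresponds to programmatically (a minimal sketch, not my actual launcher; I go through accelerate launch with the YAML above, and I assume mixed_precision="bf16" here because the checkpoint files below are bf16):

# Sketch only: the same ZeRO-0 settings expressed via accelerate's DeepSpeedPlugin.
# Assumption: bf16 mixed precision, matching the bf16_* files in the checkpoint.
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=0,                     # ZeRO stage 0, i.e. no partitioning
    offload_optimizer_device="none",
    offload_param_device="none",
    zero3_init_flag=False,
)
accelerator = Accelerator(mixed_precision="bf16", deepspeed_plugin=deepspeed_plugin)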

When I try to convert the DeepSpeed weights to an fp32 checkpoint with zero_to_fp32.py, I get this error:

Traceback (most recent call last):
  File "code/diffusers/tools/zero_to_fp32.py", line 601, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
  File "code/diffusers/tools/zero_to_fp32.py", line 536, in convert_zero_checkpoint_to_fp32_state_dict
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 521, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 205, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "code/diffusers/tools/zero_to_fp32.py", line 153, in parse_optim_states
    raise ValueError(f"{files[0]} is not a zero checkpoint")
ValueError: work_dirs/checkpoint-2000/pytorch_model/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt is not a zero checkpoint
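
The same error can be reproduced by calling the conversion helper directly, i.e. the same function the script invokes in the traceback above (a minimal sketch; I'm not sure whether the copy of zero_to_fp32.py shipped in my diffusers tools folder differs from the one bundled with DeepSpeed):

# Minimal reproduction sketch: call the helper that zero_to_fp32.py wraps.
# For my ZeRO-0 run this raises the same "is not a zero checkpoint" ValueError.
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# The directory passed is the one containing the `latest` file and the
# `pytorch_model` tag subdirectory (see the tree below).
state_dict = get_fp32_state_dict_from_zero_checkpoint("work_dirs/checkpoint-2000")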

My checkpoint directory tree (under work_dirs) is:

work_dirs/checkpoint-2000
├── latest
├── pytorch_model
│   ├── bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_10_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_11_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_12_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_13_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_14_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_15_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_1_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_2_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_3_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_4_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_5_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_6_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_7_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_8_mp_rank_00_optim_states.pt
│   ├── bf16_zero_pp_rank_9_mp_rank_00_optim_states.pt
│   └── mp_rank_00_model_states.pt
├── random_states_0.pkl
├── random_states_10.pkl
├── random_states_11.pkl
├── random_states_12.pkl
├── random_states_13.pkl
├── random_states_14.pkl
├── random_states_15.pkl
├── random_states_1.pkl
├── random_states_2.pkl
├── random_states_3.pkl
├── random_states_4.pkl
├── random_states_5.pkl
├── random_states_6.pkl
├── random_states_7.pkl
├── random_states_8.pkl
├── random_states_9.pkl
├── scheduler.bin
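
Since ZeRO-0 does not partition the parameters, I assume the full bf16 module state dict is already stored inside mp_rank_00_model_states.pt (under the "module" key, which is also where zero_to_fp32.py reads buffers from). Would something like the sketch below be an acceptable workaround, or does it lose the fp32 master weights that the bf16 optimizer keeps in the per-rank bf16_zero_pp_rank_*_optim_states.pt files?

# Workaround sketch (assumption: for ZeRO-0 the full module state dict lives under
# the "module" key of mp_rank_00_model_states.pt). This only upcasts the bf16
# training weights to fp32; it does NOT recover the optimizer's fp32 master copy.
import torch

ckpt = torch.load(
    "work_dirs/checkpoint-2000/pytorch_model/mp_rank_00_model_states.pt",
    map_location="cpu",
    weights_only=False,  # DeepSpeed checkpoints contain non-tensor objects
)
state_dict = {
    k: v.float() if torch.is_floating_point(v) else v
    for k, v in ckpt["module"].items()
}
torch.save(state_dict, "work_dirs/checkpoint-2000/pytorch_model_fp32.bin")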