When I tried to convert the DeepSpeed weights to an fp32 checkpoint with zero_to_fp32.py, I got the following error:
```
Traceback (most recent call last):
  File "code/diffusers/tools/zero_to_fp32.py", line 601, in <module>
    convert_zero_checkpoint_to_fp32_state_dict(args.checkpoint_dir,
  File "code/diffusers/tools/zero_to_fp32.py", line 536, in convert_zero_checkpoint_to_fp32_state_dict
    state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir, tag, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 521, in get_fp32_state_dict_from_zero_checkpoint
    return _get_fp32_state_dict_from_zero_checkpoint(ds_checkpoint_dir, exclude_frozen_parameters)
  File "code/diffusers/tools/zero_to_fp32.py", line 205, in _get_fp32_state_dict_from_zero_checkpoint
    zero_stage, world_size, fp32_flat_groups = parse_optim_states(optim_files, ds_checkpoint_dir)
  File "code/diffusers/tools/zero_to_fp32.py", line 153, in parse_optim_states
    raise ValueError(f"{files[0]} is not a zero checkpoint")
ValueError: work_dirs/checkpoint-2000/pytorch_model/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt is not a zero checkpoint
```
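As far as I can tell, the check that fails (`parse_optim_states`) expects ZeRO partitioning metadata (a `zero_stage` entry and the flattened fp32 groups) inside the saved optimizer states, which a ZeRO-0 / plain bf16-optimizer checkpoint doesn't seem to carry. A minimal sketch for inspecting what the rejected file actually contains (only the path from the traceback above is assumed):

```python
# Sketch: peek inside the shard that zero_to_fp32.py rejects.
# The path is taken from the ValueError above; the key names are what the
# script appears to look for, not guaranteed to match every DeepSpeed version.
import torch

path = "work_dirs/checkpoint-2000/pytorch_model/bf16_zero_pp_rank_0_mp_rank_00_optim_states.pt"
shard = torch.load(path, map_location="cpu")

print(list(shard.keys()))                                  # top-level keys of the shard
print(list(shard.get("optimizer_state_dict", {}).keys()))  # "zero_stage" should appear here for ZeRO-1/2/3
```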
A GitHub issue for this is also open in the DeepSpeed repo.
I use DeepSpeed ZeRO-0 to train a diffusion model on multiple GPU nodes with the Hugging Face diffusers training scripts. The accelerate config is set to:
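Roughly speaking (every value below is a placeholder for illustration, not the exact setting from this run), an `accelerate config` YAML for a multi-node DeepSpeed ZeRO-0 + bf16 run looks something like:

```yaml
# Illustrative placeholder config, not the actual one used for this run.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_multinode_launcher: standard
  gradient_accumulation_steps: 1
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 0
machine_rank: 0
main_process_ip: 10.0.0.1
main_process_port: 29500
main_training_function: main
mixed_precision: bf16
num_machines: 2
num_processes: 16
rdzv_backend: static
same_network: true
use_cpu: false
```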
My work_dir tree structure is:
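A quick way to dump that layout for comparison (a sketch; only the checkpoint path from the error message above is assumed):

```python
# Sketch: print the checkpoint directory layout so it can be compared against
# what zero_to_fp32.py expects. Besides the *_optim_states.pt shards, a DeepSpeed
# save normally also contains a model-states file (e.g. mp_rank_00_model_states.pt);
# the file names mentioned here are typical, not guaranteed for every setup.
import os

root = "work_dirs/checkpoint-2000"
for dirpath, _dirnames, filenames in os.walk(root):
    depth = dirpath[len(root):].count(os.sep)
    print("  " * depth + os.path.basename(dirpath) + "/")
    for name in sorted(filenames):
        print("  " * (depth + 1) + name)
```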