microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] failed to find frozen {param} in named params #6620

Open ssklzx opened 20 hours ago

ssklzx commented 20 hours ago

Describe the bug
failed to find frozen {param} in named params

To Reproduce
Use Accelerate with DeepSpeed to train FLUX:

    accelerator = Accelerator()
    model, optimizer, data = accelerator.prepare(model, optimizer, data)
    device_map = {}
    model = accelerate.dispatch_model(model, device_map=device_map)
    accelerator.save_state(save_path)
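A more self-contained version of that sequence might look like the sketch below. It is only a sketch, not the actual training script: the toy model, optimizer, dataloader, device_map, and checkpoint path are placeholders, and it assumes the script is started with accelerate launch using a DeepSpeed config, so that accelerator.save_state ends up in DeepSpeed's save_checkpoint. One base weight is frozen, as in LoRA training, which is what exercises the frozen-parameter path.

    import torch
    import accelerate
    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    # Toy stand-ins for the real FLUX model / optimizer / dataloader (placeholders).
    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
    model[0].weight.requires_grad_(False)  # frozen base weight, as in LoRA training
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad])
    data = DataLoader(torch.randn(32, 8), batch_size=4)

    accelerator = Accelerator()
    model, optimizer, data = accelerator.prepare(model, optimizer, data)

    # Re-dispatching after prepare(); this device_map is only a placeholder.
    device_map = {"": accelerator.device}
    model = accelerate.dispatch_model(model, device_map=device_map)

    # With the DeepSpeed backend, this call raises the ValueError reported below.
    accelerator.save_state("checkpoints/step_0")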

When I call accelerate.dispatch_model after accelerator.prepare, saving the checkpoint raises the following error:

Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in <module>
    main()
  File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
    accelerator.save_state(save_path)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
    model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
    self._save_checkpoint(save_dir,
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
    frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
    raise ValueError(f"failed to find frozen {param} in named params")
ValueError: failed to find frozen Parameter containing:
tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
        [-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
        [ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
        ...,
        [-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
        [-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
        [-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
       device='cuda:1', dtype=torch.bfloat16) in named params

[the rank0-prefixed traceback repeats the frames above]

wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/paddlejob/workspace/env_run/x-flux/wandb/offline-run-20241011_145931-2vi5cs6v
wandb: Find logs at: wandb/offline-run-20241011_145931-2vi5cs6v/logs
E1011 15:00:14.591000 140520213088000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 30158) of binary: /root/paddlejob/workspace/env_run/xflux_train_python3/bin/python3
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/x-flux/../xflux_train_python3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
    deepspeed_launcher(args)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_flux_lora_deepspeed.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-11_15:00:14
  host      : yq01-sys-hic-k8s-v100-box-a225-0075.yq01.baidu.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30158)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

By reading the source code, I found the cause of the error: accelerator.prepare records a mapping from each parameter object to its name, in this line:

    self.param_names = {param: name for name, param in model.named_parameters()}

But when I then call accelerate.dispatch_model, the model's parameter objects change (their addresses are different), so when the checkpoint is saved, looking up a parameter's name in param_names fails. This is the function that raises the error:

    def _get_zero_frozen_param_attributes(self, attr_func):
        frozen_param_fragments = OrderedDict()
        for param in self.module.parameters():
            if param.requires_grad:
                continue
            if param not in self.param_names:
                raise ValueError(f"failed to find frozen {param} in named params")
            name = self.param_names[param]
            frozen_param_fragments[name] = attr_func(param)
        return frozen_param_fragments

Why can't this function be written as follows? It is more concise and would also solve this problem:

    def _get_zero_frozen_param_attributes(self, attr_func):
        frozen_param_fragments = OrderedDict()
        for name, param in self.module.named_parameters():
            if param.requires_grad:
                continue
            frozen_param_fragments[name] = attr_func(param)
        return frozen_param_fragments
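The identity-versus-name lookup issue can be illustrated outside DeepSpeed with plain PyTorch. The sketch below is only an illustration of the failure mode described above: the replaced parameter is simulated by reassigning model.weight by hand rather than by calling dispatch_model, and p.shape stands in for attr_func(param).

    from collections import OrderedDict
    import torch

    model = torch.nn.Linear(4, 4)
    model.weight.requires_grad_(False)  # a frozen parameter

    # What is recorded up front: a dict keyed by the Parameter objects themselves.
    param_names = {param: name for name, param in model.named_parameters()}

    # Simulate a later re-dispatch that swaps in a fresh Parameter object.
    model.weight = torch.nn.Parameter(model.weight.data.clone(), requires_grad=False)

    # Identity-based lookup (current behaviour): the new object is not a key.
    print(model.weight in param_names)  # False -> the "failed to find frozen ..." ValueError

    # Name-based iteration (the proposed rewrite): still resolves the frozen weight.
    frozen = OrderedDict(
        (name, p.shape) for name, p in model.named_parameters() if not p.requires_grad
    )
    print(frozen)  # OrderedDict([('weight', torch.Size([4, 4]))])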
jomayeri commented 10 hours ago

@ssklzx Please make a PR with this change and we will review it.