microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] failed to find frozen {param} in named params #6620

Open ssklzx opened 20 hours ago

ssklzx commented 20 hours ago

Describe the bug
failed to find frozen {param} in named params

To Reproduce
Use Accelerate with DeepSpeed to train FLUX:

    accelerator = Accelerator()
    model, optimizer, data = accelerator.prepare(model, optimizer, data)
    device_map = {}
    model = accelerate.dispatch_model(model, device_map=device_map)
    accelerator.save_state(save_path)
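A more self-contained version of that sequence might look like the sketch below. It is only a sketch, not the actual training script: the toy model, optimizer, dataloader, device_map, and checkpoint path are placeholders, and it assumes the script is started with accelerate launch using a DeepSpeed config, so that accelerator.save_state ends up in DeepSpeed's save_checkpoint. One base weight is frozen, as in LoRA training, which is what exercises the frozen-parameter path.

    import torch
    import accelerate
    from accelerate import Accelerator
    from torch.utils.data import DataLoader

    # Toy stand-ins for the real FLUX model / optimizer / dataloader (placeholders).
    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
    model[0].weight.requires_grad_(False)  # frozen base weight, as in LoRA training
    optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad])
    data = DataLoader(torch.randn(32, 8), batch_size=4)

    accelerator = Accelerator()
    model, optimizer, data = accelerator.prepare(model, optimizer, data)

    # Re-dispatching after prepare(); this device_map is only a placeholder.
    device_map = {"": accelerator.device}
    model = accelerate.dispatch_model(model, device_map=device_map)

    # With the DeepSpeed backend, this call raises the ValueError reported below.
    accelerator.save_state("checkpoints/step_0")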

When I call accelerate.dispatch_model after accelerator.prepare, saving the checkpoint raises the following error:

Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in <module>
    main()
  File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
    accelerator.save_state(save_path)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
    model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
    self._save_checkpoint(save_dir,
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
    frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
    raise ValueError(f"failed to find frozen {param} in named params")
ValueError: failed to find frozen Parameter containing:
tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
        [-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
        [ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
        ...,
        [-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
        [-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
        [-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
       device='cuda:1', dtype=torch.bfloat16) in named params

[the rank0-prefixed traceback repeats the frames above]

wandb: You can sync this run to the cloud by running:
wandb: wandb sync /root/paddlejob/workspace/env_run/x-flux/wandb/offline-run-20241011_145931-2vi5cs6v
wandb: Find logs at: wandb/offline-run-20241011_145931-2vi5cs6v/logs
E1011 15:00:14.591000 140520213088000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 30158) of binary: /root/paddlejob/workspace/env_run/xflux_train_python3/bin/python3
Traceback (most recent call last):
  File "/root/paddlejob/workspace/env_run/x-flux/../xflux_train_python3/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
    args.func(args)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
    deepspeed_launcher(args)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
    distrib_run.run(args)
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

train_flux_lora_deepspeed.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-11_15:00:14
  host      : yq01-sys-hic-k8s-v100-box-a225-0075.yq01.baidu.com
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 30158)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

By reading the source code, I found the cause of the error: accelerator.prepare records a mapping from each parameter object to its name, in this line:

    self.param_names = {param: name for name, param in model.named_parameters()}

But when I then call accelerate.dispatch_model, the model's parameter objects change (their addresses are different), so when the checkpoint is saved, looking up a parameter's name in param_names fails. This is the function that raises the error:

    def _get_zero_frozen_param_attributes(self, attr_func):
        frozen_param_fragments = OrderedDict()
        for param in self.module.parameters():
            if param.requires_grad:
                continue
            if param not in self.param_names:
                raise ValueError(f"failed to find frozen {param} in named params")
            name = self.param_names[param]
            frozen_param_fragments[name] = attr_func(param)
        return frozen_param_fragments

Why can't this function be written as follows? It is more concise and would also solve this problem:

    def _get_zero_frozen_param_attributes(self, attr_func):
        frozen_param_fragments = OrderedDict()
        for name, param in self.module.named_parameters():
            if param.requires_grad:
                continue
            frozen_param_fragments[name] = attr_func(param)
        return frozen_param_fragments
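The identity-versus-name lookup issue can be illustrated outside DeepSpeed with plain PyTorch. The sketch below is only an illustration of the failure mode described above: the replaced parameter is simulated by reassigning model.weight by hand rather than by calling dispatch_model, and p.shape stands in for attr_func(param).

    from collections import OrderedDict
    import torch

    model = torch.nn.Linear(4, 4)
    model.weight.requires_grad_(False)  # a frozen parameter

    # What is recorded up front: a dict keyed by the Parameter objects themselves.
    param_names = {param: name for name, param in model.named_parameters()}

    # Simulate a later re-dispatch that swaps in a fresh Parameter object.
    model.weight = torch.nn.Parameter(model.weight.data.clone(), requires_grad=False)

    # Identity-based lookup (current behaviour): the new object is not a key.
    print(model.weight in param_names)  # False -> the "failed to find frozen ..." ValueError

    # Name-based iteration (the proposed rewrite): still resolves the frozen weight.
    frozen = OrderedDict(
        (name, p.shape) for name, p in model.named_parameters() if not p.requires_grad
    )
    print(frozen)  # OrderedDict([('weight', torch.Size([4, 4]))])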
jomayeri commented 10 hours ago

@ssklzx Please make a PR with this change and we will review it.