### Describe the bug

`failed to find frozen {param} in named params`
### To Reproduce

Use accelerate + DeepSpeed to train FLUX.1:

```python
accelerator = Accelerator()
# prepare() wraps the model in a DeepSpeed engine and records a
# param -> name mapping that is used later when saving checkpoints
model, optimizer, data = accelerator.prepare(model, optimizer, data)
device_map = {}
# dispatching after prepare() is what triggers the failure below
model = accelerate.dispatch_model(model, device_map=device_map)
accelerator.save_state(save_path)
```
When I call accelerate.dispatch_model after accelerator.prepare, saving the state fails with the following error:
```
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 391, in
main()
File "/root/paddlejob/workspace/env_run/x-flux/train_flux_lora_deepspeed.py", line 311, in main
accelerator.save_state(save_path)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/accelerator.py", line 2952, in save_state
model.save_checkpoint(output_dir, ckpt_id, **save_model_func_kwargs)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3164, in save_checkpoint
self._save_checkpoint(save_dir,
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3365, in _save_checkpoint
frozen_param_shapes=self._get_zero_frozen_param_attributes(self._get_param_shape_func)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 3432, in _get_zero_frozen_param_attributes
raise ValueError(f"failed to find frozen {param} in named params")
ValueError: failed to find frozen Parameter containing:
tensor([[-0.0520, 0.0588, 0.0430, ..., -0.0391, 0.0083, -0.0104],
[-0.0320, 0.0408, 0.0112, ..., -0.0121, -0.0264, -0.0081],
[ 0.0055, 0.0493, 0.0488, ..., -0.0088, -0.0187, 0.0135],
...,
[-0.0957, 0.0220, 0.0087, ..., -0.0021, -0.0474, -0.0645],
[-0.0742, -0.0574, 0.0247, ..., -0.0396, 0.0041, 0.0233],
[-0.0225, 0.0225, 0.0299, ..., 0.0131, -0.0094, -0.0386]],
device='cuda:1', dtype=torch.bfloat16) in named params
E1011 15:00:14.591000 140520213088000 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 30158) of binary: /root/paddlejob/workspace/env_run/xflux_train_python3/bin/python3
Traceback (most recent call last):
File "/root/paddlejob/workspace/env_run/x-flux/../xflux_train_python3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 50, in main
args.func(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1091, in launch_command
deepspeed_launcher(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 787, in deepspeed_launcher
distrib_run.run(args)
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/paddlejob/workspace/env_run/xflux_train_python3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
train_flux_lora_deepspeed.py FAILED
Failures:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-10-11_15:00:14
host : yq01-sys-hic-k8s-v100-box-a225-0075.yq01.baidu.com
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 30158)
error_file:
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```
Reading the source code, I found the cause: when `accelerator.prepare` runs, DeepSpeed builds a mapping from each parameter object to its name:

```python
self.param_names = {param: name for name, param in model.named_parameters()}
```
But `accelerate.dispatch_model` replaces the model's parameter objects (their identities change), so when saving the checkpoint, looking up a parameter in `param_names` to find its name no longer succeeds.
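To illustrate the failure mode, here is a minimal, self-contained sketch (the `nn.Linear` module and the parameter swap are illustrative stand-ins, not the real FLUX model or dispatch logic): dictionary lookups on `Parameter` objects are identity-based, so once a parameter object is replaced by a new one holding the same values, the old mapping no longer finds it.

```python
from torch import nn

# Illustrative stand-in for the real model.
model = nn.Linear(4, 4)

# The same kind of identity-keyed mapping DeepSpeed builds at prepare() time.
param_names = {param: name for name, param in model.named_parameters()}

# Simulate what re-dispatching can do: replace a Parameter with a *new*
# object holding identical values.
model.weight = nn.Parameter(model.weight.detach().clone())

for name, param in model.named_parameters():
    # Tensor hashing is identity-based, so the swapped-in weight is no
    # longer found -- the same condition that triggers
    # "failed to find frozen ... in named params" in DeepSpeed.
    print(name, param in param_names)  # weight -> False, bias -> True
```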
Here is the failing function, from `deepspeed/runtime/engine.py`:

```python
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for param in self.module.parameters():
        if param.requires_grad:
            continue
        if param not in self.param_names:
            raise ValueError(f"failed to find frozen {param} in named params")
        name = self.param_names[param]
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments
```
Why can't this function be written as follows? It is more concise and would also avoid this problem:
```python
def _get_zero_frozen_param_attributes(self, attr_func):
    frozen_param_fragments = OrderedDict()
    for name, param in self.module.named_parameters():
        if param.requires_grad:
            continue
        frozen_param_fragments[name] = attr_func(param)
    return frozen_param_fragments
```
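Until this is fixed, a possible workaround is to rebuild the stale mapping after dispatching. This is an untested sketch, assuming `model` is the DeepSpeed engine returned by `accelerator.prepare` (so it exposes the `param_names` and `module` attributes seen in the source above):

```python
model = accelerate.dispatch_model(model, device_map=device_map)

# Hypothetical workaround: rebuild the identity-keyed mapping so it points
# at the parameter objects the model holds *after* dispatch_model ran.
model.param_names = {param: name for name, param in model.module.named_parameters()}

accelerator.save_state(save_path)
```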