[2024-06-13 14:43:22,518] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 3111170) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
I stepped through the code line by line with pdb and found that the crash happens during the initialization of EMAModel:
if args.use_ema:
    ema_unet = EMAModel(unet.parameters(), model_cls=UNet2DConditionModel, model_config=unet.config)
I don't know how to fix this, since the worker exits with code -11 and no specific error message.
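A possible way to get more detail, offered as a sketch rather than a confirmed fix: a negative exit code conventionally means the worker was killed by a signal, and signal 11 is SIGSEGV on Linux, which would explain why no Python exception is printed. The standard-library `faulthandler` module can dump the Python stack when such a fatal signal arrives. The helper names `describe_exit_code` and `enable_crash_traceback` below are hypothetical and are not part of diffusers, accelerate, or torch.

```python
import faulthandler
import signal
import sys


def describe_exit_code(exitcode: int) -> str:
    """Translate a subprocess-style exit code into a readable description."""
    if exitcode < 0:
        # By convention, a negative code means "terminated by signal -exitcode".
        try:
            return f"killed by {signal.Signals(-exitcode).name}"
        except ValueError:
            return f"killed by unknown signal {-exitcode}"
    return f"exited with status {exitcode}"


def enable_crash_traceback() -> None:
    """Dump the Python stack of all threads to stderr on a fatal signal.

    Hypothetical helper: call it once, as early as possible in the
    training script, before the model and EMAModel are created.
    """
    faulthandler.enable(file=sys.stderr, all_threads=True)


if __name__ == "__main__":
    # Decode the exit code reported in the log above.
    print(describe_exit_code(-11))
```

Setting the environment variable `PYTHONFAULTHANDLER=1` before launching has the same effect as calling `faulthandler.enable()` and needs no change to the training script. The resulting dump only shows where in the Python code the crash occurred; it does not by itself identify the underlying native fault.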
Describe the bug
[The official given example] works well if I set args.pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5". When I instead try to fine-tune instructpix2pix with args.pretrained_model_name_or_path="timbrooks/instruct-pix2pix" (https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py), training fails with the ChildFailedError traceback shown at the top of this report.
Reproduction
Train instructpix2pix with [the official given example] in a multi-GPU setting.
My accelerate config is:
Logs
No response
System Info
Who can help?
No response