huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

EMAModel causes InstructP2P parallel fine-tuning error #8514


liming-ai commented 1 month ago

Describe the bug

The official example script (https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) works well if I set args.pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5".

When I try to fine-tune InstructPix2Pix with args.pretrained_model_name_or_path="timbrooks/instruct-pix2pix", this error happens:

[2024-06-13 14:43:22,518] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 3111170) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I used pdb to step through the code line by line and found that the error is caused by the initialization of EMAModel:

if args.use_ema:
    ema_unet = EMAModel(unet.parameters(), model_cls=UNet2DConditionModel, model_config=unet.config)

I have no idea how to fix this issue since no specific Python error is raised; exitcode -11 suggests the child process died with a segmentation fault (SIGSEGV).
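
To isolate the failing line outside the accelerate launch, here is a minimal standalone sketch (my assumption: the same EMAModel import path and UNet subfolder as the training script):

from diffusers import UNet2DConditionModel
from diffusers.training_utils import EMAModel

# Load the UNet from the instruct-pix2pix checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "timbrooks/instruct-pix2pix", subfolder="unet"
)

# Mirrors the line in train_instruct_pix2pix.py that appears to trigger
# exitcode -11 under the multi-GPU launch.
ema_unet = EMAModel(
    unet.parameters(), model_cls=UNet2DConditionModel, model_config=unet.config
)
print("EMAModel constructed without crashing")

If this runs cleanly in a single process, the crash seems specific to the distributed launch.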

Reproduction

Train InstructPix2Pix with the official example script linked above, using a multi-GPU setting.
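
To mimic the multi-process launch without accelerate, a rough sketch (assuming torch.multiprocessing.spawn as a crude stand-in for the multi-GPU launcher; no process group or CUDA devices are initialized here):

import torch.multiprocessing as mp
from diffusers import UNet2DConditionModel
from diffusers.training_utils import EMAModel


def build_ema(rank: int):
    # Each spawned worker loads the UNet and constructs the EMA copy,
    # mirroring what every rank does in the training script.
    unet = UNet2DConditionModel.from_pretrained(
        "timbrooks/instruct-pix2pix", subfolder="unet"
    )
    EMAModel(unet.parameters(), model_cls=UNet2DConditionModel,
             model_config=unet.config)
    print(f"rank {rank}: EMAModel built", flush=True)


if __name__ == "__main__":
    mp.spawn(build_ema, nprocs=2)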

My accelerate config is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU

downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Logs

No response

System Info

Who can help?

No response

sayakpaul commented 1 month ago

Does it happen on the latest versions of PyTorch (2.3, for example)?
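
For reference, a small sketch to print the relevant versions (assuming torch, diffusers, and accelerate are importable in the training environment; `diffusers-cli env` gives a fuller report):

import accelerate
import diffusers
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("diffusers:", diffusers.__version__)
print("accelerate:", accelerate.__version__)
print("visible GPUs:", torch.cuda.device_count())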