huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

EMAModel causes InstructP2P parallel fine-tuning error #8514


liming-ai commented 1 month ago

Describe the bug

The official example script (https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) works well if I set args.pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5".

When I try to fine-tune InstructPix2Pix with args.pretrained_model_name_or_path="timbrooks/instruct-pix2pix", this error happens:

[2024-06-13 14:43:22,518] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 3111170) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
    args.func(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 1073, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.9/dist-packages/accelerate/commands/launch.py", line 718, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

I used pdb to step through the code line by line and found that the error is caused by the initialization of EMAModel:

if args.use_ema:
    ema_unet = EMAModel(unet.parameters(), model_cls=UNet2DConditionModel, model_config=unet.config)

I have no idea how to fix this issue since no specific Python error is raised; exitcode -11 suggests the child process died with a segmentation fault (SIGSEGV).
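
To isolate the failing line outside the accelerate launch, here is a minimal standalone sketch (my assumption: the same EMAModel import path and UNet subfolder as the training script):

from diffusers import UNet2DConditionModel
from diffusers.training_utils import EMAModel

# Load the UNet from the instruct-pix2pix checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "timbrooks/instruct-pix2pix", subfolder="unet"
)

# Mirrors the line in train_instruct_pix2pix.py that appears to trigger
# exitcode -11 under the multi-GPU launch.
ema_unet = EMAModel(
    unet.parameters(), model_cls=UNet2DConditionModel, model_config=unet.config
)
print("EMAModel constructed without crashing")

If this runs cleanly in a single process, the crash seems specific to the distributed launch.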

Reproduction

Train InstructPix2Pix with the official example script linked above, using a multi-GPU setting.
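
To mimic the multi-process launch without accelerate, a rough sketch (assuming torch.multiprocessing.spawn as a crude stand-in for the multi-GPU launcher; no process group or CUDA devices are initialized here):

import torch.multiprocessing as mp
from diffusers import UNet2DConditionModel
from diffusers.training_utils import EMAModel


def build_ema(rank: int):
    # Each spawned worker loads the UNet and constructs the EMA copy,
    # mirroring what every rank does in the training script.
    unet = UNet2DConditionModel.from_pretrained(
        "timbrooks/instruct-pix2pix", subfolder="unet"
    )
    EMAModel(unet.parameters(), model_cls=UNet2DConditionModel,
             model_config=unet.config)
    print(f"rank {rank}: EMAModel built", flush=True)


if __name__ == "__main__":
    mp.spawn(build_ema, nprocs=2)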

My accelerate config is:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU

downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Logs

No response

System Info

Who can help?

No response

sayakpaul commented 1 month ago

Does it happen on the latest versions of PyTorch (2.3, for example)?
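
For reference, a small sketch to print the relevant versions (assuming torch, diffusers, and accelerate are importable in the training environment; `diffusers-cli env` gives a fuller report):

import accelerate
import diffusers
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("diffusers:", diffusers.__version__)
print("accelerate:", accelerate.__version__)
print("visible GPUs:", torch.cuda.device_count())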