tianweiy / DMD2

(NeurIPS 2024 Oral 🔥) Improved Distribution Matching Distillation for Fast Image Synthesis

SD v1.5 training with torchrun succeeds, but SDXL training with FSDP fails #18

Open conquer-pan opened 6 months ago

conquer-pan commented 6 months ago

I successfully trained SD v1.5, which uses torchrun. But SDXL, which uses FSDP, reports an error.

Error:

```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26827 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26828 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 26830 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 26829) of binary: /home/work/miniconda3/envs/dmd2/bin/python
```
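As a side note (not from the original report): a negative exit code from `torch.distributed.elastic` means the worker was killed by a signal, and `-9` corresponds to `SIGKILL`, which on Linux is most often delivered by the kernel OOM killer when the process runs out of memory. A quick way to decode such exit codes:

```python
import signal

# Exit code as reported by torch.distributed.elastic for the failed rank
exitcode = -9

# A negative exit code is the negated signal number that killed the process
sig = signal.Signals(-exitcode)
print(sig.name)  # SIGKILL
```

If it is indeed `SIGKILL`, checking `dmesg` for OOM-killer messages on the training machine is a reasonable next step.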

My `fsdp_sdxl.yaml` is:

```yaml
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_min_num_params: 3000
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: 0
main_process_ip: 10.38.16.110
main_process_port: 2345
main_training_function: main
mixed_precision: 'fp16'
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

Please tell me what went wrong.