huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Error when setting num_single_layers=0 while training flux-controlnet on a multi-GPU server using a single GPU #9630

Open wangherr opened 1 month ago

wangherr commented 1 month ago

Describe the bug

While training flux-controlnet on a multi-GPU server with training restricted to a single GPU, setting `num_single_layers=0` leads to an error:

[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75

Reproduction

accelerate launch --gpu_ids='0,' --num_processes=1 --num_machines=1 --main_process_port 28700 train_controlnet_flux.py \
  --pretrained_model_name_or_path="black-forest-labs/FLUX.1-schnell" \
  --dataset_name="lucataco/fill1k" \
  --conditioning_image_column=conditioning_image \
  --image_column=image \
  --caption_column=text \
  --output_dir="logs" \
  --mixed_precision="bf16" \
  --resolution=512 \
  --learning_rate=1e-5 \
  --max_train_steps=15000 \
  --validation_steps=100 \
  --checkpointing_steps=200 \
  --validation_image "./example_images/conditioning_image_1.png" "./example_images/conditioning_image_2.png" \
  --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --report_to="tensorboard" \
  --num_double_layers=2 \
  --num_single_layers=0 \
  --seed=42 \
  --enable_model_cpu_offload \
  --use_8bit_adam \
  --use_adafactor \
  --gradient_checkpointing

Logs

[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error

System Info

Who can help?

@sayakpaul

wangherr commented 1 month ago

I solved it by:

flux_controlnet.train()
if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)
...
# params_to_optimize = flux_controlnet.parameters()
params_to_optimize = [param for param in flux_controlnet.parameters() if param.requires_grad]

but I'm not sure whether my modifications are logically correct.
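
An alternative that the error message itself suggests is to enable unused-parameter detection in DDP instead of freezing those modules. With Accelerate this can be passed through DistributedDataParallelKwargs; a minimal sketch of where the Accelerator is created in train_controlnet_flux.py (other arguments omitted), not the script's actual code:

from accelerate import Accelerator
from accelerate.utils import DistributedAndDataParallelKwargs  # noqa: F401 (see correct import below)
from accelerate.utils import DistributedDataParallelKwargs

# Let DDP scan for parameters that receive no gradient each step
# (find_unused_parameters=True, as recommended in the error message).
# This adds per-iteration overhead, so it is a workaround rather than a fix.
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)
accelerator = Accelerator(
    mixed_precision="bf16",
    kwargs_handlers=[ddp_kwargs],
)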

sayakpaul commented 1 month ago

Cc: @PromeAIpro

RaccoonDML commented 1 month ago

I ran into the same problem.

RaccoonDML commented 1 month ago

I solved it by:

flux_controlnet.train()
if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)
...
# params_to_optimize = flux_controlnet.parameters()
params_to_optimize = [param for param in flux_controlnet.parameters() if param.requires_grad]

but I'm not sure whether my modifications are logically correct.

I wonder why you changed these two modules. And if the last transformer block's requires_grad is False, can the gradient still be backpropagated to the earlier layers? Thanks!

if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)

RaccoonDML commented 1 month ago

I solved it by using DeepSpeed with zero_stage: 2.
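
For reference, roughly the same setup expressed in Python through Accelerate's DeepSpeedPlugin; a sketch under the assumption that the plugin is passed where the Accelerator is created (the same thing can also be configured via accelerate config):

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# ZeRO stage 2 shards optimizer state and gradients, and DeepSpeed handles
# gradient reduction itself, which is presumably why the DDP
# unused-parameter error above no longer shows up with this setup.
ds_plugin = DeepSpeedPlugin(zero_stage=2, gradient_accumulation_steps=1)
accelerator = Accelerator(
    mixed_precision="bf16",
    deepspeed_plugin=ds_plugin,
)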

wangherr commented 1 month ago

I solved it by:

flux_controlnet.train()
if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)
...
# params_to_optimize = flux_controlnet.parameters()
params_to_optimize = [param for param in flux_controlnet.parameters() if param.requires_grad]

but I'm not sure whether my modifications are logically correct.

I wonder why you changed these two modules. And if the last transformer block's requires_grad is False, can the gradient still be backpropagated to the earlier layers? Thanks!

if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)

In the double block there are both text attn and image attn; I just removed the grad of the text attn.
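
The log above also suggests TORCH_DISTRIBUTED_DEBUG=DETAIL to print the names of the offending parameters. Another way to confirm which modules end up unused when num_single_layers=0 is to run one forward/backward without DDP and list the trainable parameters whose .grad is still None; a rough sketch (report_unused_parameters is a hypothetical helper, not part of the training script):

import torch

def report_unused_parameters(model: torch.nn.Module) -> None:
    # Call right after loss.backward() and before optimizer.zero_grad():
    # any trainable parameter whose .grad is still None never took part in
    # the loss, which is what DDP complains about in the log above.
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is None:
            print(f"no grad: {name}  shape={tuple(param.shape)}")

# usage inside the training loop, e.g.
#   loss.backward()
#   report_unused_parameters(flux_controlnet)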

Zheng-Fang-CH commented 3 weeks ago

Hi, did you run into a similar error when training controlnet_sd3?