Error when setting num_single_layers=0 while training flux-controlnet on a multi-GPU server using a single GPU

wangherr commented 1 month ago

Describe the bug

While training flux-controlnet on a multi-GPU server and restricting training to a single GPU, setting num_single_layers=0 leads to an error:

[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75

Reproduction

accelerate launch --gpu_ids='0,' --num_processes=1 --num_machines=1 --main_process_port 28700 train_controlnet_flux.py \
    --pretrained_model_name_or_path="black-forest-labs/FLUX.1-schnell" \
    --dataset_name="lucataco/fill1k" \
    --conditioning_image_column=conditioning_image \
    --image_column=image \
    --caption_column=text \
    --output_dir="logs" \
    --mixed_precision="bf16" \
    --resolution=512 \
    --learning_rate=1e-5 \
    --max_train_steps=15000 \
    --validation_steps=100 \
    --checkpointing_steps=200 \
    --validation_image "./example_images/conditioning_image_1.png" "./example_images/conditioning_image_2.png" \
    --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \
    --train_batch_size=1 \
    --gradient_accumulation_steps=1 \
    --report_to="tensorboard" \
    --num_double_layers=2 \
    --num_single_layers=0 \
    --seed=42 \
    --enable_model_cpu_offload \
    --use_8bit_adam \
    --use_adafactor \
    --gradient_checkpointing

Logs

[rank0]: RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
[rank0]: making sure all `forward` function outputs participate in calculating loss.
[rank0]: If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
[rank0]: Parameter indices which did not receive grad for rank 0: 64 65 72 73 74 75
[rank0]:  In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
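
The log only reports parameter indices. As a rough way to turn them into names, the sketch below lists trainable parameters whose .grad is still None after one backward pass. It uses a hypothetical ToyModel rather than the real controlnet, and it assumes the reported indices follow the order of trainable parameters in named_parameters(), which I have not verified against the DDP internals.

```python
import torch
from torch import nn


def params_without_grad(model: nn.Module) -> list[tuple[int, str]]:
    """Return (index, name) for trainable parameters whose .grad is still None.

    Call this after loss.backward(). The index counts trainable parameters in
    named_parameters() order (assumed, not verified, to match DDP's numbering).
    """
    trainable = [(n, p) for n, p in model.named_parameters() if p.requires_grad]
    return [(i, n) for i, (n, p) in enumerate(trainable) if p.grad is None]


class ToyModel(nn.Module):
    """Hypothetical stand-in: `unused` never contributes to the output."""

    def __init__(self) -> None:
        super().__init__()
        self.used = nn.Linear(4, 4)
        self.unused = nn.Linear(4, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.used(x)


model = ToyModel()
model(torch.randn(2, 4)).sum().backward()

missing = params_without_grad(model)
print(missing)  # [(2, 'unused.weight'), (3, 'unused.bias')]
```

Running the same helper on the unwrapped flux_controlnet after a single training step (without DDP) should point at the modules that never reach the loss.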

System Info

🤗 Diffusers version: 0.31.0.dev0
Platform: Linux-5.14.0-427.33.1.el9_4.x86_64-x86_64-with-glibc2.34
Running on Google Colab?: No
Python version: 3.12.4
PyTorch version (GPU?): 2.4.1+cu121 (True)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Huggingface_hub version: 0.24.7
Transformers version: 4.45.0
Accelerate version: 0.33.0
PEFT version: 0.12.0
Bitsandbytes version: 0.44.1
Safetensors version: 0.4.4
xFormers version: 0.0.28
Accelerator: NVIDIA RTX A6000, 49140 MiB NVIDIA RTX A6000, 49140 MiB NVIDIA RTX A6000, 49140 MiB
Using GPU in script?:
Using distributed or parallel set-up in script?: Yes

Who can help?

@sayakpaul

wangherr commented 1 month ago

I solved it by:

flux_controlnet.train()
if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)
...
# params_to_optimize = flux_controlnet.parameters()
params_to_optimize = [param for param in flux_controlnet.parameters() if param.requires_grad]

However, I am not sure whether my modifications are logically correct.

sayakpaul commented 1 month ago

Cc: @PromeAIpro

RaccoonDML commented 1 month ago

I ran into the same problem.

RaccoonDML commented 1 month ago

I wonder why you changed these two modules. If requires_grad is False for the last transformer block, can the gradient still propagate back to the earlier layers? Thanks!

if args.num_single_layers == 0:
    flux_controlnet.transformer_blocks[-1].attn.to_add_out.requires_grad_(False)
    flux_controlnet.transformer_blocks[-1].ff_context.requires_grad_(False)

RaccoonDML commented 1 month ago

I solved it by using DeepSpeed with zero_stage: 2.

wangherr commented 1 month ago

Each double block has a text attn branch and an image attn branch. I only removed the gradients of the text attn branch.
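
To check the question about gradient flow, here is a minimal sketch with a hypothetical two-stream ToyDoubleBlock (the names mix, img_out and txt_out are made up and much simpler than the real Flux blocks). It assumes that, with no single blocks, only the image stream of the last block reaches the loss. Freezing the text-stream output of the last block then leaves no trainable parameter without a gradient, and the earlier block still receives gradients through the image stream.

```python
import torch
from torch import nn


class ToyDoubleBlock(nn.Module):
    """Hypothetical two-stream block: joint mixing, then one output layer per stream."""

    def __init__(self, dim: int = 8) -> None:
        super().__init__()
        self.mix = nn.Linear(2 * dim, 2 * dim)  # shared by both streams
        self.img_out = nn.Linear(dim, dim)      # image-stream output
        self.txt_out = nn.Linear(dim, dim)      # text-stream output

    def forward(self, img: torch.Tensor, txt: torch.Tensor):
        mixed = self.mix(torch.cat([img, txt], dim=-1))
        img_mixed, txt_mixed = mixed.chunk(2, dim=-1)
        return self.img_out(img_mixed), self.txt_out(txt_mixed)


torch.manual_seed(0)
blocks = nn.ModuleList([ToyDoubleBlock(), ToyDoubleBlock()])

# Freeze the text-stream output of the last block, as in the workaround.
blocks[-1].txt_out.requires_grad_(False)

img, txt = torch.randn(2, 8), torch.randn(2, 8)
for block in blocks:
    img, txt = block(img, txt)

# Assumption: with no single blocks, only the image stream reaches the loss.
img.sum().backward()

trainable = [(n, p) for n, p in blocks.named_parameters() if p.requires_grad]
no_grad = [n for n, p in trainable if p.grad is None]
print(no_grad)  # [] -> every remaining trainable parameter received a gradient

# Pass only trainable parameters to the optimizer, as in the workaround.
params_to_optimize = [p for _, p in trainable]
```

In this toy setup the first block's txt_out still gets a gradient because its output feeds the joint mixing of the next block. Whether the same holds for every layer of the real model would need to be confirmed on flux_controlnet itself.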

Zheng-Fang-CH commented 3 weeks ago

Hi, did you encounter a similar error when training controlnet_sd3?
