huggingface / diffusers

πŸ€— Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Dreambooth Flux training error: RuntimeError: mat2 must be a matrix, got 1-D tensor #9497

Open · kopyl opened 2 months ago

kopyl commented 2 months ago

Describe the bug

I run the training but get the error below.

Reproduction

Run `accelerate config` with the following settings:

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: true
  fsdp_sharding_strategy: HYBRID_SHARD_ZERO2
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
!git clone https://github.com/huggingface/diffusers
%cd diffusers

!pip install -e .
!pip install -r examples/dreambooth/requirements_flux.txt
!pip install prodigyopt

import huggingface_hub
huggingface_hub.notebook_login()

MODEL_NAME="black-forest-labs/FLUX.1-dev"
INSTANCE_DIR="/dreambooth-datasets/yaremovaa"
OUTPUT_DIR="/flux-dreambooth-outputs/dreamboot-yaremovaa"

!accelerate launch examples/dreambooth/train_dreambooth_flux.py \
  --pretrained_model_name_or_path={MODEL_NAME}  \
  --instance_data_dir={INSTANCE_DIR} \
  --output_dir={OUTPUT_DIR} \
  --mixed_precision="bf16" \
  --instance_prompt="a photo of sks girl" \
  --resolution=512 \
  --train_batch_size=1 \
  --guidance_scale=1 \
  --gradient_accumulation_steps=4 \
  --optimizer="prodigy" \
  --learning_rate=1. \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --validation_prompt="A photo of sks girl posing in a photo studio" \
  --validation_epochs=25 \
  --seed="0"

Logs

09/23/2024 13:15:53 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 0
Local process index: 0
Device: cuda:0

Mixed precision type: bf16

You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
09/23/2024 13:15:53 - INFO - __main__ - Distributed environment: FSDP  Backend: nccl
Num processes: 2
Process index: 1
Local process index: 1
Device: cuda:1

Mixed precision type: bf16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type t5 to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
Downloading shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 18436.50it/s]
Downloading shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 16946.68it/s]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 12.08it/s]
Fetching 3 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 10433.59it/s]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:02<00:00,  1.06s/it]
Fetching 3 files: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 3/3 [00:00<00:00, 10903.74it/s]
{'axes_dims_rope'} was not found in config. Values will be initialized to default values.
Using decoupled weight decay
Using decoupled weight decay
09/23/2024 13:16:15 - INFO - __main__ - ***** Running training *****
09/23/2024 13:16:15 - INFO - __main__ -   Num examples = 10
09/23/2024 13:16:15 - INFO - __main__ -   Num batches each epoch = 5
09/23/2024 13:16:15 - INFO - __main__ -   Num Epochs = 500
09/23/2024 13:16:15 - INFO - __main__ -   Instantaneous batch size per device = 1
09/23/2024 13:16:15 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 8
09/23/2024 13:16:15 - INFO - __main__ -   Gradient Accumulation steps = 4
09/23/2024 13:16:15 - INFO - __main__ -   Total optimization steps = 1000
Steps:   0%|                                           | 0/1000 [00:00<?, ?it/s]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/usr/local/lib/python3.8/dist-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
Steps:   0%|                           | 0/1000 [00:29<?, ?it/s, loss=0.4, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                         | 0/1000 [00:31<?, ?it/s, loss=0.416, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                         | 0/1000 [00:32<?, ?it/s, loss=0.327, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|              | 1/1000 [01:30<25:01:16, 90.17s/it, loss=0.592, lr=1]Passing `txt_ids` 3d torch.Tensor is deprecated.Please remove the batch dimension and pass it as a 2d torch Tensor
Steps:   0%|                | 2/1000 [02:31<20:13:45, 72.97s/it, loss=nan, lr=1]
Downloading shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:00<00:00, 20068.44it/s]

Loading checkpoint shards:   0%|                          | 0/2 [00:00<?, ?it/s]
Loading checkpoint shards:  50%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ         | 1/2 [00:02<00:02,  2.53s/it]
Loading checkpoint shards: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 2/2 [00:04<00:00,  2.43s/it]

Loading pipeline components...:   0%|                     | 0/7 [00:00<?, ?it/s]Loaded tokenizer_2 as T5TokenizerFast from `tokenizer_2` subfolder of black-forest-labs/FLUX.1-dev.

Loading pipeline components...:  43%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Œ       | 3/7 [00:00<00:00, 18.63it/s]Loaded scheduler as FlowMatchEulerDiscreteScheduler from `scheduler` subfolder of black-forest-labs/FLUX.1-dev.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of black-forest-labs/FLUX.1-dev.
Loading pipeline components...: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7/7 [00:00<00:00, 34.91it/s]
09/23/2024 13:18:51 - INFO - __main__ - Running validation... 
 Generating 4 images with prompt: A photo of sks girl posing in a photo studio.
[rank0]: Traceback (most recent call last):
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1791, in <module>
[rank0]:     main(args)
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 1715, in main
[rank0]:     images = log_validation(
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 173, in log_validation
[rank0]:     images = [pipeline(**pipeline_args, generator=generator).images[0] for _ in range(args.num_validation_images)]
[rank0]:   File "examples/dreambooth/train_dreambooth_flux.py", line 173, in <listcomp>
[rank0]:     images = [pipeline(**pipeline_args, generator=generator).images[0] for _ in range(args.num_validation_images)]
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/diffusers/src/diffusers/pipelines/flux/pipeline_flux.py", line 719, in __call__
[rank0]:     noise_pred = self.transformer(
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 820, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/accelerate/utils/operations.py", line 808, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/diffusers/src/diffusers/models/transformers/transformer_flux.py", line 442, in forward
[rank0]:     hidden_states = self.x_embedder(hidden_states)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/linear.py", line 117, in forward
[rank0]:     return F.linear(input, self.weight, self.bias)
[rank0]: RuntimeError: mat2 must be a matrix, got 1-D tensor
Steps:   0%|                | 2/1000 [02:57<24:35:32, 88.71s/it, loss=nan, lr=1]
[rank0]:[W923 13:19:13.356135687 ProcessGroupNCCL.cpp:1168] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
W0923 13:19:21.013317 140632834377536 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 13534 closing signal SIGTERM
E0923 13:19:28.343540 140632834377536 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 13533) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1161, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 799, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/dreambooth/train_dreambooth_flux.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-09-23_13:19:21
  host      : x2-h100.internal.cloudapp.net
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 13533)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

System Info

Ubuntu 20.04
2x NVIDIA H100
CUDA 12.2
torch==2.4.1
torchvision==0.19.1
Diffusers commit: https://github.com/huggingface/diffusers/commit/ba5af5aebbac0cc18168076a18836f175753d1c7

Who can help?

No response

kopyl commented 2 months ago

If you remove these args:

--validation_prompt
--validation_epochs

Then the training does not stop with this error.

a-r-r-o-w commented 2 months ago

Do you mean to say that performing validation in dreambooth-flux training does not work with the current scripts?

kopyl commented 2 months ago

@a-r-r-o-w yep. The problem is that the model can't be loaded for inference.
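
For reference, the failing call in the traceback is `F.linear`, and the message is what you get when the weight passed in is still a flat 1-D tensor instead of a 2-D matrix. A minimal, standalone illustration of that failure mode (not taken from the training script; the shapes here are made up):

import torch
import torch.nn.functional as F

x = torch.randn(2, 64)               # dummy batch of inputs
bias = torch.zeros(128)              # dummy bias of a 64 -> 128 linear layer
flat_weight = torch.randn(64 * 128)  # weight kept as a flattened 1-D tensor

# F.linear expects a 2-D weight of shape (out_features, in_features).
# With a 1-D weight this should fail with:
#   RuntimeError: mat2 must be a matrix, got 1-D tensor
F.linear(x, flat_weight, bias)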

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

weixiong-ur commented 4 weeks ago

I wonder if there is any update on this issue? I also hit it with the latest diffusers 0.32. It seems that after wrapping the model with FSDP, the shapes of some of the parameters are changed. I also wonder if this is related to this issue: https://github.com/huggingface/transformers/issues/30228 @kopyl Any workaround for this issue?
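
If the cause is indeed that validation runs while the transformer is still holding FSDP's flattened (1-D) parameters, one possible direction, just an untested sketch rather than anything the script currently does (`transformer`, `pipeline`, and the argument names below are placeholders), would be to gather the full parameters around the validation call with FSDP.summon_full_params:

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def validate_with_full_params(transformer, pipeline, pipeline_args, generator, num_images):
    # Temporarily unshard the flat parameters so that nn.Linear weights are
    # 2-D again inside this context; writeback=False since we only read them.
    with FSDP.summon_full_params(transformer, writeback=False):
        with torch.no_grad():
            return [
                pipeline(**pipeline_args, generator=generator).images[0]
                for _ in range(num_images)
            ]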

yiyixuxu commented 3 weeks ago

cc @linoytsaban @sayakpaul

sayakpaul commented 3 weeks ago

I think this should be solved now. Can you try with the recent versions and make sure diffusers is installed from main?

Also, maybe try changing your accelerate config to use a single GPU (num_processes: 1)?

kopyl commented 3 weeks ago

@weixiong-ur I did not find a solution and decided to switch to kohya-ss sd-scripts for training, which pretty much works wonders.

@sayakpaul the thing is that I had to use 2x GPUs to fit everything into memory...