Closed rezkanas closed 2 months ago
It could be a dataloading problem, because I don't see this error when using the example from the README. I would suggest debugging the dataloader.
Hi again, I run the code with the 'cat' example from the README file and it works fine. I also adjusted to 'person' and it works.
The problem comes when you activate the --freeze_model crossattn
attribute as README file suggests when training on human faces
Can you try to reproduce it at yours? I am almost certain bug is related to '--freeze_model crossattn ', Currently, I run below code with no problem...
> '!accelerate launch train_custom_diffusion.py \
> --pretrained_model_name_or_path=$MODEL_NAME \
> --instance_data_dir=$INSTANCE_DIR \
> --output_dir=$OUTPUT_DIR \
> --class_data_dir=$class_data_dir \
> --with_prior_preservation --prior_loss_weight=1.0 \
> --class_prompt="person" --num_class_images=200 \
> --instance_prompt="photo of a <new1> person" \
> --resolution=512 \
> --train_batch_size=2 \
> --learning_rate=5e-6 \
> --lr_warmup_steps=0 \
> --max_train_steps=1200 \
> --scale_lr --hflip --noaug \
> --modifier_token "<new1>"\
> --enable_xformers_memory_efficient_attention '
when training above, I encountered a issue #4704 where I added --no_safe_serialization
attribute ... it works... but then I am encountering now a different error that I could not find in your bug resolution history....
when running this
!
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path=$MODEL_NAME \
--instance_data_dir=$INSTANCE_DIR \
--output_dir=$OUTPUT_DIR \
--class_data_dir=$class_data_dir \
--with_prior_preservation --prior_loss_weight=1.0 \
--class_prompt="person" --num_class_images=200 \
--instance_prompt="photo of a <new1> person" \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--lr_warmup_steps=0 \
--max_train_steps=1200 \
--scale_lr --hflip --noaug \
--no_safe_serialization\
--num_validation_images=5 \
--validation_prompt="<new1> person sitting in a bucket" \
--modifier_token "<new1>"\
--enable_xformers_memory_efficient_attention
the model finishes training and breaks only when loading the pipeline components
Loading pipeline components...: 29%|███▋ | 2/7 [00:06<00:17, 3.45s/it]Loaded tokenizer as CLIPTokenizer from
tokenizer
subfolder of CompVis/stable-diffusion-v1-4. Traceback (most recent call last): File "/home/anasrezklinux/.local/bin/accelerate", line 8, insys.exit(main()) File "/home/anasrezklinux/.local/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/home/anasrezklinux/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1017, in launch_command simple_launcher(args) File "/home/anasrezklinux/.local/lib/python3.10/site-packages/accelerate/commands/launch.py", line 637, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_custom_diffusion.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=/mnt/c/Users/noobw/PycharmProjects/pythonProject/Anas', '--output_dir=/mnt/c/Users/noobw/PycharmProjects/pythonProject/custom_diffusion_anas', '--class_data_dir=/mnt/c/Users/noobw/PycharmProjects/pythonProject/custom_diffusion_anas/class_prior', '--with_prior_preservation', '--prior_loss_weight=1.0', '--class_prompt=person', '--num_class_images=10', '--instance_prompt=photo of a person', '--resolution=512', '--train_batch_size=1', '--learning_rate=5e-6', '--lr_warmup_steps=0', '--max_train_steps=10', '--scale_lr', '--hflip', '--noaug', '--no_safe_serialization', '--num_validation_images=5', '--validation_prompt= cat sitting in a bucket', '--modifier_token', ' ', '--enable_xformers_memory_efficient_attention']' died with <Signals.SIGKILL: 9>.
The stack trace isn't informative enough. There is nothing there suggesting it's coming from the diffusers
codebase.
@rezkanas @sayakpaul Hi guys, did u address this error? I have the same issue
I resolve it as per https://github.com/huggingface/diffusers/issues/6879#issuecomment-1932818035 but I encounter another problem, https://github.com/huggingface/diffusers/issues/6879#issuecomment-1933015463 can you help @yiyixuxu @sayakpaul ??
Will try to reproduce.
I would still appreciate if I can train the model with the attribute --freeze_model crossattn
activate since I am training faces :)
Hi.
I just tried with the commands provided in the README of the example and I didn't run into any problems, whatsoever. I am on PyTorch 2.2, diffusers
installed from main
.
I have also Pytorch 2.2 and installed diffusers
from main
. do you run it with --freeze_model crossattn
activated?
Hi, I encountered a similar problem. I tried train_custom_diffusion.py
with the cat dataset, and I got an error RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
.
Reproduction:
accelerate launch train_custom_diffusion.py \
--pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
--instance_data_dir="data/cat" \
--output_dir="tutorial_output" \
--instance_prompt="photo of a <new1> cat" \
--resolution=512 \
--train_batch_size=1 \
--learning_rate=5e-6 \
--lr_warmup_steps=0 \
--max_train_steps=250 \
--scale_lr \
--hflip \
--modifier_token="<new1>" \
--no_safe_serialization \
--validation_prompt="<new1> cat sitting in a bucket"
Logs:
03/06/2024 15:18:04 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'thresholding', 'rescale_betas_zero_snr', 'timestep_spacing', 'dynamic_thresholding_ratio', 'variance_type', 'prediction_type', 'sample_max_value', 'clip_sample_range'} was not found in config. Values will be initialized to default values.
{'force_upcast', 'norm_num_groups', 'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values.
/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/diffusers/models/lora.py:300: FutureWarning: `LoRACompatibleConv` is deprecated and will be removed in version 1.0.0. Use of `LoRACompatibleConv` is deprecated. Please switch to PEFT backend by installing PEFT: `pip install peft`.
deprecate("LoRACompatibleConv", "1.0.0", deprecation_message)
/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/diffusers/models/lora.py:387: FutureWarning: `LoRACompatibleLinear` is deprecated and will be removed in version 1.0.0. Use of `LoRACompatibleLinear` is deprecated. Please switch to PEFT backend by installing PEFT: `pip install peft`.
deprecate("LoRACompatibleLinear", "1.0.0", deprecation_message)
{'transformer_layers_per_block', 'time_embedding_dim', 'resnet_skip_time_act', 'time_embedding_act_fn', 'resnet_out_scale_factor', 'timestep_post_act', 'projection_class_embeddings_input_dim', 'dual_cross_attention', 'attention_type', 'time_embedding_type', 'conv_out_kernel', 'encoder_hid_dim', 'resnet_time_scale_shift', 'class_embeddings_concat', 'addition_time_embed_dim', 'dropout', 'addition_embed_type', 'mid_block_type', 'use_linear_projection', 'class_embed_type', 'cross_attention_norm', 'num_class_embeds', 'reverse_transformer_layers_per_block', 'addition_embed_type_num_heads', 'num_attention_heads', 'upcast_attention', 'time_cond_proj_dim', 'mid_block_only_cross_attention', 'only_cross_attention', 'conv_in_kernel', 'encoder_hid_dim_type'} was not found in config. Values will be initialized to default values.
[42170]
03/06/2024 15:18:12 - INFO - __main__ - ***** Running training *****
03/06/2024 15:18:12 - INFO - __main__ - Num examples = 5
03/06/2024 15:18:12 - INFO - __main__ - Num batches each epoch = 5
03/06/2024 15:18:12 - INFO - __main__ - Num Epochs = 50
03/06/2024 15:18:12 - INFO - __main__ - Instantaneous batch size per device = 1
03/06/2024 15:18:12 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
03/06/2024 15:18:12 - INFO - __main__ - Gradient Accumulation steps = 1
03/06/2024 15:18:12 - INFO - __main__ - Total optimization steps = 250
Steps: 20%|██████████████████████████▍ | 50/250 [00:18<00:58, 3.44it/s, loss=0.264, lr=5e-6]03/06/2024 15:18:30 - INFO - __main__ - Running validation...
Generating 2 images with prompt: <new1> cat sitting in a bucket.
{'image_encoder', 'requires_safety_checker'} was not found in config. Values will be initialized to default values.
{'force_upcast', 'norm_num_groups', 'latents_std', 'latents_mean'} was not found in config. Values will be initialized to default values. | 0/7 [00:00<?, ?it/s]
Loaded vae as AutoencoderKL from `vae` subfolder of CompVis/stable-diffusion-v1-4.
{'timestep_spacing', 'prediction_type'} was not found in config. Values will be initialized to default values. | 1/7 [00:00<00:00, 6.58it/s]
Loaded scheduler as PNDMScheduler from `scheduler` subfolder of CompVis/stable-diffusion-v1-4.
Loaded feature_extractor as CLIPImageProcessor from `feature_extractor` subfolder of CompVis/stable-diffusion-v1-4.
Loaded safety_checker as StableDiffusionSafetyChecker from `safety_checker` subfolder of CompVis/stable-diffusion-v1-4.
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 11.08it/s]
{'thresholding', 'rescale_betas_zero_snr', 'solver_order', 'timestep_spacing', 'final_sigmas_type', 'use_lu_lambdas', 'dynamic_thresholding_ratio', 'algorithm_type', 'variance_type', 'lambda_min_clipped', 'prediction_type', 'sample_max_value', 'solver_type', 'use_karras_sigmas', 'euler_at_final', 'lower_order_final'} was not found in config. Values will be initialized to default values.
Traceback (most recent call last):
File "/home/shinnosuke.ono/scripts/train_custom_diffusion.py", line 1355, in <module>
main(args)
File "/home/shinnosuke.ono/scripts/train_custom_diffusion.py", line 1255, in main
images = [
File "/home/shinnosuke.ono/scripts/train_custom_diffusion.py", line 1256, in <listcomp>
pipeline(args.validation_prompt, num_inference_steps=25, generator=generator, eta=1.0).images[
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 971, in __call__
noise_pred = self.unet(
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_condition.py", line 1152, in forward
emb = self.time_embedding(t_emb, timestep_cond)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/diffusers/models/embeddings.py", line 228, in forward
sample = self.linear_1(sample)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/diffusers/models/lora.py", line 447, in forward
out = super().forward(hidden_states)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 116, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
Steps: 20%|██████████████████████████▍ | 50/250 [00:20<01:22, 2.44it/s, loss=0.264, lr=5e-6]
Traceback (most recent call last):
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1023, in launch_command
simple_launcher(args)
File "/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/lib/python3.10/site-packages/accelerate/commands/launch.py", line 643, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/shinnosuke.ono/miniforge3/envs/custom-diffusion/bin/python3.10', 'train_custom_diffusion.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--instance_data_dir=../data/cat', '--output_dir=tutorial_output', '--instance_prompt=photo of a <new1> cat', '--resolution=512', '--train_batch_size=1', '--learning_rate=5e-6', '--lr_warmup_steps=0', '--max_train_steps=250', '--scale_lr', '--hflip', '--modifier_token=<new1>', '--no_safe_serialization', '--validation_prompt=<new1> cat sitting in a bucket']' returned non-zero exit status 1.
System Info:
platform: #101~20.04.1-Ubuntu SMP Thu Nov 16 14:22:28 UTC 2023
Python: 3.10.13
pytorch (GPU?): 2.2.0+cu121(True)
accelerate: 0.27.2
diffusers: 0.27.0 dev0
huggingface_hub: 0.21.3
transformers: 4.38.2
xformers: not installed
What I tried:
--freeze_model="crossattn"
to the arguments, I got an error RuntimeError: expected mat1 and mat2 to have the same dtype, but got: c10::Half != float
, which is the same one as in this issue.--validation_prompts=...
from the arguments, the script finished successfully, but I am concerned with the deterioration of performance if the model is learned in this way.Cc: @nupurkmr9
@rezkanas @shinnosukeono, sorry for the delayed response. I believe the error might be because of float16 training. Without it I am able to train with --freeze_model="crossattn"
as well.
I will see if I can update the code to support float16 with full crossattn fine-tuning as well. In the meantime disabling that should work.
Thanks.
@nupurkmr9 Thank you so much for your reply! I changed mixed_precision: 'fp16'
to mixed_precision: 'no'
in the configuration of the accelerate module, and I got to train successfully with the validation prompt. I hope this also helps @rezkanas
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
gentle pin @nupurkmr9 are we able to support float16 with full crossattn fine-tuning as well? if not, let's maybe remove that option from the script
Describe the bug
when running custom diffusion on my 20 photos repositories ... I run into this error that is related to data type difference..
Reproduction
!accelerate launch train_custom_diffusion.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ --output_dir=$OUTPUT_DIR \ --class_data_dir=$class_data_dir \ --with_prior_preservation \ --prior_loss_weight=1.0 \ --class_prompt="person" \ --num_class_images=200 \ --instance_prompt="photo of a person" \
--resolution=512 \
--train_batch_size=2 \
--learning_rate=5e-6 \
--lr_warmup_steps=0 \
--max_train_steps=1200 \
--freeze_model=crossattn \
--scale_lr \
--hflip \
--use_8bit_adam \
--gradient_checkpointing \
--enable_xformers_memory_efficient_attention \
--modifier_token "" \
--validation_prompt=" person sitting in a bucket"
Logs
System Info
diffusers
version: 0.26.1Who can help?
@sayakpaul @patrickvonplaten