Advanced training SD1.5 has an issue when saving checkpoints

josemerinom commented 4 months ago

Describe the bug

Today I trained with examples/dreambooth/train_dreambooth_lora.py in Google Colab, and everything worked.

I then tried examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py with the original Stable Diffusion 1.5 model (which I cloned to my HF account), but an error is raised when the script tries to save a checkpoint.

dataset = 10 images

checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>

A different error occurs when I set checkpointing_steps to a value different from the number of images: checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope
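For context, the schedule reported in the logs (10 batches per epoch, 10 epochs, 100 optimization steps) can be recomputed from the run settings. This is a minimal sketch that assumes a single process; the function name `schedule` is made up for illustration and is not part of the training script.

```python
import math


def schedule(num_examples, train_batch_size, grad_accum_steps, max_train_steps):
    """Return (optimizer steps per epoch, number of epochs) for a single-process run."""
    batches_per_epoch = math.ceil(num_examples / train_batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    num_epochs = math.ceil(max_train_steps / steps_per_epoch)
    return steps_per_epoch, num_epochs


# Values taken from the report: 10 images, batch size 1,
# gradient accumulation 1, 100 optimization steps in total.
steps_per_epoch, num_epochs = schedule(
    num_examples=10, train_batch_size=1, grad_accum_steps=1, max_train_steps=100
)
print(f"steps per epoch: {steps_per_epoch}, epochs: {num_epochs}")
```

With these values the first epoch ends at step 10. With checkpointing_steps=10 the first save also falls on step 10, while with checkpointing_steps=20 no save has happened yet at that point. That may be why the two settings surface different errors at the same step, although nothing in the report confirms it.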


Reproduction

%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
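Since the setup pins several package versions, it can help to confirm that the runtime actually picked them up before launching training. The sketch below uses only the standard library; the helper name `check_pins` is hypothetical, and the pins are copied from the pip commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands in the reproduction steps.
PINNED = {
    "accelerate": "0.31.0",
    "datasets": "2.19.0",
    "ftfy": "6.2.0",
    "Jinja2": "3.1.4",
    "peft": "0.11.1",
    "tensorboard": "2.15.2",
    "torchvision": "0.18.0+cu121",
    "transformers": "4.42.3",
}


def check_pins(pinned):
    """Map each package name to (expected, installed); installed is None if missing."""
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed)
    return report


for name, (expected, installed) in check_pins(PINNED).items():
    status = "ok" if installed == expected else "MISMATCH"
    print(f"{name}: expected {expected}, installed {installed} -> {status}")
```

A mismatch here does not by itself explain the errors in this report; it only rules out one source of difference between runs.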

https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb

Logs 1 (checkpointing_steps=10)

06/30/2024 01:08:45 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'prediction_type', 'variance_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'thresholding', 'timestep_spacing', 'rescale_betas_zero_snr', 'sample_max_value'} was not found in config. Values will be initialized to default values.
{'use_post_quant_conv', 'force_upcast', 'use_quant_conv', 'latents_std', 'scaling_factor', 'shift_factor', 'latents_mean'} was not found in config. Values will be initialized to default values.
{'num_class_embeds', 'encoder_hid_dim', 'projection_class_embeddings_input_dim', 'time_embedding_act_fn', 'use_linear_projection', 'resnet_skip_time_act', 'mid_block_only_cross_attention', 'dual_cross_attention', 'attention_type', 'time_cond_proj_dim', 'addition_embed_type_num_heads', 'time_embedding_type', 'conv_out_kernel', 'reverse_transformer_layers_per_block', 'class_embeddings_concat', 'resnet_time_scale_shift', 'class_embed_type', 'transformer_layers_per_block', 'encoder_hid_dim_type', 'conv_in_kernel', 'only_cross_attention', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'cross_attention_norm', 'addition_embed_type', 'time_embedding_dim', 'mid_block_type', 'dropout', 'num_attention_heads', 'timestep_post_act', 'upcast_attention'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
06/30/2024 01:09:00 - INFO - __main__ -   Num examples = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:09:00 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:09:00 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:09:00 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:09:00 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:07<00:50,  1.80it/s, loss=0.00439, lr=0.0001]06/30/2024 01:09:07 - INFO - accelerate.accelerator - Saving current state to /content/drive/MyDrive/train/checkpoint-10
/usr/local/lib/python3.10/dist-packages/peft/utils/save_and_load.py:195: UserWarning: Could not find a config file in /content/drive/MyDrive/zero/zero15 - will assume that the vocabulary was not modified.
  warnings.warn(
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1791, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:07<01:11,  1.25it/s, loss=0.00439, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
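
The traceback above shows accelerator.save_state calling a save hook with the list of models, and the hook raising on a CLIPTextModel it did not expect. As a rough illustration only, here is how a hook that dispatches on model type can raise this error when one of the models has no matching branch. The stand-in classes and both function names are hypothetical; this is not the script's code, and the actual cause in the script is not shown in this thread.

```python
class UNetStandIn:
    """Hypothetical stand-in for the UNet; not a diffusers class."""


class TextEncoderStandIn:
    """Hypothetical stand-in for CLIPTextModel; not a transformers class."""


def save_hook_unet_only(models):
    """Collect models to save, recognising only the UNet stand-in."""
    to_save = {}
    for model in models:
        if isinstance(model, UNetStandIn):
            to_save["unet"] = model
        else:
            raise ValueError(f"unexpected save model: {model.__class__}")
    return to_save


def save_hook_with_text_encoder(models):
    """Same dispatch, with an extra branch for the text encoder stand-in."""
    to_save = {}
    for model in models:
        if isinstance(model, UNetStandIn):
            to_save["unet"] = model
        elif isinstance(model, TextEncoderStandIn):
            to_save["text_encoder"] = model
        else:
            raise ValueError(f"unexpected save model: {model.__class__}")
    return to_save


models = [UNetStandIn(), TextEncoderStandIn()]

try:
    save_hook_unet_only(models)
except ValueError as err:
    print("first hook failed:", err)

print(sorted(save_hook_with_text_encoder(models)))
```

The point of the sketch is only that every model passed to the hook needs a branch that recognises it.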

Logs 2 (checkpointing_steps=20)

06/30/2024 01:11:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'rescale_betas_zero_snr', 'variance_type', 'sample_max_value', 'thresholding', 'timestep_spacing', 'dynamic_thresholding_ratio', 'clip_sample_range', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'latents_std', 'latents_mean', 'shift_factor', 'scaling_factor', 'force_upcast', 'use_quant_conv', 'use_post_quant_conv'} was not found in config. Values will be initialized to default values.
{'encoder_hid_dim', 'dropout', 'attention_type', 'resnet_out_scale_factor', 'time_embedding_type', 'conv_out_kernel', 'mid_block_only_cross_attention', 'transformer_layers_per_block', 'addition_embed_type_num_heads', 'num_attention_heads', 'only_cross_attention', 'num_class_embeds', 'time_embedding_act_fn', 'mid_block_type', 'addition_time_embed_dim', 'encoder_hid_dim_type', 'resnet_time_scale_shift', 'dual_cross_attention', 'class_embed_type', 'upcast_attention', 'resnet_skip_time_act', 'use_linear_projection', 'class_embeddings_concat', 'time_embedding_dim', 'addition_embed_type', 'conv_in_kernel', 'reverse_transformer_layers_per_block', 'timestep_post_act', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'time_cond_proj_dim'} was not found in config. Values will be initialized to default values.
validation prompt: None
06/30/2024 01:11:15 - INFO - __main__ - ***** Running training *****
06/30/2024 01:11:15 - INFO - __main__ -   Num examples = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num batches each epoch = 10
06/30/2024 01:11:15 - INFO - __main__ -   Num Epochs = 10
06/30/2024 01:11:15 - INFO - __main__ -   Instantaneous batch size per device = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
06/30/2024 01:11:15 - INFO - __main__ -   Gradient Accumulation steps = 1
06/30/2024 01:11:15 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:51,  1.74it/s, loss=0.125, lr=0.0001]  Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2002, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1854, in main
    images = [
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1855, in <listcomp>
    pipeline(**pipeline_args, generator=generator).images[0]
NameError: free variable 'pipeline' referenced before assignment in enclosing scope
Steps:  10% 10/100 [00:08<01:18,  1.14it/s, loss=0.125, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
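
The traceback above points at a list comprehension that calls `pipeline(...)`, and the log shows "validation prompt: None". One way this kind of NameError arises is when a name is assigned only inside a branch that did not run and is then used inside a comprehension. The sketch below reproduces that pattern in isolation. Whether the script is structured this way is an assumption, and `run_validation` and the inner `pipeline` function are hypothetical stand-ins.

```python
def run_validation(validation_prompt, num_images=2):
    """Mimic a pipeline that is only built when a validation prompt is set."""
    if validation_prompt is not None:
        # Hypothetical stand-in for a real pipeline: returns a string per call.
        def pipeline(prompt):
            return f"image for {prompt!r}"

    # When validation_prompt is None, `pipeline` was never assigned,
    # so the comprehension below fails when it looks the name up.
    return [pipeline(validation_prompt) for _ in range(num_images)]


print(run_validation("a photo"))

try:
    run_validation(None)
except NameError as err:
    # Catching NameError also covers its subclass UnboundLocalError.
    print(type(err).__name__, err)
```

The exact wording of the message depends on the Python version, so it may not match the log above character for character. Guarding the comprehension with the same condition that creates the name avoids the error in this sketch.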

System Info

🤗 Diffusers version: 0.29.2
Platform: Linux-6.1.85+-x86_64-with-glibc2.35
Running on a notebook?: No
Running on Google Colab?: No
Python version: 3.10.12
PyTorch version (GPU?): 2.3.0+cu121 (True)
Flax version (CPU?/GPU?/TPU?): 0.8.4 (gpu)
Jax version: 0.4.26
JaxLib version: 0.4.26
Huggingface_hub version: 0.23.4
Transformers version: 4.42.3
Accelerate version: 0.31.0
PEFT version: 0.11.1
Bitsandbytes version: not installed
Safetensors version: 0.4.3
xFormers version: not installed
Accelerator: Tesla T4, 15360 MiB VRAM
Using GPU in script?:
Using distributed or parallel set-up in script?:

sayakpaul commented 4 months ago

Cc: @linoytsaban

DN6 commented 3 months ago

@josemerinom Thanks for reporting. Opened a PR to fix #8753

josemerinom commented 3 months ago

@linoytsaban @DN6

Hello, I tested the changes in the branch (--branch dreambooth-advanced). I still get the error when saving, but only when I use the --train_text_encoder parameter. When I don't use --train_text_encoder, the checkpoint is saved.

Reproduction: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb

DN6 commented 3 months ago

@josemerinom Could you share the exact traceback here? Not a screenshot.

josemerinom commented 3 months ago

> @josemerinom Could you share the exact traceback here? Not a screenshot.

Here is test 2, which I ran with the changes that were made to the code: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb

Here is what you requested:

Reproduction

%cd /content
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch dreambooth-advanced https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
!accelerate launch --num_cpu_threads_per_process=1 train_dreambooth_lora_sd15_advanced.py \
  --adam_beta1=0.9 \
  --adam_beta2=0.999 \
  --adam_epsilon=1e-8 \
  --adam_weight_decay=0.01 \
  --checkpointing_steps=10 \
  --dataloader_num_workers=0 \
  --gradient_accumulation_steps=1 \
  --instance_data_dir="/content/dataset" \
  --instance_prompt="c4myl4" \
  --learning_rate=1e-4 \
  --logging_dir="/content/log" \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_grad_norm=1 \
  --max_train_steps=100 \
  --mixed_precision="fp16" \
  --optimizer="AdamW" \
  --output_dir="/content/train" \
  --pretrained_model_name_or_path="josemerinom/zero15" \
  --prior_loss_weight=1 \
  --rank=32 \
  --resolution=512 \
  --seed=0 \
  --text_encoder_lr=1e-4 \
  --train_batch_size=1 \
  --train_text_encoder \
  #

Logs

2024-07-01 17:07:37.574871: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-01 17:07:37.574933: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-01 17:07:37.576329: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-01 17:07:37.590021: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-01 17:07:39.007929: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
07/01/2024 17:07:41 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

tokenizer/tokenizer_config.json: 100% 806/806 [00:00<00:00, 4.23MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 7.92MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 2.65MB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.87MB/s]
text_encoder/config.json: 100% 617/617 [00:00<00:00, 3.92MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 308/308 [00:00<00:00, 1.88MB/s]
{'timestep_spacing', 'thresholding', 'sample_max_value', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'prediction_type'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:10<00:00, 46.9MB/s]
vae/config.json: 100% 547/547 [00:00<00:00, 2.60MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:02<00:00, 134MB/s]
{'scaling_factor', 'use_post_quant_conv', 'shift_factor', 'latents_std', 'force_upcast', 'use_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
unet/config.json: 100% 743/743 [00:00<00:00, 4.52MB/s]
diffusion_pytorch_model.safetensors: 100% 3.44G/3.44G [01:22<00:00, 41.4MB/s]
{'addition_embed_type', 'class_embed_type', 'mid_block_only_cross_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_act_fn', 'reverse_transformer_layers_per_block', 'resnet_skip_time_act', 'attention_type', 'time_embedding_dim', 'resnet_time_scale_shift', 'conv_in_kernel', 'conv_out_kernel', 'timestep_post_act', 'num_class_embeds', 'upcast_attention', 'encoder_hid_dim', 'addition_embed_type_num_heads', 'mid_block_type', 'only_cross_attention', 'time_cond_proj_dim', 'time_embedding_type', 'encoder_hid_dim_type', 'dropout', 'dual_cross_attention', 'use_linear_projection', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'transformer_layers_per_block', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
validation prompt: None
07/01/2024 17:09:25 - INFO - __main__ - ***** Running training *****
07/01/2024 17:09:25 - INFO - __main__ -   Num examples = 10
07/01/2024 17:09:25 - INFO - __main__ -   Num batches each epoch = 10
07/01/2024 17:09:25 - INFO - __main__ -   Num Epochs = 10
07/01/2024 17:09:25 - INFO - __main__ -   Instantaneous batch size per device = 1
07/01/2024 17:09:25 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 1
07/01/2024 17:09:25 - INFO - __main__ -   Gradient Accumulation steps = 1
07/01/2024 17:09:25 - INFO - __main__ -   Total optimization steps = 100
Steps:  10% 10/100 [00:08<00:49,  1.81it/s, loss=0.00326, lr=0.0001]07/01/2024 17:09:34 - INFO - accelerate.accelerator - Saving current state to /content/train/checkpoint-10
Traceback (most recent call last):
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2012, in <module>
    main(args)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1802, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
    hook(self._models, weights, output_dir)
  File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
    raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps:  10% 10/100 [00:08<01:20,  1.12it/s, loss=0.00326, lr=0.0001]
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sd15_advanced.py', '--adam_beta1=0.9', '--adam_beta2=0.999', '--adam_epsilon=1e-8', '--adam_weight_decay=0.01', '--checkpointing_steps=10', '--dataloader_num_workers=0', '--gradient_accumulation_steps=1', '--instance_data_dir=/content/dataset', '--instance_prompt=c4myl4', '--learning_rate=1e-4', '--logging_dir=/content/log', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_grad_norm=1', '--max_train_steps=100', '--mixed_precision=fp16', '--optimizer=AdamW', '--output_dir=/content/train', '--pretrained_model_name_or_path=josemerinom/zero15', '--prior_loss_weight=1', '--rank=32', '--resolution=512', '--seed=0', '--text_encoder_lr=1e-4', '--train_batch_size=1', '--train_text_encoder']' returned non-zero exit status 1.

DN6 commented 3 months ago

@josemerinom Should be fixed in main now.

josemerinom commented 3 months ago

@DN6

Test 3: --branch main

Reproduction

https://colab.research.google.com/github/josemerinom/test/blob/master/test3.ipynb

Results

training start: OK
save checkpoint: OK
training completed: OK
test no lora / step 50 / step 100: OK

Training completed, but I only used 5 images and 100 steps, so the model learned little (too few steps).

I will try training for more steps and using DoRA (this is the reason I want to use advanced training).

Thanks
