Closed · josemerinom closed this 3 months ago
Cc: @linoytsaban
@josemerinom Thanks for reporting. Opened a PR to fix #8753
@linoytsaban @DN6
Hello, I tested the changes in the dreambooth-advanced branch. I still get the error when saving, but only when I use the --train_text_encoder parameter. When I don't use --train_text_encoder, the checkpoint is saved correctly.
reproduction > https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb
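(For context on why the flag matters: with --train_text_encoder the script also runs the text encoder through accelerator.prepare(), so Accelerate tracks it and passes it to the registered save-state hook at checkpoint time. Below is a minimal, self-contained sketch of that mechanism, using toy torch modules as stand-ins for the real UNet and CLIPTextModel; it is not the diffusers code itself.)

import torch
from accelerate import Accelerator

accelerator = Accelerator()
unet = torch.nn.Linear(4, 4)          # toy stand-in for the UNet
text_encoder = torch.nn.Linear(4, 4)  # toy stand-in for CLIPTextModel

def save_model_hook(models, weights, output_dir):
    # The hook receives every model Accelerate is currently tracking.
    print([type(m).__name__ for m in models])

accelerator.register_save_state_pre_hook(save_model_hook)

unet = accelerator.prepare(unet)                  # always prepared
text_encoder = accelerator.prepare(text_encoder)  # only happens with --train_text_encoder

accelerator.save_state("/tmp/state")  # the hook now sees both tracked models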
@josemerinom Could you share the exact traceback here? Not a screenshot.
Here is test 2, which I ran with the changes made in the code: https://colab.research.google.com/github/josemerinom/test/blob/master/test2.ipynb
Here is what you requested:
%cd /content
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch dreambooth-advanced https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
!accelerate launch --num_cpu_threads_per_process=1 train_dreambooth_lora_sd15_advanced.py \
--adam_beta1=0.9 \
--adam_beta2=0.999 \
--adam_epsilon=1e-8 \
--adam_weight_decay=0.01 \
--checkpointing_steps=10 \
--dataloader_num_workers=0 \
--gradient_accumulation_steps=1 \
--instance_data_dir="/content/dataset" \
--instance_prompt="c4myl4" \
--learning_rate=1e-4 \
--logging_dir="/content/log" \
--lr_scheduler="constant" \
--lr_warmup_steps=0 \
--max_grad_norm=1 \
--max_train_steps=100 \
--mixed_precision="fp16" \
--optimizer="AdamW" \
--output_dir="/content/train" \
--pretrained_model_name_or_path="josemerinom/zero15" \
--prior_loss_weight=1 \
--rank=32 \
--resolution=512 \
--seed=0 \
--text_encoder_lr=1e-4 \
--train_batch_size=1 \
--train_text_encoder
2024-07-01 17:07:37.574871: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-01 17:07:37.574933: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-01 17:07:37.576329: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-01 17:07:37.590021: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-01 17:07:39.007929: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
07/01/2024 17:07:41 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
tokenizer/tokenizer_config.json: 100% 806/806 [00:00<00:00, 4.23MB/s]
tokenizer/vocab.json: 100% 1.06M/1.06M [00:00<00:00, 7.92MB/s]
tokenizer/merges.txt: 100% 525k/525k [00:00<00:00, 2.65MB/s]
tokenizer/special_tokens_map.json: 100% 472/472 [00:00<00:00, 2.87MB/s]
text_encoder/config.json: 100% 617/617 [00:00<00:00, 3.92MB/s]
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
scheduler/scheduler_config.json: 100% 308/308 [00:00<00:00, 1.88MB/s]
{'timestep_spacing', 'thresholding', 'sample_max_value', 'rescale_betas_zero_snr', 'variance_type', 'clip_sample_range', 'dynamic_thresholding_ratio', 'prediction_type'} was not found in config. Values will be initialized to default values.
model.safetensors: 100% 492M/492M [00:10<00:00, 46.9MB/s]
vae/config.json: 100% 547/547 [00:00<00:00, 2.60MB/s]
diffusion_pytorch_model.safetensors: 100% 335M/335M [00:02<00:00, 134MB/s]
{'scaling_factor', 'use_post_quant_conv', 'shift_factor', 'latents_std', 'force_upcast', 'use_quant_conv', 'latents_mean'} was not found in config. Values will be initialized to default values.
unet/config.json: 100% 743/743 [00:00<00:00, 4.52MB/s]
diffusion_pytorch_model.safetensors: 100% 3.44G/3.44G [01:22<00:00, 41.4MB/s]
{'addition_embed_type', 'class_embed_type', 'mid_block_only_cross_attention', 'class_embeddings_concat', 'addition_time_embed_dim', 'resnet_out_scale_factor', 'time_embedding_act_fn', 'reverse_transformer_layers_per_block', 'resnet_skip_time_act', 'attention_type', 'time_embedding_dim', 'resnet_time_scale_shift', 'conv_in_kernel', 'conv_out_kernel', 'timestep_post_act', 'num_class_embeds', 'upcast_attention', 'encoder_hid_dim', 'addition_embed_type_num_heads', 'mid_block_type', 'only_cross_attention', 'time_cond_proj_dim', 'time_embedding_type', 'encoder_hid_dim_type', 'dropout', 'dual_cross_attention', 'use_linear_projection', 'projection_class_embeddings_input_dim', 'cross_attention_norm', 'transformer_layers_per_block', 'num_attention_heads'} was not found in config. Values will be initialized to default values.
validation prompt: None
07/01/2024 17:09:25 - INFO - __main__ - ***** Running training *****
07/01/2024 17:09:25 - INFO - __main__ - Num examples = 10
07/01/2024 17:09:25 - INFO - __main__ - Num batches each epoch = 10
07/01/2024 17:09:25 - INFO - __main__ - Num Epochs = 10
07/01/2024 17:09:25 - INFO - __main__ - Instantaneous batch size per device = 1
07/01/2024 17:09:25 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
07/01/2024 17:09:25 - INFO - __main__ - Gradient Accumulation steps = 1
07/01/2024 17:09:25 - INFO - __main__ - Total optimization steps = 100
Steps: 10% 10/100 [00:08<00:49, 1.81it/s, loss=0.00326, lr=0.0001]07/01/2024 17:09:34 - INFO - accelerate.accelerator - Saving current state to /content/train/checkpoint-10
Traceback (most recent call last):
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 2012, in <module>
main(args)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1802, in main
accelerator.save_state(save_path)
File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2955, in save_state
hook(self._models, weights, output_dir)
File "/content/diffusers/examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py", line 1293, in save_model_hook
raise ValueError(f"unexpected save model: {model.__class__}")
ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
Steps: 10% 10/100 [00:08<01:20, 1.12it/s, loss=0.00326, lr=0.0001]
Traceback (most recent call last):
File "/usr/local/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1097, in launch_command
simple_launcher(args)
File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 703, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', 'train_dreambooth_lora_sd15_advanced.py', '--adam_beta1=0.9', '--adam_beta2=0.999', '--adam_epsilon=1e-8', '--adam_weight_decay=0.01', '--checkpointing_steps=10', '--dataloader_num_workers=0', '--gradient_accumulation_steps=1', '--instance_data_dir=/content/dataset', '--instance_prompt=c4myl4', '--learning_rate=1e-4', '--logging_dir=/content/log', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--max_grad_norm=1', '--max_train_steps=100', '--mixed_precision=fp16', '--optimizer=AdamW', '--output_dir=/content/train', '--pretrained_model_name_or_path=josemerinom/zero15', '--prior_loss_weight=1', '--rank=32', '--resolution=512', '--seed=0', '--text_encoder_lr=1e-4', '--train_batch_size=1', '--train_text_encoder']' returned non-zero exit status 1.
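(The raise comes from the class-based dispatch inside save_model_hook: the hook walks every tracked model and matches it by class. Here is a hypothetical, self-contained sketch of that pattern, with toy classes instead of the real diffusers/transformers ones, that reproduces the failure mode: before the fix there was no branch recognizing the text encoder, so with --train_text_encoder the tracked CLIPTextModel fell through to the ValueError.)

class UNet2DConditionModel: ...  # toy stand-in for the real UNet class
class CLIPTextModel: ...         # toy stand-in for the real text encoder class

def save_model_hook(models, weights, output_dir):
    for model in models:
        if isinstance(model, UNet2DConditionModel):
            pass  # collect the UNet LoRA weights here
        # Missing before the fix: an elif branch for the text encoder,
        # so a tracked CLIPTextModel falls through:
        else:
            raise ValueError(f"unexpected save model: {model.__class__}")
        if weights:
            weights.pop()  # pop the default state so only LoRA weights get written

try:
    save_model_hook([UNet2DConditionModel(), CLIPTextModel()],
                    [object(), object()], "/tmp/checkpoint-10")
except ValueError as e:
    print(e)  # unexpected save model: <class '__main__.CLIPTextModel'>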
@josemerinom Should be fixed in main now.
@DN6
Test 3: --branch main
https://colab.research.google.com/github/josemerinom/test/blob/master/test3.ipynb
training start: OK
save checkpoint: OK
training completed: OK
test (no LoRA / step 50 / step 100): OK
Training completed, but I only used 5 images and 100 steps, so the learned effect is weak (too few steps).
I will try training with more steps and with DoRA (this is the reason I want to use the advanced training script).
Thanks
Describe the bug
Today I trained using examples/dreambooth/train_dreambooth_lora.py in Google Colab and everything was OK.
I wanted to try examples/advanced_diffusion_training/train_dreambooth_lora_sd15_advanced.py with the original Stable Diffusion 1.5 model (which I cloned to my HF account), but an error is raised when the script tries to save a checkpoint.
dataset = 10 images
checkpointing_steps=10 --> ValueError: unexpected save model: <class 'transformers.models.clip.modeling_clip.CLIPTextModel'>
A different error appears when I set checkpointing_steps to a value different from the number of images: checkpointing_steps=20 --> NameError: free variable 'pipeline' referenced before assignment in enclosing scope
validation prompt: None
06/30/2024 01:09:00 - INFO - __main__ - ***** Running training *****
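(The NameError with checkpointing_steps=20 is a plain Python scoping issue rather than a save-hook one: a nested function closes over a name, pipeline, that the enclosing scope only binds on a code path that did not run. A self-contained illustration with hypothetical names, not the script's actual code:)

def main(run_validation=False):
    def log_validation():
        print(pipeline)  # free variable, resolved in main()'s scope

    if run_validation:
        pipeline = "loaded pipeline"  # skipped, so the name is never bound

    log_validation()

main()
# NameError: free variable 'pipeline' referenced before assignment
# in enclosing scope  (Python 3.10 wording)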
Reproduction
%cd /content
!mkdir /content/cache
!mkdir /content/dataset
!mkdir /content/log
!mkdir /content/train
!git clone --branch v0.29.2-patch https://github.com/huggingface/diffusers
!pip install accelerate==0.31.0
!pip install datasets==2.19.0
!pip install ftfy==6.2.0
!pip install Jinja2==3.1.4
!pip install peft==0.11.1
!pip install tensorboard==2.15.2
!pip install torchvision==0.18.0+cu121
!pip install transformers==4.42.3
%cd /content/diffusers
!pip install -e .
!accelerate config
%cd /content/diffusers/examples/advanced_diffusion_training
https://colab.research.google.com/github/josemerinom/test/blob/master/test.ipynb
Logs 1 (checkpointing_steps=10)
Logs 2 (checkpointing_steps=20)
System Info