Closed: linoytsaban closed this issue 10 months ago
Hello, could you please share the launch command and the versions of the Accelerate, Diffusers and Transformers?
Edited to include these, thanks!
When you create a new optimizer mid-way through training, it breaks the scheduler, which was using the previous optimizer, and the Accelerator object has no knowledge of the new optimizer because it wasn't prepared via `accelerator.prepare`. This leaves the optimizer and scheduler in a broken state. Instead of creating a new optimizer, please set the learning rate of those param groups to 0, as is done in https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/training_scripts/train_lora_w_ti.py#L987-L994.
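For illustration, here is a minimal sketch of that approach, assuming the text-encoder parameter groups were appended after the unet group (the `pivot_step` condition and the group ordering are placeholders, not the actual script's layout):

```python
# Sketch: instead of building a new optimizer at the pivot, zero out the
# learning rate of the text-encoder parameter groups so the optimizer and
# scheduler objects prepared via accelerator.prepare() stay intact.
if global_step == pivot_step:  # hypothetical pivot condition
    # assuming param_groups[0] holds the unet params and the remaining
    # groups hold the text-encoder params
    for param_group in optimizer.param_groups[1:]:
        param_group["lr"] = 0.0
```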
Hope this helps.
In this gist you can find the suggested changes; with them applied, training works as expected without this issue: https://gist.github.com/pacman100/fb01751ddecae63757c90d11f8cf13f9
command:
accelerate launch dreambooth_with_pivotal_tuning_inversion.py --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" --dataset_name="LinoyTsaban/3d_icon" --instance_prompt="a TOK icon" --output_dir="3d_icon_SDXL_LoRA" --mixed_precision="fp16" --resolution=1024 --train_batch_size=1 --gradient_accumulation_steps=1 --gradient_checkpointing --train_text_encoder_ti --train_text_encoder_ti_frac=0.5 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=100 --checkpointing_steps=2000 --seed="0"
output logs:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/raid/sourab/accelerate/src/accelerate/accelerator.py:382: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
11/28/2023 11:10:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'thresholding', 'variance_type', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values.
0 text encodedr's std_token_embedding: 0.01531801838427782
torch.Size([49410])
1 text encodedr's std_token_embedding: 0.014434970915317535
torch.Size([49410])
Resolving data files: 100%|██████████| 24/24 [00:00<00:00, 74.10it/s]
/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/PIL/Image.py:3157: DecompressionBombWarning: Image size (122880000 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/PIL/Image.py:3157: DecompressionBombWarning: Image size (132710400 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
11/28/2023 11:10:23 - INFO - __main__ - No caption column provided, defaulting to instance_prompt for all images. If your dataset contains captions/prompts for the images, make sure to specify the column as --caption_column
validation prompt: None
11/28/2023 11:10:23 - INFO - __main__ - ***** Running training *****
11/28/2023 11:10:23 - INFO - __main__ - Num examples = 23
11/28/2023 11:10:23 - INFO - __main__ - Num batches each epoch = 23
11/28/2023 11:10:23 - INFO - __main__ - Num Epochs = 5
11/28/2023 11:10:23 - INFO - __main__ - Instantaneous batch size per device = 1
11/28/2023 11:10:23 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
11/28/2023 11:10:23 - INFO - __main__ - Gradient Accumulation steps = 1
11/28/2023 11:10:23 - INFO - __main__ - Total optimization steps = 100
Steps: 46%|████▌     | 46/100 [01:09<01:19, 1.47s/it, loss=0.00487, lr=0.0001]PIVOT HALFWAY 2
Steps: 100%|██████████| 100/100 [02:30<00:00, 1.44s/it, loss=0.0117, lr=0.0001][2023-11-28 11:12:54,454] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Model weights saved in 3d_icon_SDXL_LoRA/pytorch_lora_weights.safetensors
model_index.json: 100%|██████████| 609/609 [00:00<00:00, 7.20MB/s]
diffusion_pytorch_model.safetensors: 100%|██████████| 335M/335M [00:00<00:00, 486MB/s]
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 20.07it/s]
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 20.07it/sLoaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 0/7 [00:00<?, ?it/s]
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values. | 1/7 [00:00<00:05, 1.18it/s]
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 3/7 [00:03<00:04, 1.09s/it]
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.█████ | 6/7 [00:03<00:00, 1.97it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:03<00:00, 1.93it/s]
{'algorithm_type', 'dynamic_thresholding_ratio', 'lambda_min_clipped', 'solver_type', 'variance_type', 'use_lu_lambdas', 'euler_at_final', 'solver_order', 'thresholding', 'lower_order_final'} was not found in config. Values will be initialized to default values.
Loading unet.
Steps: 100%|██████████| 100/100 [02:37<00:00, 1.57s/it, loss=0.0117, lr=0.0001]
Also, as a side note, this script breaks down when using tensorboard because the config now has lists as values, which breaks the tensorboard logging. Notice `'token_abstraction': ['TOK']`.
{'pretrained_model_name_or_path': 'stabilityai/stable-diffusion-xl-base-1.0', 'pretrained_vae_model_name_or_path': 'madebyollin/sdxl-vae-fp16-fix', 'revision': None, 'dataset_name': 'LinoyTsaban/3d_icon', 'dataset_config_name': None, 'instance_data_dir': None, 'cache_dir': None, 'image_column': 'image', 'caption_column': None, 'repeats': 1, 'class_data_dir': None, 'instance_prompt': 'a <s0><s1> icon', 'token_abstraction': ['TOK'], 'num_new_tokens_per_abstraction': 2, 'class_prompt': None, 'validation_prompt': None, 'num_validation_images': 4, 'validation_epochs': 50, 'with_prior_preservation': False, 'prior_loss_weight': 1.0, 'num_class_images': 100, 'output_dir': '3d_icon_SDXL_LoRA', 'seed': 0, 'resolution': 1024, 'crops_coords_top_left_h': 0, 'crops_coords_top_left_w': 0, 'center_crop': False, 'train_text_encoder': False, 'train_batch_size': 1, 'sample_batch_size': 4, 'num_train_epochs': 5, 'max_train_steps': 100, 'checkpointing_steps': 2000, 'checkpoints_total_limit': None, 'resume_from_checkpoint': None, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': True, 'learning_rate': 0.0001, 'text_encoder_lr': 5e-06, 'scale_lr': False, 'lr_scheduler': 'constant', 'snr_gamma': None, 'lr_warmup_steps': 0, 'lr_num_cycles': 1, 'lr_power': 1.0, 'dataloader_num_workers': 0, 'train_text_encoder_ti': True, 'train_text_encoder_ti_frac': 0.5, 'train_text_encoder_frac': 1.0, 'optimizer': 'adamW', 'use_8bit_adam': False, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'prodigy_beta3': None, 'prodigy_decouple': True, 'adam_weight_decay': 0.0001, 'adam_weight_decay_text_encoder': None, 'adam_epsilon': 1e-08, 'prodigy_use_bias_correction': True, 'prodigy_safeguard_warmup': True, 'max_grad_norm': 1.0, 'push_to_hub': False, 'hub_token': None, 'hub_model_id': None, 'logging_dir': 'logs', 'allow_tf32': False, 'report_to': 'tensorboard', 'mixed_precision': 'fp16', 'prior_generation_precision': None, 'local_rank': -1, 'enable_xformers_memory_efficient_attention': False, 'rank': 4}
Traceback (most recent call last):
File "/raid/sourab/temp/issues/accelerate/train_dreambooth_lora_sdxl_advanced.py", line 1987, in <module>
main(args)
File "/raid/sourab/temp/issues/accelerate/train_dreambooth_lora_sdxl_advanced.py", line 1570, in main
accelerator.init_trackers("dreambooth-lora-sd-xl", config=vars(args))
File "/raid/sourab/accelerate/src/accelerate/accelerator.py", line 617, in _inner
return PartialState().on_main_process(function)(*args, **kwargs)
File "/raid/sourab/accelerate/src/accelerate/accelerator.py", line 2335, in init_trackers
tracker.store_init_configuration(config)
File "/raid/sourab/accelerate/src/accelerate/tracking.py", line 75, in execute_on_main_process
return PartialState().on_main_process(function)(self, *args, **kwargs)
File "/raid/sourab/accelerate/src/accelerate/tracking.py", line 207, in store_init_configuration
self.writer.add_hparams(values, metric_dict={})
File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 332, in add_hparams
exp, ssi, sei = hparams(hparam_dict, metric_dict, hparam_domain_discrete)
File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/utils/tensorboard/summary.py", line 281, in hparams
raise ValueError(
ValueError: value should be one of int, float, str, bool, or torch.Tensor
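One possible workaround, sketched below under the assumption that stringifying non-scalar values is acceptable: tensorboard's `add_hparams` only accepts int, float, str, bool, or `torch.Tensor` values, so list-valued entries such as `token_abstraction` can be converted before the config reaches the tracker.

```python
# Hypothetical workaround sketch, not the upstream fix: cast values that
# tensorboard cannot log (e.g. token_abstraction=['TOK']) to strings.
tracker_config = {
    k: (v if isinstance(v, (int, float, str, bool)) else str(v))
    for k, v in vars(args).items()
}
accelerator.init_trackers("dreambooth-lora-sd-xl", config=tracker_config)
```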
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm building a training script (Dreambooth LoRA) and we are having an issue with `accelerate` and modifying the optimizer mid-training: we have an operation to stop training the text encoder midway through the steps but keep training the unet, meaning we drop the text-encoder parameters halfway and re-initialize the optimizer with only the unet parameters.
Specifically, it's failing when training in `fp16`. The same operation/code works fine when training in `bf16` or `fp32`.
This is the error we get when trying to do this operation in `fp16`:
RuntimeError: unscale_() has already been called on this optimizer since the last update().
The code where the optimizer update takes place: https://github.com/linoytsaban/diffusers/blob/0af8f44755f0ca6e9e835f85f285a2d7133c[…]anced_diffusion_training/train_dreambooth_lora_sdxl_advanced.py
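For reference, a simplified sketch of the pattern described above (names like `unet_lora_params` are illustrative, not taken from the script):

```python
import torch

# Simplified sketch of the mid-training pivot: drop the text-encoder params
# and re-initialize the optimizer with only the unet params. Note that the
# new optimizer is not passed through accelerator.prepare() again.
if global_step == args.max_train_steps // 2:
    optimizer = torch.optim.AdamW(unet_lora_params, lr=args.learning_rate)
```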
Environment: Accelerate==0.24.1, Diffusers==0.24.0.dev0, Transformers==4.30.2
Launch command:
*You can reduce `--max_train_steps=1000` significantly to get to the breaking point faster.