Closed: linoytsaban closed this issue 10 months ago
Hello, could you please share the launch command and the versions of the Accelerate, Diffusers and Transformers?
Edited to include these, thanks!
When you create a new optimizer mid-way through training, it breaks the scheduler, which was using the previous optimizer, and the Accelerator object has no knowledge of the new optimizer because it wasn't prepared via `accelerator.prepare`. This leaves the optimizer and scheduler in a broken state. Instead of creating a new optimizer, please set the learning rate of those param groups to 0, as is done in https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/training_scripts/train_lora_w_ti.py#L987-L994.
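For illustration, here is a minimal sketch of that approach, assuming the text-encoder parameter groups were appended after the unet group (the `pivot_step` condition and the group ordering are placeholders, not the actual script's layout):

```python
# Sketch: instead of building a new optimizer at the pivot, zero out the
# learning rate of the text-encoder parameter groups so the optimizer and
# scheduler objects prepared via accelerator.prepare() stay intact.
if global_step == pivot_step:  # hypothetical pivot condition
    # assuming param_groups[0] holds the unet params and the remaining
    # groups hold the text-encoder params
    for param_group in optimizer.param_groups[1:]:
        param_group["lr"] = 0.0
```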
Hope this helps.
In this gist you can find the suggested changes; with them applied, training works as expected without this issue: https://gist.github.com/pacman100/fb01751ddecae63757c90d11f8cf13f9
command:
accelerate launch dreambooth_with_pivotal_tuning_inversion.py --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" --dataset_name="LinoyTsaban/3d_icon" --instance_prompt="a TOK icon" --output_dir="3d_icon_SDXL_LoRA" --mixed_precision="fp16" --resolution=1024 --train_batch_size=1 --gradient_accumulation_steps=1 --gradient_checkpointing --train_text_encoder_ti --train_text_encoder_ti_frac=0.5 --lr_scheduler="constant" --lr_warmup_steps=0 --max_train_steps=100 --checkpointing_steps=2000 --seed="0"
output logs:
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `1`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/raid/sourab/accelerate/src/accelerate/accelerator.py:382: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
11/28/2023 11:10:01 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Mixed precision type: fp16
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
You are using a model of type clip_text_model to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'clip_sample_range', 'thresholding', 'variance_type', 'dynamic_thresholding_ratio'} was not found in config. Values will be initialized to default values.
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values.
0 text encodedr's std_token_embedding: 0.01531801838427782
torch.Size([49410])
1 text encodedr's std_token_embedding: 0.014434970915317535
torch.Size([49410])
Resolving data files: 100%|██████████| 24/24 [00:00<00:00, 74.10it/s]
/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/PIL/Image.py:3157: DecompressionBombWarning: Image size (122880000 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/PIL/Image.py:3157: DecompressionBombWarning: Image size (132710400 pixels) exceeds limit of 89478485 pixels, could be decompression bomb DOS attack.
warnings.warn(
11/28/2023 11:10:23 - INFO - __main__ - No caption column provided, defaulting to instance_prompt for all images. If your dataset contains captions/prompts for the images, make sure to specify the column as --caption_column
validation prompt: None
11/28/2023 11:10:23 - INFO - __main__ - ***** Running training *****
11/28/2023 11:10:23 - INFO - __main__ - Num examples = 23
11/28/2023 11:10:23 - INFO - __main__ - Num batches each epoch = 23
11/28/2023 11:10:23 - INFO - __main__ - Num Epochs = 5
11/28/2023 11:10:23 - INFO - __main__ - Instantaneous batch size per device = 1
11/28/2023 11:10:23 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 1
11/28/2023 11:10:23 - INFO - __main__ - Gradient Accumulation steps = 1
11/28/2023 11:10:23 - INFO - __main__ - Total optimization steps = 100
Steps: 46%|████▌     | 46/100 [01:09<01:19, 1.47s/it, loss=0.00487, lr=0.0001]PIVOT HALFWAY 2
Steps: 100%|██████████| 100/100 [02:30<00:00, 1.44s/it, loss=0.0117, lr=0.0001][2023-11-28 11:12:54,454] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Model weights saved in 3d_icon_SDXL_LoRA/pytorch_lora_weights.safetensors
model_index.json: 100%|██████████| 609/609 [00:00<00:00, 7.20MB/s]
diffusion_pytorch_model.safetensors: 100%|██████████| 335M/335M [00:00<00:00, 486MB/s]
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 20.07it/s]
Fetching 17 files: 100%|██████████| 17/17 [00:00<00:00, 20.07it/sLoaded text_encoder_2 as CLIPTextModelWithProjection from `text_encoder_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 0/7 [00:00<?, ?it/s]
{'dropout', 'reverse_transformer_layers_per_block', 'attention_type'} was not found in config. Values will be initialized to default values. | 1/7 [00:00<00:05, 1.18it/s]
Loaded unet as UNet2DConditionModel from `unet` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded tokenizer as CLIPTokenizer from `tokenizer` subfolder of stabilityai/stable-diffusion-xl-base-1.0. | 3/7 [00:03<00:04, 1.09s/it]
Loaded tokenizer_2 as CLIPTokenizer from `tokenizer_2` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded text_encoder as CLIPTextModel from `text_encoder` subfolder of stabilityai/stable-diffusion-xl-base-1.0.
Loaded scheduler as EulerDiscreteScheduler from `scheduler` subfolder of stabilityai/stable-diffusion-xl-base-1.0.█████ | 6/7 [00:03<00:00, 1.97it/s]
Loading pipeline components...: 100%|██████████| 7/7 [00:03<00:00, 1.93it/s]
{'algorithm_type', 'dynamic_thresholding_ratio', 'lambda_min_clipped', 'solver_type', 'variance_type', 'use_lu_lambdas', 'euler_at_final', 'solver_order', 'thresholding', 'lower_order_final'} was not found in config. Values will be initialized to default values.
Loading unet.
Steps: 100%|██████████| 100/100 [02:37<00:00, 1.57s/it, loss=0.0117, lr=0.0001]
Also, as a side note, this script breaks down when using tensorboard because the config now has lists as values, which breaks the tensorboard logging. Notice `'token_abstraction': ['TOK']`.
{'pretrained_model_name_or_path': 'stabilityai/stable-diffusion-xl-base-1.0', 'pretrained_vae_model_name_or_path': 'madebyollin/sdxl-vae-fp16-fix', 'revision': None, 'dataset_name': 'LinoyTsaban/3d_icon', 'dataset_config_name': None, 'instance_data_dir': None, 'cache_dir': None, 'image_column': 'image', 'caption_column': None, 'repeats': 1, 'class_data_dir': None, 'instance_prompt': 'a <s0><s1> icon', 'token_abstraction': ['TOK'], 'num_new_tokens_per_abstraction': 2, 'class_prompt': None, 'validation_prompt': None, 'num_validation_images': 4, 'validation_epochs': 50, 'with_prior_preservation': False, 'prior_loss_weight': 1.0, 'num_class_images': 100, 'output_dir': '3d_icon_SDXL_LoRA', 'seed': 0, 'resolution': 1024, 'crops_coords_top_left_h': 0, 'crops_coords_top_left_w': 0, 'center_crop': False, 'train_text_encoder': False, 'train_batch_size': 1, 'sample_batch_size': 4, 'num_train_epochs': 5, 'max_train_steps': 100, 'checkpointing_steps': 2000, 'checkpoints_total_limit': None, 'resume_from_checkpoint': None, 'gradient_accumulation_steps': 1, 'gradient_checkpointing': True, 'learning_rate': 0.0001, 'text_encoder_lr': 5e-06, 'scale_lr': False, 'lr_scheduler': 'constant', 'snr_gamma': None, 'lr_warmup_steps': 0, 'lr_num_cycles': 1, 'lr_power': 1.0, 'dataloader_num_workers': 0, 'train_text_encoder_ti': True, 'train_text_encoder_ti_frac': 0.5, 'train_text_encoder_frac': 1.0, 'optimizer': 'adamW', 'use_8bit_adam': False, 'adam_beta1': 0.9, 'adam_beta2': 0.999, 'prodigy_beta3': None, 'prodigy_decouple': True, 'adam_weight_decay': 0.0001, 'adam_weight_decay_text_encoder': None, 'adam_epsilon': 1e-08, 'prodigy_use_bias_correction': True, 'prodigy_safeguard_warmup': True, 'max_grad_norm': 1.0, 'push_to_hub': False, 'hub_token': None, 'hub_model_id': None, 'logging_dir': 'logs', 'allow_tf32': False, 'report_to': 'tensorboard', 'mixed_precision': 'fp16', 'prior_generation_precision': None, 'local_rank': -1, 'enable_xformers_memory_efficient_attention': False, 'rank': 4}
Traceback (most recent call last):
File "/raid/sourab/temp/issues/accelerate/train_dreambooth_lora_sdxl_advanced.py", line 1987, in <module>
main(args)
File "/raid/sourab/temp/issues/accelerate/train_dreambooth_lora_sdxl_advanced.py", line 1570, in main
accelerator.init_trackers("dreambooth-lora-sd-xl", config=vars(args))
File "/raid/sourab/accelerate/src/accelerate/accelerator.py", line 617, in _inner
return PartialState().on_main_process(function)(*args, **kwargs)
File "/raid/sourab/accelerate/src/accelerate/accelerator.py", line 2335, in init_trackers
tracker.store_init_configuration(config)
File "/raid/sourab/accelerate/src/accelerate/tracking.py", line 75, in execute_on_main_process
return PartialState().on_main_process(function)(self, *args, **kwargs)
File "/raid/sourab/accelerate/src/accelerate/tracking.py", line 207, in store_init_configuration
self.writer.add_hparams(values, metric_dict={})
File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 332, in add_hparams
exp, ssi, sei = hparams(hparam_dict, metric_dict, hparam_domain_discrete)
File "/raid/sourab/miniconda3/envs/hf/lib/python3.10/site-packages/torch/utils/tensorboard/summary.py", line 281, in hparams
raise ValueError(
ValueError: value should be one of int, float, str, bool, or torch.Tensor
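One possible workaround, sketched below under the assumption that stringifying non-scalar values is acceptable: tensorboard's `add_hparams` only accepts int, float, str, bool, or `torch.Tensor` values, so list-valued entries such as `token_abstraction` can be converted before the config reaches the tracker.

```python
# Hypothetical workaround sketch, not the upstream fix: cast values that
# tensorboard cannot log (e.g. token_abstraction=['TOK']) to strings.
tracker_config = {
    k: (v if isinstance(v, (int, float, str, bool)) else str(v))
    for k, v in vars(args).items()
}
accelerator.init_trackers("dreambooth-lora-sd-xl", config=tracker_config)
```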
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
I'm building a training script (Dreambooth LoRA) and we are having an issue with `accelerate` and modifying the optimizer mid-training: we have an operation to stop training the text encoder midway through the steps but keep training the unet, meaning we drop the text-encoder parameters halfway and re-initialize the optimizer with only the unet parameters.
Specifically, it's failing when training in `fp16`. The same operation/code works fine when training in `bf16` or `fp32`.
This is the error we get when trying to do this operation in `fp16`:
RuntimeError: unscale_() has already been called on this optimizer since the last update().
The code where the optimizer update takes place: https://github.com/linoytsaban/diffusers/blob/0af8f44755f0ca6e9e835f85f285a2d7133c[…]anced_diffusion_training/train_dreambooth_lora_sdxl_advanced.py
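For reference, a simplified sketch of the pattern described above (names like `unet_lora_params` are illustrative, not taken from the script):

```python
import torch

# Simplified sketch of the mid-training pivot: drop the text-encoder params
# and re-initialize the optimizer with only the unet params. Note that the
# new optimizer is not passed through accelerator.prepare() again.
if global_step == args.max_train_steps // 2:
    optimizer = torch.optim.AdamW(unet_lora_params, lr=args.learning_rate)
```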
Environment: Accelerate==0.24.1, Diffusers==0.24.0.dev0, Transformers==4.30.2
Launch command:
*You can reduce `--max_train_steps=1000` significantly to get to the breaking point faster.