[Closed] mnslarcher closed this issue 1 year ago
Could you try installing `diffusers` from source? You can ensure this by first uninstalling `diffusers` and then reinstalling it with `pip install git+https://github.com/huggingface/diffusers/`. Also, could you try enabling `gradient_checkpointing` and `enable_xformers_memory_efficient_attention()` to help prevent the OOM?
Thanks @sayakpaul, I'll try all the suggestions tomorrow. I'm pretty sure I'm already installing from source given the `diffusers-cli` output, but I'll try what you suggest.
As for xformers, does it also make sense with the torch 2 I'm using? Or are you suggesting switching to torch 1?
I tried a similar configuration (with `enable_xformers_memory_efficient_attention`, `--mixed_precision="fp16"`, `--use_8bit_adam`, and `--gradient_checkpointing` enabled) and I ran into a CUDA out-of-memory error on a T4 16 GB:

```bash
!accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="$INSTANCE_DIR_PARSED" \
  --caption_column="text" \
  --resolution=1024 \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --checkpointing_steps=700000 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="$OUTPUT_DIR" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --use_8bit_adam
```
The issue looks related to accelerate v0.22.0 (published a few hours ago). With accelerate==0.21.0, the training finishes correctly. Same problem with DreamBooth training.
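If you hit this regression, one way to work around it until a fix lands is to pin accelerate in your requirements (a sketch; 0.21.0 is simply the last version reported working in this thread):

```
accelerate==0.21.0
```

Once a fixed release is out, the workaround is just relaxing or removing the pin.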
Thanks for sharing. Ccing @muellerzr for https://github.com/huggingface/diffusers/issues/4736#issuecomment-1690786925.
My version of accelerate is pinned to 0.21.0:
```yaml
channels:
  - defaults
dependencies:
  - nb_conda_kernels
  - ipykernel
  - jupyter
  - pip
  - python=3.10
  - pip:
      - accelerate==0.21.0
      - "black[jupyter]==23.7.0"
      - datasets==2.14.4
      - git+https://github.com/huggingface/diffusers
      - ftfy==6.1.1
      - gradio==3.40.1
      - isort==5.12.0
      - Jinja2==3.1.2
      - tensorboard==2.14.0
      - torch==2.0.1
      - torchvision==0.15.2
      - transformers==4.31.0
      - wandb==0.15.8
```
I will try the suggestions, but honestly it seems strange to me that I underutilize resources the whole time and then go OOM during testing. Anyway, I'll do some testing today and report the results here.
Quick live report: I'm making changes one at a time. Here's the current setup:
```bash
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"

accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --caption_column="text" \
  --resolution=1024 \
  --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=500 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --dataloader_num_workers=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-sdxl-txt" \
  --train_text_encoder \
  --validation_prompt="cute dragon creature" \
  --report_to="wandb" \
  --mixed_precision="fp16" \
  --rank=4
```
The improved VAE-f16 fixes the issue of black images. I'd also mention this in the LoRA README.md, thanks!
I'm observing suspicious GPU memory behavior. It seems there should be a way to avoid almost reaching OOM during validation/testing while maintaining memory usage around 67% during training. Here's the GPU memory behavior chart for reference:
By the way, as you can see I am almost OOM now, whereas before I was going OOM, so the VAE update "fixes the problem". However, I think the real problem is still there, because I can't fully utilize the GPU if the memory increases by 50% during validation and testing.
> The improved VAE-f16 fixes the issue of black images. I'd also mention this in the LoRA README.md, thanks!
Feel free to drop a PR.
> By the way, as you can see I am almost OOM now, whereas before I was going OOM, so the VAE update "fixes the problem". However, I think the real problem is still there, because I can't fully utilize the GPU if the memory increases by 50% during validation and testing.
We clear the pipeline during validation and testing:
Maybe it could be made better if we reuse the text encoders during validation and testing, which I think we're already doing. Additionally, I would suggest enabling xformers even when using PT 2.0, as it tends to perform slightly better than SDPA.
> Feel free to drop a PR.
I will do it!
> We clear the pipeline during validation and testing:
Yes, that is working; in fact, after that the memory goes down again to 67%. The problem is just before it, I need to explore it further.
> Maybe it could be made better if we reuse the text encoders during validation and testing, which I think we're already doing.
Yes, from here it seems it's already like this:

```python
text_encoder=accelerator.unwrap_model(text_encoder_one),
text_encoder_2=accelerator.unwrap_model(text_encoder_two),
```
> Additionally, I would suggest enabling xformers even when using PT 2.0 as it tends to perform slightly better than SDPA.
Thanks for the suggestion!
@sayakpaul, I believe I've found a place where we're consuming a significant amount of memory.
In this section: https://github.com/huggingface/diffusers/blob/cdacd8f1ddaf729f30c9be6fb405c76ae8d1c490/examples/text_to_image/train_text_to_image_lora_sdxl.py#L1187
We're converting the unet and text encoders to torch.float32.
Afterwards, we create a new pipeline, reusing only the VAE. At this point, it seems to me that we have 2 unets and 4 text encoders in memory, with half of them in fp32. Could this be the case?
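A rough back-of-envelope supports this. The parameter counts below are approximate assumptions of mine (SDXL base UNet around 2.6B parameters, the two text encoders around 0.8B combined), not measured values:

```python
# Back-of-envelope GPU memory for duplicated SDXL weights during validation.
# Parameter counts are approximate assumptions, not measured values.
GIB = 1024**3

def weights_gib(n_params: int, bytes_per_param: int) -> float:
    """Size of one copy of the weights in GiB."""
    return n_params * bytes_per_param / GIB

UNET_PARAMS = 2_600_000_000      # SDXL base UNet, roughly
TEXT_ENC_PARAMS = 820_000_000    # both text encoders combined, roughly

fp16_copy = weights_gib(UNET_PARAMS + TEXT_ENC_PARAMS, 2)  # training copy
fp32_copy = weights_gib(UNET_PARAMS + TEXT_ENC_PARAMS, 4)  # upcast copy

print(f"fp16 copy: {fp16_copy:.1f} GiB")            # 6.4 GiB
print(f"fp32 copy: {fp32_copy:.1f} GiB")            # 12.7 GiB
print(f"both resident: {fp16_copy + fp32_copy:.1f} GiB")  # 19.1 GiB
```

Nearly 20 GiB for weights alone, before activations and CUDA context, which is in the same ballpark as the near-OOM readings reported here.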
This section seems to be responsible for the most significant memory consumption during the entire execution. Memory usage is substantial enough to almost cause an out-of-memory (OOM) situation, as observed from both nvidia-smi and torch.cuda.memory_allocated()/torch.cuda.max_memory_allocated() (on an RTX 4090).
Furthermore, I've noticed another thing I don't understand. While monitoring nvidia-smi, I've observed a 2k MB increase in memory usage during this inference: https://github.com/huggingface/diffusers/blob/cdacd8f1ddaf729f30c9be6fb405c76ae8d1c490/examples/text_to_image/train_text_to_image_lora_sdxl.py#L1160
However, attempting to measure this increase using torch.cuda.max_memory_allocated() doesn't show the same level of growth; it's actually lower. Probably I don't know something important about CUDA memory management.
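For what it's worth, one likely explanation: `nvidia-smi` reports memory reserved from the driver (including the CUDA context and blocks PyTorch's caching allocator keeps around for reuse), while `torch.cuda.memory_allocated()` only counts bytes in live tensors; `torch.cuda.memory_reserved()` is the closer analogue. A toy sketch of the caching behavior (deliberately simplified, not the real allocator):

```python
# Toy model of a caching allocator, illustrating why the number nvidia-smi
# reports (reserved) can grow and stay high while memory_allocated()
# (live tensor bytes) drops back down.
class CachingAllocator:
    def __init__(self):
        self.allocated = 0    # bytes in live tensors -> memory_allocated()
        self.reserved = 0     # bytes held from the driver -> nvidia-smi
        self.free_blocks = 0  # cached, reusable bytes

    def malloc(self, nbytes: int) -> None:
        if self.free_blocks >= nbytes:
            self.free_blocks -= nbytes  # reuse a cached block: reserved unchanged
        else:
            self.reserved += nbytes     # grow the pool from the driver
        self.allocated += nbytes

    def free(self, nbytes: int) -> None:
        self.allocated -= nbytes
        self.free_blocks += nbytes      # cached for reuse, NOT returned to driver

alloc = CachingAllocator()
alloc.malloc(2048)  # e.g. a transient inference buffer
alloc.free(2048)
print(alloc.allocated, alloc.reserved)  # 0 2048
```

After the transient buffer is freed, "allocated" is back to zero but the driver-side number never came down, matching the discrepancy described above.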
It happens towards the end of training, so I wouldn't be concerned about it.
> It happens towards the end of training, so I wouldn't be concerned about it.
After investigating a bit more, I can confirm that the issue isn't just at the end of training. It actually arises every time we save the model (I wasn't checkpointing during training while debugging). The problem slightly differs between the final save and when we checkpoint using the accelerator:
If we can save without first converting to float32, we stay under 15k MB, whereas if we don't, we reach 21.5k MB:
```python
unet = accelerator.unwrap_model(unet)
# unet = unet.to(torch.float32)
unet_lora_layers = unet_attn_processors_state_dict(unet)

if args.train_text_encoder:
    text_encoder_one = accelerator.unwrap_model(text_encoder_one)
    text_encoder_lora_layers = text_encoder_lora_state_dict(
        text_encoder_one  # .to(torch.float32)
    )
    text_encoder_two = accelerator.unwrap_model(text_encoder_two)
    text_encoder_2_lora_layers = text_encoder_lora_state_dict(
        text_encoder_two  # .to(torch.float32)
    )
```
Do we need to save in float32 even when we train in float16?
Also, we're recreating the VAE here, but I think it's not needed and we can delete the unet/text encoders to free up memory. However, it doesn't seem to make a big difference.
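Deleting the training-time models before building the final inference pipeline could look something like this (a sketch of mine, not the script's API; `release_models` and the namespace-dict argument are my own illustration, and torch is optional so the snippet runs anywhere):

```python
import gc

def release_models(namespace: dict, *names: str) -> None:
    """Drop references to large models so their memory can be reclaimed.

    Dropping the last reference frees the tensors back to PyTorch's caching
    allocator; empty_cache() then returns cached blocks to the driver so
    nvidia-smi reflects the drop too.
    """
    for name in names:
        namespace.pop(name, None)
    gc.collect()
    try:
        import torch  # optional, so this sketch also runs without a GPU
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

# In a script one might call, before building the inference pipeline:
# release_models(globals(), "unet", "text_encoder_one", "text_encoder_two")
```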
The second issue arises when we save checkpoints during training, using accelerator.save_state(save_path).
If I execute:
```python
initial_memory = torch.cuda.memory_allocated()
torch.cuda.reset_peak_memory_stats()

accelerator.save_state(save_path)

final_memory = torch.cuda.memory_allocated()
memory_consumed = final_memory - initial_memory
peak_memory = torch.cuda.max_memory_allocated()

logger.info(f"Initial memory: {initial_memory / 1024**2:.2f} MB")
logger.info(f"Final memory: {final_memory / 1024**2:.2f} MB")
logger.info(f"Memory consumed: {memory_consumed / 1024**2:.2f} MB")
logger.info(f"Peak consumed: {peak_memory / 1024**2:.2f} MB")
```
The output is as follows:
```
08/25/2023 15:08:07 - INFO - __main__ - Initial memory: 6880.67 MB
08/25/2023 15:08:07 - INFO - __main__ - Final memory: 6880.67 MB
08/25/2023 15:08:07 - INFO - __main__ - Memory consumed: 0.00 MB
08/25/2023 15:08:07 - INFO - __main__ - Peak consumed: 19919.24 MB
```
If I perform the same operation inside the save_model_hook, I observe:
```
08/25/2023 14:40:47 - INFO - __main__ - Initial memory: 19919.24 MB
08/25/2023 14:40:47 - INFO - __main__ - Final memory: 6880.67 MB
08/25/2023 14:40:47 - INFO - __main__ - Memory consumed: -13038.57 MB
08/25/2023 14:40:47 - INFO - __main__ - Peak consumed: 19919.24 MB
```
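The measurement pattern used here can be factored into a small reusable helper. This is a sketch with the CUDA functions injected as callables so it can run (and be tested) without a GPU; in the training script you'd pass `torch.cuda.memory_allocated`, `torch.cuda.max_memory_allocated`, and `torch.cuda.reset_peak_memory_stats`:

```python
from contextlib import contextmanager

@contextmanager
def peak_memory_probe(report, *, allocated, peak, reset_peak):
    """Report memory consumed and peak usage around a block of code.

    The three callables stand in for torch.cuda.memory_allocated,
    torch.cuda.max_memory_allocated and torch.cuda.reset_peak_memory_stats,
    injected so the helper is testable without CUDA.
    """
    initial = allocated()
    reset_peak()
    try:
        yield
    finally:
        final = allocated()
        report(
            f"Initial: {initial / 1024**2:.2f} MB, "
            f"Final: {final / 1024**2:.2f} MB, "
            f"Consumed: {(final - initial) / 1024**2:.2f} MB, "
            f"Peak: {peak() / 1024**2:.2f} MB"
        )
```

With the real torch functions, the `with` body would wrap `accelerator.save_state(save_path)`.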
Thus, there seems to be a considerable memory increase within save_state before calling the hook. I need to further analyze this to identify the specific cause, which might be similar to what I mentioned earlier, although I'm not entirely certain.
Let me know if it makes sense that I dive deeper into this analysis, or if you believe this behavior is to be expected.
If we can isolate which part causes the most amount of spike in memory occupation, that would be helpful.
> Also, we're recreating the VAE here, but I think it's not needed and we can delete the unet/text encoders to free up memory. However, it doesn't seem to make a big difference.
Actually, we need to keep this until training completes because all of them are needed during training.
I am not entirely sure, but a significant part of the memory could be saved if we precompute the VAE encodings and text embeddings (when the text encoders are not trained). However, for LoRA training this seems like overkill to me, honestly.
@mnslarcher are you using the latest accelerate? (0.22.0). We recently fixed accelerate un-casting from mixed precision which was leading to OOM during model saving: https://github.com/huggingface/accelerate/pull/1868
> If we can isolate which part causes the most amount of spike in memory occupation, that would be helpful.
Good. If it's helpful, I believe I'll have some additional time in the next few days to explore this further. This will also serve as an exercise to better understand the internal workings of the accelerator. [Probably not needed now that I've read Zach's message.]
> We need to keep this until the training completes actually because all of them are needed during training.
Here I'm referring to the final part, where we create the VAE a second time even though we still have the old one, and where we keep the unet and text encoders even though we don't use them anymore and create new ones for the final inference pipeline.
@muellerzr Oh, thanks! No, I'm still using version 0.21.0. This might be the issue with the checkpointing part. Great!
So, if we can also avoid casting to float32 in the final save – which doesn't use Accelerate – this script can run on a card with less than 16GB of memory. Additionally, if version 0.22.0 fixes the problem, perhaps it would be a good idea to increase the minimum required version in the requirements? Or would asking for >= 0.22.0 make it too inflexible?
Tomorrow I'll try with 0.22.0
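On bumping the minimum version: a script-side guard is another option. The helper below is a sketch of mine, not something in the repo; the naive tuple comparison assumes plain `X.Y.Z`-style versions, so a real script should use `packaging.version` instead:

```python
def version_at_least(installed: str, minimum: str) -> bool:
    """Naive version comparison on the first three numeric parts.

    Good enough for releases like "0.21.0" vs "0.22.0"; pre-release tags
    such as "0.22.0rc1" would need packaging.version.parse instead.
    """
    as_tuple = lambda v: tuple(int(part) for part in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(minimum)

# e.g. guarding on the accelerate release reported fixed in this thread:
print(version_at_least("0.21.0", "0.22.0"))  # False
print(version_at_least("0.22.0", "0.22.0"))  # True
```

In the script itself, `importlib.metadata.version("accelerate")` would supply the installed version string.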
> Here I'm referring to the final part, where we create the VAE a second time even though we still have the old one, and where we keep the unet and text encoders even though we don't use them anymore and create new ones for the final inference pipeline.
Maybe worth discussing in a PR.
I think avoiding the float32 casting might just do the trick for the OOMs, but we also need to be careful not to hurt numerical stability.
> Additionally, if version 0.22.0 fixes the problem, perhaps it would be a good idea to increase the minimum required version in the requirements? Or would asking for >= 0.22.0 make it too inflexible?
Let's first explore the points above and then we can consider it.
Good, I can open a PR this weekend. I'm considering trying to make this work with the modifications above on systems with less than 16GB. After that, I'll check if loading the model saved this way produces reasonable images. Unfortunately, I don't have other ideas on how to test numerical stability. If you could suggest some tests here or on the PR, I'll try to do them.
> I'm considering trying to make this work with the modifications above on systems with less than 16GB.
Sure, let's also ensure things don't break for systems having cards with more memory (obvious case but just to be sure).
> Unfortunately, I don't have other ideas on how to test numerical stability. If you could suggest some tests here or on the PR, I'll try to do them.
I think we need to be qualitative here as that is probably the easiest.
Sure, I'll test on my 4090, which has more memory, but I think the script will stay under 16 GB; let's see. I'll run a few qualitative tests with and without float32 and report the results in the PR.
Describe the bug
I encountered two distinct issues while attempting to run the `lambdalabs/pokemon-blip-captions` example of `train_text_to_image_lora_sdxl.py` on an RTX 4090, using bf16.

Problem 1: RuntimeWarning and image processing:
Problem 2: CUDA Out of Memory Error:
Despite the GPU memory usage during training consistently remaining at 67%, I also encounter a CUDA out-of-memory issue after the training concludes:
The error message is as follows:
Hypothesis:
I suspect that memory might not be fully released before the test inference step. Could that be the case?
I intend to investigate this matter further on my own, and I'll provide updates here. If anyone else encounters a solution before I do, kindly share it here as well.
Reproduction
Logs
System Info
OS Name: Ubuntu 22.04.3 LTS
GPU: NVIDIA GeForce RTX 4090

`diffusers-cli env`: diffusers version: 0.21.0.dev0

environment.yml (conda):
Who can help?
@sayakpaul