[Closed] mnslarcher closed this issue 1 year ago
Could you try installing `diffusers` from source? You can ensure this by first uninstalling `diffusers` and then reinstalling it with `pip install git+https://github.com/huggingface/diffusers/`. Also, could you try enabling `gradient_checkpointing` and `enable_xformers_memory_efficient_attention()` to help prevent the OOM?
Thanks @sayakpaul, I'll try all the suggestions tomorrow. I'm pretty sure I'm already installing from source given the `diffusers-cli` output, but I'll try what you suggest.
As for xformers, does it also make sense with the torch 2 I'm using? Or are you suggesting switching to torch 1?
I tried a similar configuration (with `enable_xformers_memory_efficient_attention`, `--mixed_precision="fp16"`, `--use_8bit_adam`, and `--gradient_checkpointing` enabled) and I ran into a CUDA out-of-memory error on a T4 16 GB:

```bash
!accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
  --dataset_name="$INSTANCE_DIR_PARSED" \
  --caption_column="text" \
  --resolution=1024 \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --checkpointing_steps=700000 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="$OUTPUT_DIR" \
  --enable_xformers_memory_efficient_attention \
  --gradient_checkpointing \
  --mixed_precision="fp16" \
  --use_8bit_adam
```
The issue looks related to accelerate v0.22.0 (published a few hours ago). With accelerate==0.21.0, the training finishes correctly. Same problem with DreamBooth training.
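If you hit this regression, one way to work around it until a fix lands is to pin accelerate in your requirements (a sketch; 0.21.0 is simply the last version reported working in this thread):

```
accelerate==0.21.0
```

Once a fixed release is out, the workaround is just relaxing or removing the pin.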
Thanks for sharing. Ccing @muellerzr for https://github.com/huggingface/diffusers/issues/4736#issuecomment-1690786925.
My version of accelerate is pinned to 0.21.0:
```yaml
channels:
  - defaults
dependencies:
  - nb_conda_kernels
  - ipykernel
  - jupyter
  - pip
  - python=3.10
  - pip:
      - accelerate==0.21.0
      - "black[jupyter]==23.7.0"
      - datasets==2.14.4
      - git+https://github.com/huggingface/diffusers
      - ftfy==6.1.1
      - gradio==3.40.1
      - isort==5.12.0
      - Jinja2==3.1.2
      - tensorboard==2.14.0
      - torch==2.0.1
      - torchvision==0.15.2
      - transformers==4.31.0
      - wandb==0.15.8
```
I will try the suggestions, but honestly it seems strange to me that I underutilize resources the whole time and then go OOM during testing. Anyway, I'll do some testing today and report the results here.
Quick live report: I'm making changes one at a time. Here's the current setup:
```bash
export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"

accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --caption_column="text" \
  --resolution=1024 \
  --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=2 \
  --gradient_accumulation_steps=1 \
  --checkpointing_steps=500 \
  --learning_rate=1e-04 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --dataloader_num_workers=0 \
  --seed=42 \
  --output_dir="sd-pokemon-model-lora-sdxl-txt" \
  --train_text_encoder \
  --validation_prompt="cute dragon creature" \
  --report_to="wandb" \
  --mixed_precision="fp16" \
  --rank=4
```
The improved VAE-f16 fixes the issue of black images. I'd also mention this in the LoRA README.md, thanks!
I'm observing suspicious GPU memory behavior. It seems there should be a way to avoid almost reaching OOM during validation/testing while maintaining memory usage around 67% during training. Here's the GPU memory behavior chart for reference:
By the way, as you can see I am almost OOM now, whereas before I was going OOM, so the VAE update "fixes the problem". However, I think the real problem is still there, because I can't fully utilize the GPU if the memory increases by 50% during validation and testing.
> The improved VAE-f16 fixes the issue of black images. I'd also mention this in the LoRA README.md, thanks!
Feel free to drop a PR.
> By the way, as you can see I am almost OOM now, whereas before I was going OOM, so the VAE update "fixes the problem". However, I think the real problem is still there, because I can't fully utilize the GPU if the memory increases by 50% during validation and testing.
We clear the pipeline during validation and testing:
Maybe it could be made better if we reuse the text encoders during validation and testing, which I think we're already doing. Additionally, I would suggest enabling xformers even when using PT 2.0, as it tends to perform slightly better than SDPA.
> Feel free to drop a PR.
I will do it!
> We clear the pipeline during validation and testing:
Yes, that is working; in fact, after that the memory goes down again to 67%. The problem is just before it, I need to explore it further.
> Maybe it could be made better if we reuse the text encoders during validation and testing, which I think we're already doing.
Yes, from here it seems it's already like this:

```python
text_encoder=accelerator.unwrap_model(text_encoder_one),
text_encoder_2=accelerator.unwrap_model(text_encoder_two),
```
> Additionally, I would suggest enabling xformers even when using PT 2.0 as it tends to perform slightly better than SDPA.
Thanks for the suggestion!
@sayakpaul, I believe I've found a place where we're consuming a significant amount of memory.
In this section: https://github.com/huggingface/diffusers/blob/cdacd8f1ddaf729f30c9be6fb405c76ae8d1c490/examples/text_to_image/train_text_to_image_lora_sdxl.py#L1187
We're converting the unet and text encoders to torch.float32.
Afterwards, we create a new pipeline, reusing only the VAE. At this point, it seems to me that we have 2 unets and 4 text encoders in memory, with half of them in fp32. Could this be the case?
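A rough back-of-envelope supports this. The parameter counts below are approximate assumptions of mine (SDXL base UNet around 2.6B parameters, the two text encoders around 0.8B combined), not measured values:

```python
# Back-of-envelope GPU memory for duplicated SDXL weights during validation.
# Parameter counts are approximate assumptions, not measured values.
GIB = 1024**3

def weights_gib(n_params: int, bytes_per_param: int) -> float:
    """Size of one copy of the weights in GiB."""
    return n_params * bytes_per_param / GIB

UNET_PARAMS = 2_600_000_000      # SDXL base UNet, roughly
TEXT_ENC_PARAMS = 820_000_000    # both text encoders combined, roughly

fp16_copy = weights_gib(UNET_PARAMS + TEXT_ENC_PARAMS, 2)  # training copy
fp32_copy = weights_gib(UNET_PARAMS + TEXT_ENC_PARAMS, 4)  # upcast copy

print(f"fp16 copy: {fp16_copy:.1f} GiB")            # 6.4 GiB
print(f"fp32 copy: {fp32_copy:.1f} GiB")            # 12.7 GiB
print(f"both resident: {fp16_copy + fp32_copy:.1f} GiB")  # 19.1 GiB
```

Nearly 20 GiB for weights alone, before activations and CUDA context, which is in the same ballpark as the near-OOM readings reported here.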
This section seems to be responsible for the most significant memory consumption during the entire execution. Memory usage is substantial enough to almost cause an out-of-memory (OOM) situation, as observed from both nvidia-smi and torch.cuda.memory_allocated()/torch.cuda.max_memory_allocated() (on an RTX 4090).
Furthermore, I've noticed another thing I don't understand. While monitoring nvidia-smi, I've observed a 2k MB increase in memory usage during this inference: https://github.com/huggingface/diffusers/blob/cdacd8f1ddaf729f30c9be6fb405c76ae8d1c490/examples/text_to_image/train_text_to_image_lora_sdxl.py#L1160
However, attempting to measure this increase using torch.cuda.max_memory_allocated() doesn't show the same level of growth; it's actually lower. Probably I don't know something important about CUDA memory management.
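For what it's worth, one likely explanation: `nvidia-smi` reports memory reserved from the driver (including the CUDA context and blocks PyTorch's caching allocator keeps around for reuse), while `torch.cuda.memory_allocated()` only counts bytes in live tensors; `torch.cuda.memory_reserved()` is the closer analogue. A toy sketch of the caching behavior (deliberately simplified, not the real allocator):

```python
# Toy model of a caching allocator, illustrating why the number nvidia-smi
# reports (reserved) can grow and stay high while memory_allocated()
# (live tensor bytes) drops back down.
class CachingAllocator:
    def __init__(self):
        self.allocated = 0    # bytes in live tensors -> memory_allocated()
        self.reserved = 0     # bytes held from the driver -> nvidia-smi
        self.free_blocks = 0  # cached, reusable bytes

    def malloc(self, nbytes: int) -> None:
        if self.free_blocks >= nbytes:
            self.free_blocks -= nbytes  # reuse a cached block: reserved unchanged
        else:
            self.reserved += nbytes     # grow the pool from the driver
        self.allocated += nbytes

    def free(self, nbytes: int) -> None:
        self.allocated -= nbytes
        self.free_blocks += nbytes      # cached for reuse, NOT returned to driver

alloc = CachingAllocator()
alloc.malloc(2048)  # e.g. a transient inference buffer
alloc.free(2048)
print(alloc.allocated, alloc.reserved)  # 0 2048
```

After the transient buffer is freed, "allocated" is back to zero but the driver-side number never came down, matching the discrepancy described above.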
It happens towards the end of training, so I wouldn't be concerned about it.
> It happens towards the end of training, so I wouldn't be concerned about it.
After investigating a bit more, I can confirm that the issue isn't just at the end of training. It actually arises every time we save the model (I wasn't checkpointing during training while debugging). The problem slightly differs between the final save and when we checkpoint using the accelerator:
If we can save without first converting to float32, we stay under 15k MB, whereas if we don't, we reach 21.5k MB:
```python
unet = accelerator.unwrap_model(unet)
# unet = unet.to(torch.float32)
unet_lora_layers = unet_attn_processors_state_dict(unet)

if args.train_text_encoder:
    text_encoder_one = accelerator.unwrap_model(text_encoder_one)
    text_encoder_lora_layers = text_encoder_lora_state_dict(
        text_encoder_one  # .to(torch.float32)
    )
    text_encoder_two = accelerator.unwrap_model(text_encoder_two)
    text_encoder_2_lora_layers = text_encoder_lora_state_dict(
        text_encoder_two  # .to(torch.float32)
    )
```
Do we need to save in float32 even when we train in float16?
Also, we're recreating the VAE here, but I think it's not needed and we can delete the unet/text encoders to free up memory. However, it doesn't seem to make a big difference.
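Deleting the training-time models before building the final inference pipeline could look something like this (a sketch of mine, not the script's API; `release_models` and the namespace-dict argument are my own illustration, and torch is optional so the snippet runs anywhere):

```python
import gc

def release_models(namespace: dict, *names: str) -> None:
    """Drop references to large models so their memory can be reclaimed.

    Dropping the last reference frees the tensors back to PyTorch's caching
    allocator; empty_cache() then returns cached blocks to the driver so
    nvidia-smi reflects the drop too.
    """
    for name in names:
        namespace.pop(name, None)
    gc.collect()
    try:
        import torch  # optional, so this sketch also runs without a GPU
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass

# In a script one might call, before building the inference pipeline:
# release_models(globals(), "unet", "text_encoder_one", "text_encoder_two")
```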
The second issue arises when we save checkpoints during training, using accelerator.save_state(save_path).
If I execute:
```python
initial_memory = torch.cuda.memory_allocated()
torch.cuda.reset_peak_memory_stats()

accelerator.save_state(save_path)

final_memory = torch.cuda.memory_allocated()
memory_consumed = final_memory - initial_memory
peak_memory = torch.cuda.max_memory_allocated()

logger.info(f"Initial memory: {initial_memory / 1024**2:.2f} MB")
logger.info(f"Final memory: {final_memory / 1024**2:.2f} MB")
logger.info(f"Memory consumed: {memory_consumed / 1024**2:.2f} MB")
logger.info(f"Peak consumed: {peak_memory / 1024**2:.2f} MB")
```
The output is as follows:
```
08/25/2023 15:08:07 - INFO - __main__ - Initial memory: 6880.67 MB
08/25/2023 15:08:07 - INFO - __main__ - Final memory: 6880.67 MB
08/25/2023 15:08:07 - INFO - __main__ - Memory consumed: 0.00 MB
08/25/2023 15:08:07 - INFO - __main__ - Peak consumed: 19919.24 MB
```
If I perform the same operation inside the save_model_hook, I observe:
```
08/25/2023 14:40:47 - INFO - __main__ - Initial memory: 19919.24 MB
08/25/2023 14:40:47 - INFO - __main__ - Final memory: 6880.67 MB
08/25/2023 14:40:47 - INFO - __main__ - Memory consumed: -13038.57 MB
08/25/2023 14:40:47 - INFO - __main__ - Peak consumed: 19919.24 MB
```
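The measurement pattern used here can be factored into a small reusable helper. This is a sketch with the CUDA functions injected as callables so it can run (and be tested) without a GPU; in the training script you'd pass `torch.cuda.memory_allocated`, `torch.cuda.max_memory_allocated`, and `torch.cuda.reset_peak_memory_stats`:

```python
from contextlib import contextmanager

@contextmanager
def peak_memory_probe(report, *, allocated, peak, reset_peak):
    """Report memory consumed and peak usage around a block of code.

    The three callables stand in for torch.cuda.memory_allocated,
    torch.cuda.max_memory_allocated and torch.cuda.reset_peak_memory_stats,
    injected so the helper is testable without CUDA.
    """
    initial = allocated()
    reset_peak()
    try:
        yield
    finally:
        final = allocated()
        report(
            f"Initial: {initial / 1024**2:.2f} MB, "
            f"Final: {final / 1024**2:.2f} MB, "
            f"Consumed: {(final - initial) / 1024**2:.2f} MB, "
            f"Peak: {peak() / 1024**2:.2f} MB"
        )
```

With the real torch functions, the `with` body would wrap `accelerator.save_state(save_path)`.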
Thus, there seems to be a considerable memory increase within save_state before calling the hook. I need to further analyze this to identify the specific cause, which might be similar to what I mentioned earlier, although I'm not entirely certain.
Let me know if it makes sense that I dive deeper into this analysis, or if you believe this behavior is to be expected.
If we can isolate which part causes the most amount of spike in memory occupation, that would be helpful.
> Also, we're recreating the VAE here, but I think it's not needed and we can delete the unet/text encoders to free up memory. However, it doesn't seem to make a big difference.
Actually, we need to keep this until training completes because all of them are needed during training.
I am not entirely sure, but a significant part of the memory could be saved if we precompute the VAE encodings and text embeddings (when the text encoders are not trained). However, for LoRA training this seems like overkill to me, honestly.
@mnslarcher are you using the latest accelerate? (0.22.0). We recently fixed accelerate un-casting from mixed precision which was leading to OOM during model saving: https://github.com/huggingface/accelerate/pull/1868
> If we can isolate which part causes the most amount of spike in memory occupation, that would be helpful.
Good. If it's helpful, I believe I'll have some additional time in the next few days to explore this further. This will also serve as an exercise to better understand the internal workings of the accelerator. [Probably not needed now that I've read Zach's message.]
> We need to keep this until the training completes actually because all of them are needed during training.
Here I'm referring to the final part, where we create the VAE a second time even though we still have the old one, and where we keep the unet and text encoders even though we don't use them anymore and create new ones for the final inference pipeline.
@muellerzr Oh, thanks! No, I'm still using version 0.21.0. This might be the issue with the checkpointing part. Great!
So, if we can also avoid casting to float32 in the final save – which doesn't use Accelerate – this script can run on a card with less than 16GB of memory. Additionally, if version 0.22.0 fixes the problem, perhaps it would be a good idea to increase the minimum required version in the requirements? Or would asking for >= 0.22.0 make it too inflexible?
Tomorrow I'll try with 0.22.0
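On bumping the minimum version: a script-side guard is another option. The helper below is a sketch of mine, not something in the repo; the naive tuple comparison assumes plain `X.Y.Z`-style versions, so a real script should use `packaging.version` instead:

```python
def version_at_least(installed: str, minimum: str) -> bool:
    """Naive version comparison on the first three numeric parts.

    Good enough for releases like "0.21.0" vs "0.22.0"; pre-release tags
    such as "0.22.0rc1" would need packaging.version.parse instead.
    """
    as_tuple = lambda v: tuple(int(part) for part in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(minimum)

# e.g. guarding on the accelerate release reported fixed in this thread:
print(version_at_least("0.21.0", "0.22.0"))  # False
print(version_at_least("0.22.0", "0.22.0"))  # True
```

In the script itself, `importlib.metadata.version("accelerate")` would supply the installed version string.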
> Here I'm referring to the final part, where we create the VAE a second time even though we still have the old one, and where we keep the unet and text encoders even though we don't use them anymore and create new ones for the final inference pipeline.
Maybe worth discussing in a PR.
I think avoiding the float32 casting might just do the trick for the OOMs, but we also need to be careful not to hurt numerical stability.
> Additionally, if version 0.22.0 fixes the problem, perhaps it would be a good idea to increase the minimum required version in the requirements? Or would asking for >= 0.22.0 make it too inflexible?
Let's first explore the points above and then we can consider it.
Good, I can open a PR this weekend. I'm considering trying to make this work with the modifications above on systems with less than 16GB. After that, I'll check if loading the model saved this way produces reasonable images. Unfortunately, I don't have other ideas on how to test numerical stability. If you could suggest some tests here or on the PR, I'll try to do them.
> I'm considering trying to make this work with the modifications above on systems with less than 16GB.
Sure, let's also ensure things don't break for systems having cards with more memory (obvious case but just to be sure).
> Unfortunately, I don't have other ideas on how to test numerical stability. If you could suggest some tests here or on the PR, I'll try to do them.
I think we need to be qualitative here as that is probably the easiest.
Sure, I'll test on my 4090, which has more memory, but I think the script will stay under 16 GB; let's see. I'll run a few qualitative tests with and without float32 and report the results in the PR.
Describe the bug
I encountered two distinct issues while attempting to run the `lambdalabs/pokemon-blip-captions` example of `train_text_to_image_lora_sdxl.py` on an RTX 4090, using bf16.

Problem 1: RuntimeWarning and image processing:
Problem 2: CUDA Out of Memory Error:
Despite the GPU memory usage during training consistently remaining at 67%, I also encounter a CUDA out-of-memory issue after the training concludes:
The error message is as follows:
Hypothesis:
I suspect that memory might not be fully released before the test inference step. Could that be the case?
I intend to investigate this matter further on my own, and I'll provide updates here. If anyone else encounters a solution before I do, kindly share it here as well.
Reproduction
Logs
System Info
OS Name: Ubuntu 22.04.3 LTS
GPU: NVIDIA GeForce RTX 4090

`diffusers-cli env`: diffusers version: 0.21.0.dev0

environment.yml (conda):
Who can help?
@sayakpaul