
CUDA out of memory while training DreamBooth using AltDiffusion #2089

Closed: airkid closed this issue 1 year ago

airkid commented 1 year ago

Describe the bug

Hi, I'm running the DreamBooth training code. It works well with the default Stable Diffusion v1.4 command: https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/examples/dreambooth/README.md?plain=1#L103. However, when I try AltDiffusion, it reports a CUDA out of memory error: https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/examples/dreambooth/README.md?plain=1#L251. It also fails with a CUDA out of memory error even when I use the 8 GB command: https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/examples/dreambooth/README.md?plain=1#L170.

Reproduction

export MODEL_NAME="BAAI/AltDiffusion"
export INSTANCE_DIR="/root/DreamBooth/data_diffuser/jingtian"
export CLASS_DIR="/root/DreamBooth/data_diffuser/年轻女士"
export OUTPUT_DIR="/root/DreamBooth/alt_diffusion_weights/景甜"

accelerate launch --mixed_precision="fp16" train_dreambooth_diffuser.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="一张#景甜,年轻女士的照片" \
  --class_prompt="一张年轻女士的照片" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800

Logs

bash DreamBoothDiffuser-Alt8g_jingtian.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--num_cpu_threads_per_process` was set to `4` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop
/root/anaconda3/envs/alt/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
  warn(f"Failed to load image Python extension: {e}")
BAAI/AltDiffusion
/root/anaconda3/envs/alt/lib/python3.8/site-packages/accelerate/accelerator.py:179: UserWarning: `log_with=tensorboard` was passed but no supported trackers are currently installed.
  warnings.warn(f"`log_with={log_with}` was passed but no supported trackers are currently installed.")
You are using a model of type xlm-roberta to instantiate a model of type . This is not supported for all configurations of models and can yield errors.
{'variance_type', 'prediction_type'} was not found in config. Values will be initialized to default values.
{'mid_block_type', 'num_class_embeds', 'upcast_attention', 'dual_cross_attention', 'resnet_time_scale_shift', 'use_linear_projection', 'class_embed_type', 'only_cross_attention'} was not found in config. Values will be initialized to default values.
Steps:   0%|                                                                                            | 1/800 [00:01<25:05,  1.88s/it, loss=0.0442, lr=5e-6]Traceback (most recent call last):
  File "train_dreambooth_diffuser.py", line 880, in <module>
    main(args)
  File "train_dreambooth_diffuser.py", line 786, in main
    latents = vae.encode(batch["pixel_values"].to(dtype=weight_dtype)).latent_dist.sample()
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/diffusers/models/autoencoder_kl.py", line 105, in encode
    h = self.encoder(x)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/diffusers/models/vae.py", line 105, in forward
    sample = down_block(sample)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/diffusers/models/unet_2d_blocks.py", line 936, in forward
    hidden_states = resnet(hidden_states, temb=None)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/diffusers/models/resnet.py", line 493, in forward
    output_tensor = (input_tensor + hidden_states) / self.output_scale_factor
RuntimeError: CUDA out of memory. Tried to allocate 128.00 MiB (GPU 0; 15.78 GiB total capacity; 14.47 GiB already allocated; 60.75 MiB free; 14.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%|                                                                                            | 1/800 [00:02<27:54,  2.10s/it, loss=0.0442, lr=5e-6]
Traceback (most recent call last):
  File "/root/anaconda3/envs/alt/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 43, in main
    args.func(args)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/accelerate/commands/launch.py", line 837, in launch_command
    simple_launcher(args)
  File "/root/anaconda3/envs/alt/lib/python3.8/site-packages/accelerate/commands/launch.py", line 354, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/root/anaconda3/envs/alt/bin/python3.8', 'train_dreambooth_diffuser.py', '--pretrained_model_name_or_path=BAAI/AltDiffusion', '--instance_data_dir=/root/DreamBooth/data_diffuser/jingtian', '--class_data_dir=/root/DreamBooth/data_diffuser/年轻女士', '--output_dir=/root/DreamBooth/alt_diffusion_weights/景甜', '--with_prior_preservation', '--prior_loss_weight=1.0', '--instance_prompt=一张#景甜,年轻女士的照片', '--class_prompt=一张年轻女士的照片', '--resolution=512', '--train_batch_size=1', '--sample_batch_size=1', '--gradient_accumulation_steps=1', '--gradient_checkpointing', '--learning_rate=5e-6', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--num_class_images=200', '--max_train_steps=800']' returned non-zero exit status 1.

System Info

libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
WARNING:root:WARNING: libtorch_cuda_cu.so: cannot open shared object file: No such file or directory
Need to compile C++ extensions to get sparse attention suport. Please run python setup.py build develop


patrickvonplaten commented 1 year ago

Hey @airkid,

Note that the text encoder of AltDiffusion is much bigger than that of Stable Diffusion v1-4: https://huggingface.co/BAAI/AltDiffusion-m9/tree/main/text_encoder. The SD v1-4 text encoder is only about 500 MB, so an OOM here is somewhat expected. Can you try reducing the batch size, etc.?
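
To see the size gap concretely, one quick (hypothetical) check is a HEAD request against the resolve URLs, reading the final Content-Length. This assumes the SD v1-4 weights live at CompVis/stable-diffusion-v1-4 and that both repos store the encoder as text_encoder/pytorch_model.bin:

# Sketch: follow redirects on a HEAD request; the last Content-Length is the checkpoint size in bytes.
curl -sIL https://huggingface.co/BAAI/AltDiffusion/resolve/main/text_encoder/pytorch_model.bin | grep -i content-length
curl -sIL https://huggingface.co/CompVis/stable-diffusion-v1-4/resolve/main/text_encoder/pytorch_model.bin | grep -i content-length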

airkid commented 1 year ago

Hi @patrickvonplaten. Yes, I see that. As the reproduction command shows, my batch size is already set to 1. Is there any way to avoid the OOM? I tried adjusting the resolution, but it didn't help. BTW, my GPU is a single V100 16 GB.

chjose commented 1 year ago

@patrickvonplaten @patil-suraj I am getting a very similar error with export MODEL_NAME="stabilityai/stable-diffusion-2". I changed the resolution to 768 and the batch size is 1.

Any other workaround? The same script works well with Stable Diffusion 1.4.

patil-suraj commented 1 year ago

Both AltDiffusion and SD 2.1 have a bigger text encoder than SD 1.*, and SD 2.1 also uses a higher resolution, which is the reason for the OOM. It should fit in 16 GB if you enable 8-bit Adam and xformers; these can be enabled with the flags --use_8bit_adam and --enable_xformers_memory_efficient_attention. To install xformers:

pip install --pre xformers
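
For reference, a sketch of how those two flags would slot into the original AltDiffusion command from this issue (note that --use_8bit_adam additionally requires bitsandbytes, e.g. pip install bitsandbytes):

accelerate launch --mixed_precision="fp16" train_dreambooth_diffuser.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --instance_data_dir=$INSTANCE_DIR \
  --class_data_dir=$CLASS_DIR \
  --output_dir=$OUTPUT_DIR \
  --with_prior_preservation --prior_loss_weight=1.0 \
  --instance_prompt="一张#景甜,年轻女士的照片" \
  --class_prompt="一张年轻女士的照片" \
  --resolution=512 \
  --train_batch_size=1 \
  --sample_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --gradient_checkpointing \
  --use_8bit_adam \
  --enable_xformers_memory_efficient_attention \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --num_class_images=200 \
  --max_train_steps=800
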
chjose commented 1 year ago

Thanks @patil-suraj. I got past the CUDA out of memory error by installing xformers and adding the --enable_xformers_memory_efficient_attention flag.

Launch Script:

export MODEL_NAME="stabilityai/stable-diffusion-2"
export INSTANCE_DIR="/home/ubuntu/character_images/dreambooth_example"
export OUTPUT_DIR="/home/ubuntu/character_images/model_sd2"

accelerate launch train_dreambooth.py \
  --pretrained_model_name_or_path=$MODEL_NAME  \
  --instance_data_dir=$INSTANCE_DIR \
  --output_dir=$OUTPUT_DIR \
  --instance_prompt="a photo of sks dog" \
  --resolution=768 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=1 \
  --learning_rate=5e-6 \
  --lr_scheduler="constant" \
  --enable_xformers_memory_efficient_attention \
  --lr_warmup_steps=0 \
  --max_train_steps=400

airkid commented 1 year ago

Thanks @patil-suraj! It works for me on a 16 GB V100 now.