Fine-tuning Stable Diffusion with UNet3DConditionModel runs out of memory

Describe the bug

I am trying to fine-tune Stable Diffusion with train_text_to_image.py, with UNet2DConditionModel replaced by UNet3DConditionModel. Training runs on 8 GPUs (Tesla V100, 32 GB each) with a batch size of 1, but the job fails with a CUDA out-of-memory error. Are there any suggestions for reducing memory usage, or do I need more powerful GPUs to train this model?

Part of the error is shown below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 6 has a total capacty of 31.75 GiB of which 44.75 MiB is free. Including non-PyTorch memory, this process has 31.70 GiB memory in use. Of the allocated memory 29.24 GiB is allocated by PyTorch, and 731.11 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
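As a quick reading of the numbers in that message, the sketch below works out how much of the card is taken by live allocations and how much is reserved but unallocated (the part that `max_split_size_mb` targets). All figures are copied from the error text; the variable names are only illustrative.

```python
# Figures copied from the CUDA OOM message above.
MIB_PER_GIB = 1024

total_gib = 31.75                  # total capacity of GPU 6
allocated_gib = 29.24              # allocated by PyTorch
reserved_unallocated_mib = 731.11  # reserved by PyTorch but unallocated
requested_mib = 50.00              # size of the failing allocation
free_mib = 44.75                   # free memory at the time of failure

# Share of the card held by live tensors.
allocated_share = allocated_gib / total_gib

# Share of the card that is cached but unused (possible fragmentation).
fragmentation_share = (reserved_unallocated_mib / MIB_PER_GIB) / total_gib

# How far short the free memory fell of the request.
shortfall_mib = requested_mib - free_mib

print(f"allocated by PyTorch:     {allocated_share:.1%} of the card")
print(f"reserved but unallocated: {fragmentation_share:.1%} of the card")
print(f"request exceeded free memory by {shortfall_mib:.2f} MiB")
```

Live allocations account for roughly 92% of the card and the reserved-but-unallocated pool for only about 2%, which suggests the GPU is genuinely full rather than fragmented, so tuning `max_split_size_mb` alone seems unlikely to be enough.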

Reproduction

The command for running train_text_to_image.py:

export MODEL_NAME="/jmain02/home/J2AD006/jxb03/ddy08-jxb03/stable-diffusion-v1-4"
echo $CUDA_VISIBLE_DEVICES
accelerate launch --mixed_precision="fp16" --multi_gpu /jmain02/home/J2AD006/jxb03/ddy08-jxb03/diffusers/examples/text_to_image/train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir="/jmain02/home/J2AD006/jxb03/ddy08-jxb03/dataset-xxx" \
  --use_ema \
  --train_batch_size=1 \
  --dataloader_num_workers=40 --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="output_3D"
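For context on what these flags imply for memory, here is a rough back-of-the-envelope sketch. It assumes the UNet is kept in fp32 and optimized with standard AdamW (two fp32 moment buffers per parameter), and that `--use_ema` keeps one extra fp32 copy of the weights on the GPU. The parameter count is an assumption: about 860M is the size of the Stable Diffusion v1 2D UNet, and the UNet3DConditionModel used here may differ. Activations, the VAE, the text encoder, and DDP buffers are not included.

```python
def effective_batch_size(per_gpu_batch: int, num_processes: int, grad_accum_steps: int) -> int:
    """Samples contributing to each optimizer step across all GPUs."""
    return per_gpu_batch * num_processes * grad_accum_steps


def static_training_memory_gib(num_params: int, use_ema: bool = True) -> float:
    """Per-GPU memory for weights, gradients, and optimizer state (no activations).

    Assumes fp32 weights (4 B), fp32 gradients (4 B), and two fp32 AdamW
    moment buffers (8 B) per parameter, plus an fp32 EMA copy (4 B) if enabled.
    """
    bytes_per_param = 4 + 4 + 8
    if use_ema:
        bytes_per_param += 4
    return num_params * bytes_per_param / 1024**3


# Values taken from the command and accelerate config above.
batch = effective_batch_size(per_gpu_batch=1, num_processes=8, grad_accum_steps=4)

# ASSUMPTION: parameter count of the SD v1 2D UNet, used only as a stand-in.
ASSUMED_UNET_PARAMS = 860_000_000

with_ema = static_training_memory_gib(ASSUMED_UNET_PARAMS, use_ema=True)
without_ema = static_training_memory_gib(ASSUMED_UNET_PARAMS, use_ema=False)

print(f"effective batch size: {batch}")
print(f"static memory with EMA:    {with_ema:.1f} GiB per GPU")
print(f"static memory without EMA: {without_ema:.1f} GiB per GPU")
```

Under these assumptions, about half of each 32 GB card is taken before any activation is stored, and the per-GPU batch size is already 1, so it cannot be lowered further. Data-parallel training across 8 GPUs replicates this state on every card rather than splitting it.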

The accelerate env is shown below,

- Accelerate version: 0.25.0
- Platform: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-glibc2.17
- Python version: 3.11.5
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.2+cu118 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 503.30 GB
- GPU type: Tesla V100-SXM2-32GB-LS
- Accelerate default config:
    - compute_environment: LOCAL_MACHINE
    - distributed_type: MULTI_GPU
    - mixed_precision: fp16
    - use_cpu: False
    - debug: False
    - num_processes: 8
    - machine_rank: 0
    - num_machines: 1
    - gpu_ids: all
    - rdzv_backend: static
    - same_network: True
    - main_training_function: main
    - downcast_bf16: no
    - tpu_use_cluster: False
    - tpu_use_sudo: False
    - tpu_env: []

Logs

No response

System Info

diffusers version: 0.25.0.dev0
Platform: Linux-3.10.0-1160.80.1.el7.x86_64-x86_64-with-glibc2.17
Python version: 3.11.5
PyTorch version (GPU?): 2.1.2+cu118 (True)
Huggingface_hub version: 0.20.1
Transformers version: 4.36.2
Accelerate version: 0.25.0
xFormers version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@sayakpaul @patrickvonplaten
