huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Fine-tuning with stable diffusion with UNet3DConditionModel out-of-memory #6536

dahui-y closed 10 months ago

dahui-y commented 10 months ago

Describe the bug

I tried to train SD based on train_text_to_image.py, but I replaced UNet2D with UNet3D. I trained on 8 GPUs (Tesla V100) with a batch size of 1, but the job failed with an "out of memory" error. Do you have any suggestions for reducing memory usage, or should I use more powerful GPUs to train this model?

Part of the error is shown below:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB. GPU 6 has a total capacty of 31.75 GiB of which 44.75 MiB is free. Including non-PyTorch memory, this process has 31.70 GiB memory in use. Of the allocated memory 29.24 GiB is allocated by PyTorch, and 731.11 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
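The error message suggests max_split_size_mb only when the "reserved but unallocated" figure is large, so it can help to check how large it actually is here. The following is a minimal sketch that parses the numbers out of the message quoted above; the helper names (`to_gib`, `summarize_oom`) are made up for illustration and are not part of PyTorch or diffusers.

```python
import re

# Text copied from the error message quoted in this issue (spelling as logged).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 50.00 MiB. "
    "GPU 6 has a total capacty of 31.75 GiB of which 44.75 MiB is free. "
    "Including non-PyTorch memory, this process has 31.70 GiB memory in use. "
    "Of the allocated memory 29.24 GiB is allocated by PyTorch, "
    "and 731.11 MiB is reserved by PyTorch but unallocated."
)

_UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def to_gib(value: str, unit: str) -> float:
    """Convert a '<number> <MiB|GiB>' pair to GiB."""
    return float(value) * _UNIT_TO_GIB[unit]


def summarize_oom(message: str) -> dict:
    """Extract the memory figures from an OOM message of this shape (hypothetical helper)."""
    size = r"([\d.]+) (MiB|GiB)"
    patterns = {
        "total_gib": rf"total capac\w* of {size}",
        "allocated_gib": rf"{size} is allocated by PyTorch",
        "reserved_unallocated_gib": rf"{size} is reserved by PyTorch but unallocated",
    }
    summary = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {key} in message")
        summary[key] = to_gib(*match.groups())
    # Share of the card that is cached by the allocator but not in use.
    summary["fragmentation_share"] = (
        summary["reserved_unallocated_gib"] / summary["total_gib"]
    )
    return summary


if __name__ == "__main__":
    for name, value in summarize_oom(OOM_MESSAGE).items():
        print(f"{name}: {value:.3f}")
```

For this message the reserved-but-unallocated memory is roughly 2% of the card, while allocated tensors take up about 29 of the 31.75 GiB. That suggests the model and its training state simply do not fit, so tuning PYTORCH_CUDA_ALLOC_CONF alone is unlikely to be enough.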

Reproduction

The command for running train_text_to_image.py:

export MODEL_NAME="/jmain02/home/J2AD006/jxb03/ddy08-jxb03/stable-diffusion-v1-4"
echo $CUDA_VISIBLE_DEVICES
accelerate launch --mixed_precision="fp16" --multi_gpu /jmain02/home/J2AD006/jxb03/ddy08-jxb03/diffusers/examples/text_to_image/train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --train_data_dir="/jmain02/home/J2AD006/jxb03/ddy08-jxb03/dataset-xxx" \
  --use_ema \
  --train_batch_size=1 \
  --dataloader_num_workers=40 --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="output_3D"
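To see why a batch size of 1 can still run out of memory, it may help to estimate the memory that does not depend on batch size at all. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes the trainable UNet weights, their gradients, two AdamW moment buffers, and the EMA copy enabled by --use_ema are each held in fp32 on every GPU. The parameter count is also an assumption: about 860 million is the commonly cited size of the 2D Stable Diffusion v1 UNet, and the count for the 3D model used here is not given in this issue.

```python
BYTES_PER_GIB = 1024**3


def training_state_gib(
    num_params: int,
    use_ema: bool = True,
    bytes_per_value: int = 4,  # fp32 (assumed for weights, grads, optimizer, EMA)
    adam_moments: int = 2,     # AdamW keeps two moment buffers per parameter
) -> float:
    """Rough per-GPU memory for weights + grads + optimizer state (+ EMA copy).

    Activations, the VAE, and the text encoder are NOT included.
    """
    copies = 1 + 1 + adam_moments + (1 if use_ema else 0)
    return num_params * bytes_per_value * copies / BYTES_PER_GIB


if __name__ == "__main__":
    # Assumed baseline: ~860M parameters (2D SD v1 UNet). Replace with the
    # real count, e.g. sum(p.numel() for p in unet.parameters()).
    assumed_params = 860_000_000
    print(f"with EMA:    {training_state_gib(assumed_params):.1f} GiB")
    print(f"without EMA: {training_state_gib(assumed_params, use_ema=False):.1f} GiB")
```

Under these assumptions the 2D baseline already needs about half of a 32 GiB V100 before any activations are stored, and a 3D UNet with additional layers would need more. Because --multi_gpu replicates this state on every GPU, adding GPUs does not reduce it; dropping --use_ema or using a memory-efficient optimizer are the kinds of changes that would.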

The accelerate env is shown below,

Logs

No response

System Info

Who can help?

@sayakpaul @patrickvonplaten

sayakpaul commented 10 months ago

This is better asked as a question on https://github.com/huggingface/diffusers/discussions, as it doesn't directly concern the library. Could you please open one there?