huggingface / diffusers

πŸ€— Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

CUDA out of memory when fine-tuning the Pokemon example in train_text_to_image.py #3094

Closed dora-lemon closed 1 year ago

dora-lemon commented 1 year ago

This is my bash script; I only added --enable_xformers_memory_efficient_attention compared to the original:

export MODEL_NAME="CompVis/stable-diffusion-v1-4"
export dataset_name="lambdalabs/pokemon-blip-captions"

accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
  --resolution=512 --center_crop --random_flip \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
  --enable_xformers_memory_efficient_attention \
  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" 

As stated in the docs, it should be possible to fine-tune on a single 24 GB 3090, but I still get CUDA out of memory:

The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `1`
    `--num_machines` was set to a value of `1`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/accelerate/accelerator.py:249: FutureWarning: `logging_dir` is deprecated and will be removed in version 0.18.0 of πŸ€— Accelerate. Use `project_dir` instead.
  warnings.warn(
04/13/2023 23:06:51 - INFO - __main__ - Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Mixed precision type: fp16

{'sample_max_value', 'thresholding', 'prediction_type', 'dynamic_thresholding_ratio', 'clip_sample_range', 'variance_type'} was not found in config. Values will be initialized to default values.
{'norm_num_groups'} was not found in config. Values will be initialized to default values.
{'timestep_post_act', 'projection_class_embeddings_input_dim', 'num_class_embeds', 'mid_block_only_cross_attention', 'resnet_time_scale_shift', 'only_cross_attention', 'conv_in_kernel', 'resnet_skip_time_act', 'time_embedding_type', 'encoder_hid_dim', 'class_embed_type', 'dual_cross_attention', 'class_embeddings_concat', 'upcast_attention', 'cross_attention_norm', 'time_embedding_act_fn', 'conv_out_kernel', 'resnet_out_scale_factor', 'time_cond_proj_dim', 'mid_block_type', 'use_linear_projection'} was not found in config. Values will be initialized to default values.
{'timestep_post_act', 'projection_class_embeddings_input_dim', 'num_class_embeds', 'mid_block_only_cross_attention', 'resnet_time_scale_shift', 'only_cross_attention', 'conv_in_kernel', 'resnet_skip_time_act', 'time_embedding_type', 'encoder_hid_dim', 'class_embed_type', 'dual_cross_attention', 'class_embeddings_concat', 'upcast_attention', 'cross_attention_norm', 'time_embedding_act_fn', 'conv_out_kernel', 'resnet_out_scale_factor', 'time_cond_proj_dim', 'mid_block_type', 'use_linear_projection'} was not found in config. Values will be initialized to default values.
04/13/2023 23:07:07 - WARNING - datasets.builder - Found cached dataset parquet (/home/user/.cache/huggingface/datasets/lambdalabs___parquet/lambdalabs--pokemon-blip-captions-10e3527a764857bd/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 997.93it/s]
04/13/2023 23:07:09 - INFO - __main__ - ***** Running training *****
04/13/2023 23:07:09 - INFO - __main__ -   Num examples = 833
04/13/2023 23:07:09 - INFO - __main__ -   Num Epochs = 72
04/13/2023 23:07:09 - INFO - __main__ -   Instantaneous batch size per device = 1
04/13/2023 23:07:09 - INFO - __main__ -   Total train batch size (w. parallel, distributed & accumulation) = 4
04/13/2023 23:07:09 - INFO - __main__ -   Gradient Accumulation steps = 4
04/13/2023 23:07:09 - INFO - __main__ -   Total optimization steps = 15000
Steps:   0%|                                                                                      | 0/15000 [00:00<?, ?it/s]/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/xformers/ops/fmha/flash.py:338: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
Steps:   0%|                                                           | 0/15000 [00:02<?, ?it/s, lr=1e-5, step_loss=0.0414]Traceback (most recent call last):
  File "train_text_to_image.py", line 926, in <module>
    main()
  File "train_text_to_image.py", line 853, in main
    optimizer.step()
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/accelerate/optimizer.py", line 134, in step
    self.scaler.step(self.optimizer, closure)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/optim/adamw.py", line 171, in step
    adamw(
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/optim/adamw.py", line 321, in adamw
    func(
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/torch/optim/adamw.py", line 566, in _multi_tensor_adamw
    denom = torch._foreach_add(exp_avg_sq_sqrt, eps)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 50.00 MiB (GPU 0; 23.70 GiB total capacity; 21.48 GiB already allocated; 38.12 MiB free; 21.70 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Steps:   0%|                                                           | 0/15000 [00:02<?, ?it/s, lr=1e-5, step_loss=0.0414]
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/dfs/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/accelerate/commands/launch.py", line 923, in launch_command
    simple_launcher(args)
  File "/home/user/anaconda3/envs/dfs/lib/python3.8/site-packages/accelerate/commands/launch.py", line 579, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/anaconda3/envs/dfs/bin/python', 'train_text_to_image.py', '--pretrained_model_name_or_path=CompVis/stable-diffusion-v1-4', '--dataset_name=lambdalabs/pokemon-blip-captions', '--use_ema', '--resolution=512', '--center_crop', '--random_flip', '--train_batch_size=1', '--gradient_accumulation_steps=4', '--gradient_checkpointing', '--enable_xformers_memory_efficient_attention', '--max_train_steps=15000', '--learning_rate=1e-05', '--max_grad_norm=1', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--output_dir=sd-pokemon-model']' returned non-zero exit status 1.
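For completeness, the allocator hint at the end of the error message can also be tried by setting PYTORCH_CUDA_ALLOC_CONF before relaunching; a minimal sketch, where the 512 MiB split size is just an example value and not something tested in this thread:

# Allocator hint from the OOM message: cap the split size to reduce fragmentation.
# The 512 MiB value is an arbitrary example, not a tested recommendation.
export PYTORCH_CUDA_ALLOC_CONF="max_split_size_mb:512"

# then rerun the same command as above
accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  ...   # remaining flags unchanged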
dora-lemon commented 1 year ago

I solved the problem by using 8-bit Adam. Are there any side effects?

hkristof03 commented 1 year ago

I am following the official tutorial.

It mentions "Diffusers now provides a LoRA fine-tuning script that can run in as low as 11 GB of GPU RAM without resorting to tricks such as 8-bit optimizers".

I have an RTX 3080 16 GB card and I am using the default settings from the tutorial: batch size of 1, fp16, 4 validation images. When the validation loop runs I get a CUDA OOM. I see in the script that during validation the model being trained is kept in GPU memory while, at the same time, the script tries to load a new pipeline.

I am wondering: if the script fails with 16 GB, how is it possible to train with the stated 11 GB?

Does anyone have a solution for this?
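One thing that might reduce the validation-time peak is generating fewer validation images, or validating less often. A rough sketch, assuming the LoRA script exposes --num_validation_images and --validation_epochs like some of the other example scripts (I have not verified these flag names against this exact version):

# Rough sketch: fewer / less frequent validation images to shrink the validation-time peak.
# --num_validation_images and --validation_epochs are assumed flag names; check the script's
# argument parser before relying on them.
accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --train_batch_size=1 \
  --num_validation_images=1 \
  --validation_epochs=10 \
  --output_dir="sd-pokemon-model-lora" \
  ...   # remaining flags as in the tutorial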

sayakpaul commented 1 year ago

You're using the LoRA variant, right?

But the snippet you provided in the description uses train_text_to_image.py, which is the non-LoRA script. Just making sure that was not a mistake.

I have an RTX 3080 16 GB card

I just tested it on a Tesla T4 and it worked. Could you provide a snapshot of what you get after running diffusers-cli env?

hkristof03 commented 1 year ago

The script I referred to is train_text_to_image_lora.py, referenced in the Hugging Face LoRA tutorial.

By the way, I managed to solve the OOM issue with xformers...

sayakpaul commented 1 year ago

Okay then. The reason I mentioned it is that you stated:

accelerate launch --mixed_precision="fp16"  train_text_to_image.py \
  ...

And not train_text_to_image_lora.py.

dora-lemon commented 1 year ago

@sayakpaul thanks for your comment, but @hkristof03 just posted his problem on my issue. It may be the same topic, but he did not use the script I listed above.

kaimingd commented 1 year ago

@dora-lemon I ran into the same problem. Would you kindly tell me how to switch Adam to 8-bit?

sayakpaul commented 1 year ago

@sayakpaul thanks for your comment, but @hkristof03 just posted his problem on my issue. It may be the same topic, but he did not use the script I listed above.

My reply still remains the same, though. Did you try the train_text_to_image_lora.py script and not the train_text_to_image.py script?

If the LoRA script is failing, then please consider enabling xformers with --enable_xformers_memory_efficient_attention. Know more here: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-xformers
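A minimal sketch of what that looks like, assuming xformers is not installed yet (the flag name is the one documented at the link above; the remaining flags stay whatever you are already passing):

# Install xformers, then pass the memory-efficient-attention flag to the LoRA script.
pip install xformers

accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --enable_xformers_memory_efficient_attention \
  ...   # remaining flags unchanged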

Does this help?

dora-lemon commented 1 year ago

@kaimingd just add --use_8bit_adam
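Roughly, the change looks like this; a minimal sketch assuming bitsandbytes is installed (the script needs it for the 8-bit optimizer) and with the remaining flags kept the same as in my original command:

# 8-bit Adam needs the bitsandbytes package.
pip install bitsandbytes

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_8bit_adam \
  ...   # remaining flags unchanged from my original command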

kaimingd commented 1 year ago

@dora-lemon Thanks a lot. It really worked!

sayakpaul commented 1 year ago

Closing the issue then :)

Please feel free to reopen in case of problems.