huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

train_text_to_image_sdxl.py Can't save model at checkpoint #7311

Open · clement-swk opened this issue 3 months ago

clement-swk commented 3 months ago

Describe the bug

I am trying to finetune SDXL but the training script crashes when saving the model at a checkpoint. Training runs fine.

Reproduction

Here are my accelerate config choices (the config.yaml contents are pasted in a comment below):

Then I run the following command, taken from the example in examples/text_to_image/README_sdxl.md:

export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0"
export VAE_NAME="madebyollin/sdxl-vae-fp16-fix"
export DATASET_NAME="lambdalabs/pokemon-blip-captions"

Here I only modified checkpointing_steps so that the error shows up faster:

accelerate launch train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=10000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=5 \
  --output_dir="sdxl-pokemon-model"

Logs

... (training starts) ...
Steps:   0%|          | 4/10000 [00:36<21:14:27,  7.65s/it, lr=1e-6, step_loss=0.0118][2024-03-14 02:46:19,909] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456, reducing to 134217728

Steps:   0%|          | 5/10000 [00:38<21:02:03,  7.58s/it, lr=1e-6, step_loss=0.0118]03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving current state to sdxl-pokemon-model/checkpoint-5
03/14/2024 02:46:19 - INFO - accelerate.accelerator - Saving DeepSpeed Model and Optimizer
[2024-03-14 02:46:19,913] [INFO] [logging.py:96:log_dist] [Rank 0] [Torch] Checkpoint pytorch_model is about to be saved!
/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py:1877: UserWarning: Positional args are being deprecated, use kwargs instead. Refer to https://pytorch.org/docs/master/generated/torch.nn.Module.html#torch.nn.Module.state_dict for details.
  warnings.warn(
[2024-03-14 02:46:19,937] [INFO] [logging.py:96:log_dist] [Rank 0] Saving model checkpoint: sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt
[2024-03-14 02:46:19,937] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt...
[2024-03-14 02:46:36,762] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/mp_rank_00_model_states.pt.
[2024-03-14 02:46:36,766] [INFO] [torch_checkpoint_engine.py:21:save] [Torch] Saving sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt...
[2024-03-14 02:47:03,094] [INFO] [torch_checkpoint_engine.py:23:save] [Torch] Saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt.
[2024-03-14 02:47:03,095] [INFO] [engine.py:3488:_save_zero_checkpoint] zero checkpoint saved sdxl-pokemon-model/checkpoint-5/pytorch_model/zero_pp_rank_0_mp_rank_00_optim_states.pt
[2024-03-14 02:47:03,095] [INFO] [torch_checkpoint_engine.py:33:commit] [Torch] Checkpoint pytorch_model is ready now!
03/14/2024 02:47:03 - INFO - accelerate.accelerator - DeepSpeed Model and Optimizer saved to output dir sdxl-pokemon-model/checkpoint-5/pytorch_model
Configuration saved in sdxl-pokemon-model/checkpoint-5/unet/config.json
Traceback (most recent call last):
  File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1312, in <module>
    main(args)
  File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 1169, in main
    accelerator.save_state(save_path)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 2706, in save_state
    hook(self._models, weights, output_dir)
  File "/root/diffusers/examples/text_to_image/train_text_to_image_sdxl.py", line 731, in save_model_hook
    model.save_pretrained(os.path.join(output_dir, "unet"))
  File "/root/diffusers/src/diffusers/models/modeling_utils.py", line 369, in save_pretrained
    safetensors.torch.save_file(
  File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/usr/local/lib/python3.10/dist-packages/safetensors/torch.py", line 394, in _flatten
    raise RuntimeError(
RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'down_blocks.2.attentions.0.transformer_blocks.9.norm1.weight', 'up_blocks.0.attentions.0.transformer_blocks.5.attn2.to_out.0.bias', 'up_blocks.1.attentions.0.transformer_blocks.1.attn2.to_v.weight', 
...... (lots of layers) .....
'up_blocks.0.attentions.2.transformer_blocks.8.norm1.bias', 'up_blocks.0.attentions.0.transformer_blocks.1.attn2.to_out.0.bias'}].
            A potential way to correctly save your model is to use `save_model`.
            More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

[2024-03-14 02:47:08,088] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 10510) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1002, in launch_command
    deepspeed_launcher(args)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 718, in deepspeed_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train_text_to_image_sdxl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-03-14_02:47:08
  host      : 4e28de93c858
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 10510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
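
For context, the crash itself comes from safetensors refusing to serialize tensors that share storage (likely because, under DeepSpeed, the parameters handed to the save hook are views into flattened buffers, so many of them alias the same memory). The error message points at `safetensors.torch.save_model`, which deduplicates shared tensors before writing. Below is a minimal standalone sketch of the difference, purely illustrative and not code from the training script:

import torch.nn as nn
from safetensors.torch import save_file, save_model

class TiedModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(4, 4, bias=False)
        self.b = nn.Linear(4, 4, bias=False)
        self.b.weight = self.a.weight  # two parameters backed by the same storage

model = TiedModel()

try:
    # save_file walks the raw state dict and raises the RuntimeError seen above
    save_file(model.state_dict(), "tied.safetensors")
except RuntimeError as err:
    print("save_file failed:", err)

# save_model deduplicates shared tensors before writing, so this succeeds
save_model(model, "tied.safetensors")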

System Info

I have an RTX 4090.

Who can help?

@sayakpaul

sayakpaul commented 3 months ago

Does it happen without DeepSpeed? I am sadly not well-versed in DeepSpeed, so cannot help much.

https://github.com/huggingface/diffusers/pull/6628/files should fix the problem I think.

clement-swk commented 3 months ago

@sayakpaul I need DeepSpeed; otherwise training won't start (NVIDIA out-of-memory error).

sayakpaul commented 3 months ago

I have edited my comment. See if that helps.

clement-swk commented 3 months ago

I have added the configuration to the command as

accelerate launch --config_file $ACCELERATE_CONFIG_FILE train_text_to_image_sdxl.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --pretrained_vae_model_name_or_path=$VAE_NAME \
  --dataset_name=$DATASET_NAME \
  --enable_xformers_memory_efficient_attention \
  --resolution=512 --center_crop --random_flip \
  --proportion_empty_prompts=0.2 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 --gradient_checkpointing \
  --max_train_steps=10000 \
  --use_8bit_adam \
  --learning_rate=1e-06 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --mixed_precision="fp16" \
  --validation_prompt="a cute Sundar Pichai creature" --validation_epochs 5 \
  --checkpointing_steps=5 \
  --output_dir="sdxl-pokemon-model"

but the same problem happens.

This was the config.yaml file I had

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 4
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

sayakpaul commented 3 months ago

How about applying changes from https://github.com/huggingface/diffusers/pull/6628/? More specifically, the changes introduced in examples/text_to_image/train_text_to_image_lora_sdxl.py?

clement-swk commented 3 months ago

@sayakpaul Trying

if isinstance(unwrap_model(model), type(unwrap_model(unet))):
    model.save_pretrained(os.path.join(output_dir, "unet"))

in the code didn't change the error

sayakpaul commented 3 months ago

How about:

if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+    unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))

?

clement-swk commented 3 months ago

@sayakpaul The same error happens, even with

if isinstance(unwrap_model(model), type(unwrap_model(unet))):
+    unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))

I have also tried running train_text_to_image_lora_sdxl.py to see if it worked, and got the same error as with train_text_to_image_sdxl.py.

Deactivating deepspeed makes train_text_to_image_lora_sdxl.py work fine.

sayakpaul commented 3 months ago

Cc: @HelloWorldBeginner. Could you help here if you have any pointers?

HelloWorldBeginner commented 3 months ago

I haven't used CPU offload in DeepSpeed, but ZeRO stage 2 works fine on 8x A100s.

AoqunJin commented 3 months ago

@clement-swk When using DeepSpeed, you can install Apex, which DeepSpeed will pick up automatically. That works for me.

git clone https://github.com/NVIDIA/apex.git
cd apex
git checkout 741bdf50825a97664db08574981962d66436d16a
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --global-option="--cuda_ext" --global-option="--cpp_ext"

clement-swk commented 3 months ago

@AoqunJin Thanks for your reply! I tried installing apex but the problem remains.

AoqunJin commented 3 months ago

@clement-swk

You can also try removing the accelerator.is_main_process check. That way save_model is not called only from the main process, which cannot access the states held on the other devices.

In

    def save_model_hook(models, weights, output_dir):
        # if accelerator.is_main_process:

And

    train_loss = 0.0
    # if accelerator.is_main_process:
    if global_step % args.checkpointing_steps == 0:

in train_text_to_image_lora_sdxl.py.
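
Putting the two changes together, the hook would look roughly like the sketch below (based on the snippets above, not a verified fix; unwrap_model, unet and weights come from the training script):

    def save_model_hook(models, weights, output_dir):
        # is_main_process check removed so every rank reaches the save path
        for model in models:
            if isinstance(unwrap_model(model), type(unwrap_model(unet))):
                unwrap_model(model).save_pretrained(os.path.join(output_dir, "unet"))
            # pop the weight so the corresponding model is not saved again
            if weights:
                weights.pop()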

clement-swk commented 2 months ago

@AoqunJin I tried it and the same error appeared.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sayakpaul commented 1 week ago

Will give this a look.

sayakpaul commented 1 week ago

I proposed a couple of fixes here: https://github.com/huggingface/accelerate/issues/2787. Does this help?
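
In the meantime, one possible stopgap, not what the linked issue proposes, is to bypass safetensors inside save_model_hook so the shared-storage check never fires; a rough sketch, with unwrap_model, model and output_dir as in the training script:

    # falls back to torch.save-style .bin weights instead of safetensors
    unwrap_model(model).save_pretrained(
        os.path.join(output_dir, "unet"),
        safe_serialization=False,
    )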