huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

The shared tensors saving issue caused by safetensors comes to diffusers after updating diffusers and related libs #4361

Closed eeyrw closed 1 year ago

eeyrw commented 1 year ago

Describe the bug

It is about https://github.com/huggingface/safetensors/issues/202 and https://github.com/huggingface/transformers/pull/22437. For a certain use case, the issue has finally reached diffusers. From my observation, the shared tensors are introduced by ZeroRedundancyOptimizer, then flow into the save_pretrained method of ModelMixin, and finally make safetensors angry 😤. Because save_pretrained performs the actual saving itself, I cannot use the save_model trick provided by safetensors (https://huggingface.co/docs/safetensors/torch_shared_tensors).
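
In case anyone hits this before a proper fix lands, below is a minimal interim workaround sketch I could fall back to. It is not diffusers' own saving path: it assumes diffusers' default weight filename diffusion_pytorch_model.safetensors and the save_config method from ConfigMixin, and it simply clones every tensor so that nothing in the state dict shares storage before handing it to safetensors.

# Hypothetical workaround sketch, not the diffusers API: clone each tensor so no
# entry of the state dict shares storage, then serialize with safetensors directly.
import os
from safetensors.torch import save_file

model = unet.module  # unwrap DistributedDataParallel
save_dir = 'test/unet'
os.makedirs(save_dir, exist_ok=True)

model.save_config(save_dir)  # ConfigMixin writes config.json

# .clone() gives each parameter its own storage, so save_file no longer complains.
state_dict = {k: v.detach().clone().contiguous() for k, v in model.state_dict().items()}
save_file(state_dict, os.path.join(save_dir, 'diffusion_pytorch_model.safetensors'))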

Reproduction

Run CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 torchrun --nproc_per_node=6 SDSaveIssue.py

# SDSaveIssue.py
import torch
from diffusers import UNet2DConditionModel
from torch.distributed.optim import ZeroRedundancyOptimizer
import bitsandbytes as bnb

torch.distributed.init_process_group("nccl", init_method="env://")

rank = torch.distributed.get_rank()
torch.cuda.set_device(rank)

unet = UNet2DConditionModel.from_config({
    "_class_name": "UNet2DConditionModel",
    "_diffusers_version": "0.6.0",
    "act_fn": "silu",
    "attention_head_dim": 8,
    "block_out_channels": [
        320,
        640,
        1280,
        1280
    ],
    "center_input_sample": False,
    "cross_attention_dim": 768,
    "down_block_types": [
        "CrossAttnDownBlock2D",
        "CrossAttnDownBlock2D",
        "CrossAttnDownBlock2D",
        "DownBlock2D"
    ],
    "downsample_padding": 1,
    "flip_sin_to_cos": True,
    "freq_shift": 0,
    "in_channels": 4,
    "layers_per_block": 2,
    "mid_block_scale_factor": 1,
    "norm_eps": 1e-05,
    "norm_num_groups": 32,
    "out_channels": 4,
    "sample_size": 64,
    "up_block_types": [
        "UpBlock2D",
        "CrossAttnUpBlock2D",
        "CrossAttnUpBlock2D",
        "CrossAttnUpBlock2D"
    ]
}
)

unet.enable_gradient_checkpointing()
unet.set_use_memory_efficient_attention_xformers(True)

device = torch.device('cuda')
unet = unet.to(device, dtype=torch.float32)

unet = torch.nn.parallel.DistributedDataParallel(
    unet,
    device_ids=[rank],
    output_device=rank,
    gradient_as_bucket_view=True
)

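# parameters_as_bucket_view=True makes each parameter a view into a flat bucket,
# so many parameters end up sharing the same underlying storage; this is what
# later trips up safetensors inside save_pretrained().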
optimizer_parameters = unet.parameters()    
optimizer = ZeroRedundancyOptimizer(
        optimizer_parameters,
        optimizer_class=bnb.optim.AdamW8bit,
        parameters_as_bucket_view=True,
        lr=1e-7,
        betas=(0.9, 0.9),
        eps=0.9,
        weight_decay=1e-6,
    )

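# Only rank 0 saves; safe_serialization=True routes the save through
# safetensors.torch.save_file, which rejects the bucket-shared parameters above.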
if rank == 0:
    unet.module.save_pretrained(
        f'test/unet', safe_serialization=True)

torch.distributed.destroy_process_group()

Logs

(ldm) xxx@ubuntu:~/waifu-diffusion$ CUDA_VISIBLE_DEVICES=2,3,4,5,6,7 torchrun --nproc_per_node=6 trainer/SDSaveIssue.py
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "/home/xxx/waifu-diffusion/trainer/SDSaveIssue.py", line 75, in <module>
    unet.module.save_pretrained(
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 323, in save_pretrained
    safetensors.torch.save_file(
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/safetensors/torch.py", line 232, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/safetensors/torch.py", line 394, in _flatten
    raise RuntimeError(
RuntimeError: 
            Some tensors share memory, this will lead to duplicate memory on disk and potential differences when loading them again: [{'up_blocks.1.attentions.1.transformer_blocks.0.attn1.to_out.0.bias', 'up_blocks.3.resnets.2.time_emb_proj.bias', 'up_blocks.0.resnets.0.time_emb_proj.bias', 'mid_block.attentions.0.proj_out.bias', 'up_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj.bias', 'up_blocks.3.resnets.2.norm1.weight', 'down_blocks.1.resnets.0.norm1.weight', 'up_blocks.1.attentions.0.proj_out.bias', 'up_blocks.3.attentions.2.transformer_blocks.0.attn2.to_out.0.bias', 'up_blocks.3.resnets.0.conv_shortcut.weight', 'up_blocks.2.attentions.2.transformer_blocks.0.attn1.to_k.weight', 'up_blocks.3.resnets.1.time_emb_proj.bias', 'down_blocks.3.resnets.0.norm1.weight', 'mid_block.attentions.0.transformer_blocks.0.ff.net.0.proj.bias', 'down_blocks.0.resnets.0.time_emb_proj.bias', 'down_blocks.1.attentions.1.transformer_blocks.0.ff.net.2.bias', 'up_blocks.0.resnets.0.conv2.bias', 'down_blocks.0.resnets.0.time_emb_proj.weight', 'up_blocks.1.resnets.1.norm1.weight', 'down_blocks.0.attentions.1.transformer_blocks.0.attn1.to_k.weight', 'up_blocks.1.resnets.2.norm2.bias', 'mid_block.attentions.0.norm.weight', 'down_blocks.1.resnets.1.norm2.weight', 'down_blocks.0.attentions.1.transformer_blocks.0.ff.net.0.proj.bias', 

............... (the rest of the error message lists essentially every UNet parameter dict key) ...............

'down_blocks.0.resnets.0.conv2.weight', 'up_blocks.3.attentions.1.transformer_blocks.0.norm1.weight', 'mid_block.resnets.0.norm2.bias', 'down_blocks.2.attentions.1.transformer_blocks.0.norm2.weight', 'up_blocks.1.attentions.2.transformer_blocks.0.ff.net.2.weight', 'down_blocks.2.attentions.0.norm.bias', 'up_blocks.3.attentions.2.proj_in.bias', 'down_blocks.2.attentions.1.transformer_blocks.0.norm1.weight', 'up_blocks.1.attentions.0.norm.weight', 'up_blocks.0.resnets.1.time_emb_proj.bias', 'up_blocks.0.resnets.0.conv1.weight', 'up_blocks.3.attentions.0.transformer_blocks.0.norm3.bias', 'up_blocks.3.attentions.0.transformer_blocks.0.attn2.to_k.weight', 'up_blocks.2.attentions.0.transformer_blocks.0.norm2.weight', 'up_blocks.1.attentions.2.transformer_blocks.0.norm1.bias', down_blocks.2.attentions.0.transformer_blocks.0.attn1.to_out.0.weight'}].
            A potential way to correctly save your model is to use `save_model`.
            More information at https://huggingface.co/docs/safetensors/torch_shared_tensors

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 12294) of binary: /home/xxx/.conda/envs/ldm/bin/python
Traceback (most recent call last):
  File "/home/xxx/.conda/envs/ldm/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/xxx/.conda/envs/ldm/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
trainer/SDSaveIssue.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-07-29_23:21:08
  host      : ubuntu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 12294)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

System Info

Who can help?

No response

patrickvonplaten commented 1 year ago

Hey @eeyrw ,

I think this issue should maybe be posted in PyTorch? https://github.com/huggingface/safetensors

patrickvonplaten commented 1 year ago

either way cc @Narsil here

Narsil commented 1 year ago

Great, I have a good reproducible example.

Both DeepSpeed and now this seem to be using this odd technique of messing with storage (sharing storage without sharing tensors). At least with a good example, I can figure out what they are doing and either fix it in safetensors or inform users about what's going on.
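
For anyone following along, a tiny self-contained illustration of that pattern (a hypothetical example, not DeepSpeed's or ZeRO's actual code; untyped_storage() is the PyTorch 2.x spelling): two disjoint tensors carved out of one flat bucket share the same storage even though neither is a view of the other.

import torch

# One flat bucket, as a bucket-view optimizer might allocate.
bucket = torch.zeros(10)

# Two disjoint "parameters" living inside that bucket.
a = bucket[:4].view(2, 2)
b = bucket[4:].view(2, 3)

# Same underlying storage, yet different tensors with different start pointers.
print(a.untyped_storage().data_ptr() == b.untyped_storage().data_ptr())  # True
print(a.data_ptr() == b.data_ptr())                                      # False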

Narsil commented 1 year ago

https://github.com/huggingface/safetensors/pull/309 should fix it.

I'll make a release once this and another PR are merged.

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.