huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

cpu_offload with diffusers save_pretrained raises: NotImplementedError: Cannot copy out of meta tensor; no data! #2817

Open zengziru opened 1 month ago

zengziru commented 1 month ago

System Info

accelerate: 0.24.1
diffusers: 0.27.0
transformers: 4.30.2

The error:
Traceback (most recent call last):
  File "/mnt/vdb/qingluo/DiffusionDPO/train.py", line 1248, in <module>
    main()
  File "/mnt/vdb/qingluo/DiffusionDPO/train.py", line 1241, in main
    pipeline.save_pretrained(args.output_dir)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 279, in save_pretrained
    save_method(os.path.join(save_directory, pipeline_component_name), **save_kwargs)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 369, in save_pretrained
    safetensors.torch.save_file(
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 284, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
    return {
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 492, in <dictcomp>
    "data": _tobytes(v, k),
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 414, in _tobytes
    tensor = tensor.to("cpu")
NotImplementedError: Cannot copy out of meta tensor; no data!
Steps: : 0it [00:01, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3110) of binary: /mnt/vdb/qingluo/env/bin/python
Traceback (most recent call last):
  File "/workspace/qingluo/env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Reproduction

from accelerate import cpu_offload

text_encoder.to(accelerator.device, dtype=weight_dtype)
text_encoder = cpu_offload(text_encoder)
vae = cpu_offload(vae)

pipeline = StableDiffusionPipeline.from_pretrained(
    args.pretrained_model_name_or_path,
    text_encoder=text_encoder,
    vae=vae,
    unet=unet,
    revision=args.revision,
)
pipeline.save_pretrained(args.output_dir)
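The root cause here is likely that `cpu_offload` replaces module weights with tensors on PyTorch's `meta` device (shape/dtype metadata with no backing storage, materialized on demand by hooks during the forward pass), so `save_pretrained` ends up asking `safetensors` to serialize storage-less tensors. A minimal torch-only sketch of that failure mode (this reproduces the error message, not the full pipeline):

```python
import torch

# A "meta" tensor carries shape and dtype but has no data behind it.
t = torch.empty(3, device="meta")
print(t.is_meta)  # True

# Copying it to a real device fails, which is exactly what safetensors'
# _tobytes hits when it calls tensor.to("cpu") on an offloaded weight.
try:
    t.to("cpu")
except NotImplementedError as e:
    print(type(e).__name__)  # NotImplementedError
```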

Expected behavior

Expected output: the model saves successfully. With accelerate 0.20.2 and diffusers 0.20.0 it works; after updating to the versions above, it fails.
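One way to catch this before `safetensors` raises deep inside serialization is to check whether any component still holds meta parameters before saving. The `has_meta_params` helper below is hypothetical (not part of accelerate or diffusers), but uses only standard PyTorch APIs:

```python
import torch
import torch.nn as nn


def has_meta_params(module: nn.Module) -> bool:
    """Return True if any parameter or buffer lives on the meta device."""
    return any(p.is_meta for p in module.parameters()) or any(
        b.is_meta for b in module.buffers()
    )


# A normally constructed module has real storage behind its weights.
m = nn.Linear(4, 4)
print(has_meta_params(m))  # False

# A module built on the meta device (as cpu_offload-managed weights are)
# has no data to serialize.
meta_m = nn.Linear(4, 4, device="meta")
print(has_meta_params(meta_m))  # True
```

Running such a check on `pipeline.text_encoder` and `pipeline.vae` before `save_pretrained` would confirm whether the offloading hooks left the weights in a meta state.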

SunMarc commented 1 month ago

Hi @zengziru, could you share a full minimal reproducer?

> When I use the version accelerate 0.20.2, diffusers 0.20.0 it works However when I update the version, it failed

Does this happen after you update the accelerate library or the diffusers library?

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.