huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

cpu_offload with diffusers save_pretrained raises: NotImplementedError: Cannot copy out of meta tensor; no data! #2817

Open zengziru opened 1 month ago

zengziru commented 1 month ago

System Info

accelerate: 0.24.1
diffusers: 0.27.0
transformers: 4.30.2

The error:
Traceback (most recent call last):
  File "/mnt/vdb/qingluo/DiffusionDPO/train.py", line 1248, in <module>
    main()
  File "/mnt/vdb/qingluo/DiffusionDPO/train.py", line 1241, in main
    pipeline.save_pretrained(args.output_dir)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/diffusers/pipelines/pipeline_utils.py", line 279, in save_pretrained
    save_method(os.path.join(save_directory, pipeline_component_name), **save_kwargs)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/diffusers/models/modeling_utils.py", line 369, in save_pretrained
    safetensors.torch.save_file(
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 284, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 488, in _flatten
    return {
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 492, in <dictcomp>
    "data": _tobytes(v, k),
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/safetensors/torch.py", line 414, in _tobytes
    tensor = tensor.to("cpu")
NotImplementedError: Cannot copy out of meta tensor; no data!
Steps: : 0it [00:01, ?it/s]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3110) of binary: /mnt/vdb/qingluo/env/bin/python
Traceback (most recent call last):
  File "/workspace/qingluo/env/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 985, in launch_command
    multi_gpu_launcher(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/accelerate/commands/launch.py", line 654, in multi_gpu_launcher
    distrib_run.run(args)
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/vdb/qingluo/env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Reproduction

from accelerate import cpu_offload

text_encoder.to(accelerator.device, dtype=weight_dtype)
text_encoder = cpu_offload(text_encoder)
vae = cpu_offload(vae)

pipeline = StableDiffusionPipeline.from_pretrained(
    args.pretrained_model_name_or_path,
    text_encoder=text_encoder,
    vae=vae,
    unet=unet,
    revision=args.revision,
)
pipeline.save_pretrained(args.output_dir)
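The root cause here is likely that `cpu_offload` replaces module weights with tensors on PyTorch's `meta` device (shape/dtype metadata with no backing storage, materialized on demand by hooks during the forward pass), so `save_pretrained` ends up asking `safetensors` to serialize storage-less tensors. A minimal torch-only sketch of that failure mode (this reproduces the error message, not the full pipeline):

```python
import torch

# A "meta" tensor carries shape and dtype but has no data behind it.
t = torch.empty(3, device="meta")
print(t.is_meta)  # True

# Copying it to a real device fails, which is exactly what safetensors'
# _tobytes hits when it calls tensor.to("cpu") on an offloaded weight.
try:
    t.to("cpu")
except NotImplementedError as e:
    print(type(e).__name__)  # NotImplementedError
```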

Expected behavior

Expected output: the model saves successfully. With accelerate 0.20.2 and diffusers 0.20.0 it works; after updating to the versions above, it fails.
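One way to catch this before `safetensors` raises deep inside serialization is to check whether any component still holds meta parameters before saving. The `has_meta_params` helper below is hypothetical (not part of accelerate or diffusers), but uses only standard PyTorch APIs:

```python
import torch
import torch.nn as nn


def has_meta_params(module: nn.Module) -> bool:
    """Return True if any parameter or buffer lives on the meta device."""
    return any(p.is_meta for p in module.parameters()) or any(
        b.is_meta for b in module.buffers()
    )


# A normally constructed module has real storage behind its weights.
m = nn.Linear(4, 4)
print(has_meta_params(m))  # False

# A module built on the meta device (as cpu_offload-managed weights are)
# has no data to serialize.
meta_m = nn.Linear(4, 4, device="meta")
print(has_meta_params(meta_m))  # True
```

Running such a check on `pipeline.text_encoder` and `pipeline.vae` before `save_pretrained` would confirm whether the offloading hooks left the weights in a meta state.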

SunMarc commented 1 month ago

Hi @zengziru, could you share a full minimal reproducer?

> When I use the version accelerate 0.20.2, diffusers 0.20.0 it works However when I update the version, it failed

Does this happen after you update the accelerate library or the diffusers library?

github-actions[bot] commented 2 days ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.