huggingface / accelerate

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support
https://huggingface.co/docs/accelerate
Apache License 2.0

updating accelerate to 0.34.0 causes RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel #3079

Open minienglish1 opened 3 weeks ago

minienglish1 commented 3 weeks ago

System Info

Custom SDXL training script using FSDP SHARD_GRAD_OP with CPU offload.

After upgrading accelerate from 0.33.0 to 0.34.0, collecting the state_dict with accelerator.get_state_dict() and then calling load_state_dict() raises an error. The same code works correctly on accelerate 0.33.0. The same unet model used to initialize training is used for load_state_dict().

I also tried unwrapping the model (unet), but that doesn't work either: it fails when saving the pipeline via save_pretrained. That's a separate issue, though.
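For reference, a minimal sketch of what that unwrapping attempt would typically look like (this exact call is an assumption, not quoted from the training script):

```python
# Hypothetical form of the unwrapping attempt mentioned above: pull the
# original module out of the FSDP wrapper before building the pipeline.
unwrapped_unet = accelerator.unwrap_model(unet)
```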

accelerate env:
- `Accelerate` version: 0.34.0
- Platform: Linux-6.8.0-40-generic-x86_64-with-glibc2.35
- `accelerate` bash location: /mnt/storage/projects/sdxl_trainer_v3/venv/bin/accelerate
- Python version: 3.10.12
- Numpy version: 2.1.1
- PyTorch version (GPU?): 2.4.0+cu124 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- PyTorch MUSA available: False
- System RAM: 251.57 GB
- GPU type: NVIDIA GeForce RTX 4090

accelerate config yaml:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: true
fsdp_config:
  fsdp_activation_checkpointing: true
  fsdp_auto_wrap_policy: SIZE_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: false
  fsdp_forward_prefetch: true
  fsdp_min_num_params: 5000000
  fsdp_offload_params: true
  fsdp_sharding_strategy: SHARD_GRAD_OP
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
gpu_ids: 0,1,2
main_process_port: 29051
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 3
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
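For context, a minimal sketch of how a training script typically consumes this config (object names are placeholders; the FSDP and mixed-precision settings themselves are picked up from the YAML when the script is started with `accelerate launch --config_file ...`):

```python
from accelerate import Accelerator

# The FSDP and fp16 settings above come from the accelerate config at launch
# time; the script only needs to create an Accelerator and prepare the model.
accelerator = Accelerator()
unet = accelerator.prepare(unet)  # wraps the unet in FSDP per fsdp_config
```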

Reproduction

code snippet - collecting the unet to put into the sdxl pipeline for save_pretrained:

unet_state_dict = accelerator.get_state_dict(unet)
if accelerator.is_main_process:
    pipeline_unet = UNet2DConditionModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="unet"
    )
    pipeline_unet.load_state_dict(unet_state_dict)  # fails here
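For context, a hedged sketch of how the gathered unet is typically placed back into an SDXL pipeline for saving (the pipeline class, dtype, and output path below are assumptions, not taken from the report):

```python
import torch
from diffusers import StableDiffusionXLPipeline

if accelerator.is_main_process:
    # Rebuild the pipeline around the unet loaded from the gathered state_dict.
    pipeline = StableDiffusionXLPipeline.from_pretrained(
        pretrained_model_name_or_path,
        unet=pipeline_unet,
        torch_dtype=torch.float16,  # assumed; matches the fp16 mixed-precision config
    )
    pipeline.save_pretrained("sdxl_checkpoint")  # hypothetical output directory
```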

error when using accelerate 0.34.0; does not occur with accelerate 0.33.0:

rank0: Traceback (most recent call last):
rank0:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_05.py", line 848, in <module>
rank0:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_05.py", line 600, in main
rank0:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
rank0:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
rank0: RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
rank0:   size mismatch for conv_in.weight: copying a param with shape torch.Size([11520]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).
rank0:   size mismatch for time_embedding.linear_1.weight: copying a param with shape torch.Size([409600]) from checkpoint, the shape in current model is torch.Size([1280, 320]).
rank0:   size mismatch for time_embedding.linear_2.weight: copying a param with shape torch.Size([268802]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
rank0:   size mismatch for time_embedding.linear_2.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280]).
rank0:   size mismatch for conv_norm_out.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([320]).
rank0:   size mismatch for conv_norm_out.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([320]).
rank0:   size mismatch for conv_out.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4, 320, 3, 3]).
rank0:   size mismatch for conv_out.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4]).

Expected behavior

Expect the same behavior on accelerate 0.34.0 as on 0.33.0: load_state_dict() completes correctly.

SunMarc commented 3 weeks ago

Hey @minienglish1, thanks for reporting! I think you are experiencing the same issue as this one: https://github.com/huggingface/accelerate/issues/3061. Could you try setting unwrap=False in your case? We will try to fix this ASAP.
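A minimal sketch of the suggested change, applied to the reproduction snippet above (the `unwrap` parameter is named in the comment; everything else mirrors the original snippet):

```python
# Suggested workaround: skip unwrapping when gathering the FSDP state dict.
unet_state_dict = accelerator.get_state_dict(unet, unwrap=False)
if accelerator.is_main_process:
    pipeline_unet.load_state_dict(unet_state_dict)
```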

minienglish1 commented 3 weeks ago

@SunMarc Thanks for the response. No rush to fix this.

Using accelerator.get_state_dict(unet, unwrap=False) appears to have fixed the problem, or at least it lets the pipeline be saved via save_pretrained. I haven't tried using the saved pipeline yet.