minienglish1 opened this issue 3 weeks ago
Hey @minienglish1, thanks for reporting! I think you are experiencing the same issue as this one: https://github.com/huggingface/accelerate/issues/3061. Could you try setting `unwrap=False` in your case? We will try to fix this ASAP.
@SunMarc Thanks for the response. No rush to fix this.
Using `accelerator.get_state_dict(unet, unwrap=False)` appears to have fixed the problem, or at least allows the pipeline to be saved via `save_pretrained`. I haven't tried loading the saved pipeline yet.
System Info

Information

Tasks
- `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)

Reproduction
Code snippet - collecting the unet to put into an SDXL pipeline for `save_pretrained`:

```python
unet_state_dict = accelerator.get_state_dict(unet)
if accelerator.is_main_process:
    pipeline_unet = UNet2DConditionModel.from_pretrained(
        pretrained_model_name_or_path, subfolder="unet"
    )
    pipeline_unet.load_state_dict(unet_state_dict)  # fails here
```
Error when using accelerate version 0.34.0; does not error with accelerate version 0.33.0:

```
rank0: Traceback (most recent call last):
rank0:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_05.py", line 848, in <module>
rank0:   File "/mnt/storage/projects/sdxl_trainer_v3/sdxl_v3_train_05.py", line 600, in main
rank0:   File "/mnt/storage/projects/sdxl_trainer_v3/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2215, in load_state_dict
rank0:     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
rank0: RuntimeError: Error(s) in loading state_dict for UNet2DConditionModel:
rank0:     size mismatch for conv_in.weight: copying a param with shape torch.Size([11520]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).
rank0:     size mismatch for time_embedding.linear_1.weight: copying a param with shape torch.Size([409600]) from checkpoint, the shape in current model is torch.Size([1280, 320]).
rank0:     size mismatch for time_embedding.linear_2.weight: copying a param with shape torch.Size([268802]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
rank0:     size mismatch for time_embedding.linear_2.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280]).
rank0:     size mismatch for conv_norm_out.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([320]).
rank0:     size mismatch for conv_norm_out.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([320]).
rank0:     size mismatch for conv_out.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4, 320, 3, 3]).
rank0:     size mismatch for conv_out.bias: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([4]).
```
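The shapes in the traceback (a 1-D `torch.Size([11520])` buffer where a `[320, 4, 3, 3]` weight is expected, plus several empty `[0]` tensors) look like flattened or sharded parameters being returned instead of the gathered full weights. A minimal plain-Python sketch of the strict shape check that produces this error (`check_state_dict` is a hypothetical illustration, not the actual torch implementation; tuples stand in for `torch.Size`):

```python
# Hypothetical helper mimicking the size check load_state_dict performs
# in strict mode: any checkpoint tensor whose shape differs from the
# model's parameter shape is reported as a size mismatch.

def check_state_dict(model_shapes, checkpoint_shapes):
    """Collect size-mismatch messages for params present in both dicts."""
    errors = []
    for name, expected in model_shapes.items():
        got = checkpoint_shapes.get(name)
        if got is not None and got != expected:
            errors.append(
                f"size mismatch for {name}: copying a param with shape "
                f"{got} from checkpoint, the shape in current model is {expected}."
            )
    return errors

# Shapes lifted from the traceback above: the checkpoint hands back
# flattened 1-D buffers, while the freshly loaded UNet expects the
# real parameter shapes.
model_shapes = {
    "conv_in.weight": (320, 4, 3, 3),
    "time_embedding.linear_1.weight": (1280, 320),
}
checkpoint_shapes = {
    "conv_in.weight": (11520,),
    "time_embedding.linear_1.weight": (409600,),
}

errors = check_state_dict(model_shapes, checkpoint_shapes)
assert len(errors) == 2  # every flattened param is rejected
```

Note that 11520 = 320 × 4 × 3 × 3 and 409600 = 1280 × 320, which is why the checkpoint tensors look like the original weights flattened to one dimension.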
Expected behavior
Expected the same behavior on accelerate version 0.34.0 as on accelerate version 0.33.0: load_state_dict() completes correctly.