openai / guided-diffusion


Sampling at 64x64 - Missing key(s) in state_dict / size mismatch - segfault #8

Open acardara opened 3 years ago

acardara commented 3 years ago

I want to sample images from the pretrained 64x64_diffusion model but am hitting a segfault with the suggested run configuration. I've downloaded the 64x64 checkpoints to a models folder and am running with the following flags.

!SAMPLE_FLAGS="--batch_size 4 --num_samples 100 --timestep_respacing 250"

!MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --dropout 0.1 --image_size 64 --learn_sigma True --noise_schedule cosine --num_channels 192 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_new_attention_order True --use_fp16 True --use_scale_shift_norm True"

!python image_sample.py $MODEL_FLAGS --model_path models/64x64_diffusion.pt $SAMPLE_FLAGS

At runtime, I get a slew of warnings about missing and unexpected keys before the code crashes with a segfault:

Missing key(s) in state_dict: "input_blocks.3.0.op.weight", "input_blocks.3.0.op.bias", "input_blocks.4.0.skip_connection.weight", ..., "output_blocks.8.1.conv.bias".

Unexpected key(s) in state_dict: "label_emb.weight", "input_blocks.12.0.in_layers.0.weight", "input_blocks.12.0.in_layers.0.bias", ..., "output_blocks.11.2.out_layers.3.bias".

size mismatch for time_embed.0.weight: copying a param with shape torch.Size([768, 192]) from checkpoint, the shape in current model is torch.Size([512, 128]). ... size mismatch for out.2.bias: copying a param with shape torch.Size([6]) from checkpoint, the shape in current model is torch.Size([3]).

acardara commented 3 years ago

I matched the model architecture as suggested in #7, which removed the mismatch warnings, but the missing and unexpected key warnings are still there. I am still getting a segfault.

lmvgjp commented 2 years ago

Hi, I get a runtime error with the same message:

RuntimeError: Error(s) in loading state_dict for SuperResModel: Missing key(s) in state_dict: "input_blocks.3.0.op.weight", "input_blocks.3.0.op.bias", "input_blocks.6.0.op.weight", "input_blocks.6.0.op.bias", "input_blocks.9.0.op.weight", "input_blocks.9.0.op.bias", "input_blocks.12.0.op.weight", "input_blocks.12.0.op.bias", "input_blocks.15.0.op.weight", "input_blocks.15.0.op.bias", "output_blocks.2.2.conv.weight", "output_blocks.2.2.conv.bias", "output_blocks.5.2.conv.weight", "output_blocks.5.2.conv.bias", "output_blocks.8.1.conv.weight", "output_blocks.8.1.conv.bias", "output_blocks.11.1.conv.weight", "output_blocks.11.1.conv.bias", "output_blocks.14.1.conv.weight", "output_blocks.14.1.conv.bias".

Did you manage to solve it?

acardara commented 2 years ago

No, I didn't find a solution.

lmvgjp commented 2 years ago

ok, thanks for answering!

inbarhub commented 2 years ago

Hi, I encountered the same problem here. My guess is that the published model has a slightly different architecture from the one defined in the code. Can you please check whether they match?

inbarhub commented 2 years ago

I solved it by passing the 'restrict=False' flag when loading the model, but the results I get are really poor. I guess this is because the model was not loaded properly.

DiogoNeves commented 2 years ago

Any news on this? I'm hitting the same issue.

EDIT: I was defining the environment variables the wrong way (new to Jupyter 😅 )

DiogoNeves commented 2 years ago

@acardara I'm still having other issues, but I think this might help you. From your message, you seem to be running this inside a Jupyter notebook. You're currently defining the environment variables with !, which doesn't persist them to other commands.

Try using %env like:

%env SAMPLE_FLAGS=...
!python image_sample.py ... ${SAMPLE_FLAGS}

I'm new to Jupyter and was running into a similar issue. My understanding is that !SAMPLE_FLAGS=... only works if you run the Python script on the same line, similar to setting a variable inline in bash.
I haven't tried it, but something like !SAMPLE_FLAGS="..." python ... $SAMPLE_FLAGS should work if I'm right (see the sketch below).
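
For example, something along these lines, chaining the assignments and the script on a single ! line with semicolons, should also work, because everything runs in the same shell (just a sketch of the idea, I haven't actually run it against this repo):

!SAMPLE_FLAGS="--batch_size 4 --num_samples 100 --timestep_respacing 250"; MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --diffusion_steps 1000 --dropout 0.1 --image_size 64 --learn_sigma True --noise_schedule cosine --num_channels 192 --num_head_channels 64 --num_res_blocks 3 --resblock_updown True --use_new_attention_order True --use_fp16 True --use_scale_shift_norm True"; python image_sample.py $MODEL_FLAGS --model_path models/64x64_diffusion.pt $SAMPLE_FLAGS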

shahdghorsi commented 2 years ago

I solved it by using 'restrict=False' flag when loading the model but the results I get are really poor. I guess this is because the model was not loaded well.

Hi @inbarhub, I tried using restrict=False here:

model.load_state_dict( dist_util.load_state_dict(args.model_path, restrict=False) )

but it did not work.

wangqiang9 commented 2 years ago

I solved the same problem with:

model.load_state_dict(dist_util.load_state_dict(args.model_path, map_location="cpu"), strict=False)
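
Note that the keyword is strict (an argument of torch.nn.Module.load_state_dict), not restrict, and it goes on the outer call, not on dist_util.load_state_dict. Below is a hypothetical standalone sketch (not a file from this repo; it assumes you run it from the repo root with models/64x64_diffusion.pt present) that builds the 64x64 model with the flags quoted at the top of this issue and reports what strict=False actually skips:

from guided_diffusion import dist_util
from guided_diffusion.script_util import (
    create_model_and_diffusion,
    model_and_diffusion_defaults,
)

# The 64x64 model configuration from the MODEL_FLAGS quoted above.
options = model_and_diffusion_defaults()
options.update(
    attention_resolutions="32,16,8",
    class_cond=True,
    diffusion_steps=1000,
    dropout=0.1,
    image_size=64,
    learn_sigma=True,
    noise_schedule="cosine",
    num_channels=192,
    num_head_channels=64,
    num_res_blocks=3,
    resblock_updown=True,
    use_new_attention_order=True,
    use_fp16=True,
    use_scale_shift_norm=True,
)
model, diffusion = create_model_and_diffusion(**options)

# strict=False belongs on torch.nn.Module.load_state_dict (the outer call);
# dist_util.load_state_dict only reads the raw state dict from disk.
state_dict = dist_util.load_state_dict("models/64x64_diffusion.pt", map_location="cpu")
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

Keep in mind that strict=False only suppresses the missing/unexpected-key errors: skipped weights stay randomly initialized (which would explain the poor samples reported above), and it cannot fix size mismatches, since those mean the model was built with different flags than the checkpoint.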

ONobody commented 1 year ago

@XDUWQ Could you share your contact information so I can ask you some questions?

UESTC-Med424-JYX commented 8 months ago

Hi, I get a runtime error with the same message:

RuntimeError: Error(s) in loading state_dict for SuperResModel: Missing key(s) in state_dict: "input_blocks.3.0.op.weight", "input_blocks.3.0.op.bias", "input_blocks.6.0.op.weight", "input_blocks.6.0.op.bias", "input_blocks.9.0.op.weight", "input_blocks.9.0.op.bias", "input_blocks.12.0.op.weight", "input_blocks.12.0.op.bias", "input_blocks.15.0.op.weight", "input_blocks.15.0.op.bias", "output_blocks.2.2.conv.weight", "output_blocks.2.2.conv.bias", "output_blocks.5.2.conv.weight", "output_blocks.5.2.conv.bias", "output_blocks.8.1.conv.weight", "output_blocks.8.1.conv.bias", "output_blocks.11.1.conv.weight", "output_blocks.11.1.conv.bias", "output_blocks.14.1.conv.weight", "output_blocks.14.1.conv.bias".

Did you manage to solve it?

I encountered the same problem. How did you solve it?

TimenoLong commented 5 months ago

I solve the same problem! model.load_state_dict( dist_util.load_state_dict(args.model_path, map_location="cpu"), strict=False )

Your method only fixes the missing key(s) error; it does not solve the size mismatch error.

Alexdbsdfs commented 2 months ago

@TimenoLong Hello, have you solved it?

DiogoNeves commented 2 months ago

@TimenoLong Hello, have you solved it?

This is the closest I got: https://github.com/openai/guided-diffusion/issues/8#issuecomment-1139795831

Alexdbsdfs commented 2 months ago

Thanks for your reply @DiogoNeves.
Now I have a question. This is my classifier training setup:

#!/bin/bash

# Set the model, classifier, and sampling flags
TRAIN_FLAGS="--iterations 10000 --anneal_lr True --batch_size 32 --lr 3e-4 --save_interval 1000 --weight_decay 0.05"
CLASSIFIER_FLAGS="--image_size 64 --classifier_attention_resolutions 32,16,8 --classifier_depth 2 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True"

# Run the training command
mpiexec -n N python scripts/classifier_train.py --data_dir /home/cumt306/dingbo/DCNv4/data/train $TRAIN_FLAGS $CLASSIFIER_FLAGS

This is my diffusion training setup:

MODEL_FLAGS="--image_size 64 --num_channels 128 --num_res_blocks 3"
DIFFUSION_FLAGS="--diffusion_steps 4000 --noise_schedule linear"
TRAIN_FLAGS="--lr 1e-4 --batch_size 32"

python scripts/image_train.py --data_dir /home/cumt306/dingbo/DCNv4/data/train $MODEL_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

And this is my sampling setup:

MODEL_FLAGS="--attention_resolutions 32,16,8 --class_cond True --image_size 64 --learn_sigma True --num_channels 128 --num_heads 4 --num_res_blocks 3 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
CLASSIFIER_FLAGS="--image_size 64 --classifier_attention_resolutions 32,16,8 --classifier_depth 4 --classifier_width 128 --classifier_pool attention --classifier_resblock_updown True --classifier_use_scale_shift_norm True --classifier_scale 1.0 --classifier_use_fp16 True"
SAMPLE_FLAGS="--batch_size 32 --num_samples 1000 --timestep_respacing ddim25 --use_ddim True"

mpiexec -n N python scripts/classifier_sample.py \
  --model_path /home/cumt306/dingbo/guided-diffusion-main/diffusion_model/openai-2024-07-16-11-34-50-811594/model001000.pt \
  --classifier_path /home/cumt306/dingbo/guided-diffusion-main/classer_model/openai-2024-07-16-11-28-48-194253/model001000.pt \
  $MODEL_FLAGS $CLASSIFIER_FLAGS $SAMPLE_FLAGS

When I run sampling, I get:

RuntimeError: Error(s) in loading state_dict for UNetModel:
size mismatch for out.2.weight: copying a param with shape torch.Size([3, 128, 3, 3]) from checkpoint, the shape in current model is torch.Size([6, 128, 3, 3]).
size mismatch for out.2.bias: copying a param with shape torch.Size([3]) from checkpoint, the shape in current model is torch.Size([6]).

DiogoNeves commented 2 months ago

@Alexdbsdfs I'm just looking at this from my phone, so I can't test.

But it looks like the UNet you trained does not match the one you're building for sampling.
I'm guessing a lot here, but check what the default number of res blocks is for training.
I don't think you're setting it for training, but then you set it to 3 later.

A simple thing you could also try is setting the blocks to 6 for sampling.
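
One more guess from the flags you quoted (untested): your sampling MODEL_FLAGS add --learn_sigma True, --class_cond True and several other architecture flags that your training MODEL_FLAGS don't set. --learn_sigma alone doubles the UNet's output channels from 3 to 6, which is exactly the out.2 shape mismatch in your error. The model flags used for sampling have to describe the same architecture the checkpoint was trained with, e.g. (hypothetical, adjust to your run):

# use the exact same architecture flags for sampling as for training
MODEL_FLAGS="--image_size 64 --num_channels 128 --num_res_blocks 3"

or retrain with --learn_sigma True (and the other extra flags) if that's the architecture you want.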

Let me know if that helps

Alexdbsdfs commented 2 months ago

Thank you very much for your reply. In diffusion training I set --num_res_blocks 3.

In sampling I set --num_res_blocks 3 too.

Then I tried setting num_res_blocks to 6, and similar issues arose. @DiogoNeves