microsoft / FocalNet

[NeurIPS 2022] Official code for "Focal Modulation Networks"
MIT License

Model load_state_dict issue with 'focalnet_base_iso_16.pth' #23

Closed · WenY2020 closed this issue 1 year ago

WenY2020 commented 1 year ago

Hello @jwyang, I have another problem, this time when I try the isotropic FocalNet checkpoint 'focalnet_base_iso_16.pth':

I initialize the model with:

# isotropic FocalNets
model = FocalNet(depths=[12], patch_size=16, embed_dim=768, focal_levels=[3], focal_windows=[3], use_layerscale=True, use_postln=True).cuda()

and load 'focalnet_base_iso_16.pth' with:

import torch

ckpt_path = "focalnet_base_iso_16.pth"
ckpt = torch.load(ckpt_path)            # checkpoint dict with the weights under 'model'
model.load_state_dict(ckpt['model'])
model.eval()

but I get the error below:

Error(s) in loading state_dict for FocalNet: Unexpected key(s) in state_dict: "layers.0.blocks.0.modulation.ln.weight", "layers.0.blocks.0.modulation.ln.bias", "layers.0.blocks.1.modulation.ln.weight", "layers.0.blocks.1.modulation.ln.bias", "layers.0.blocks.2.modulation.ln.weight", "layers.0.blocks.2.modulation.ln.bias", "layers.0.blocks.3.modulation.ln.weight", "layers.0.blocks.3.modulation.ln.bias", "layers.0.blocks.4.modulation.ln.weight", "layers.0.blocks.4.modulation.ln.bias", "layers.0.blocks.5.modulation.ln.weight", "layers.0.blocks.5.modulation.ln.bias", "layers.0.blocks.6.modulation.ln.weight", "layers.0.blocks.6.modulation.ln.bias", "layers.0.blocks.7.modulation.ln.weight", "layers.0.blocks.7.modulation.ln.bias", "layers.0.blocks.8.modulation.ln.weight", "layers.0.blocks.8.modulation.ln.bias", "layers.0.blocks.9.modulation.ln.weight", "layers.0.blocks.9.modulation.ln.bias", "layers.0.blocks.10.modulation.ln.weight", "layers.0.blocks.10.modulation.ln.bias", "layers.0.blocks.11.modulation.ln.weight", "layers.0.blocks.11.modulation.ln.bias".

It seems the architectures are different. Is some part of the code out of date? I am still running visualization.ipynb. Thank you!
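In case it is useful, this is the small check I am running on my side. Loading with strict=False at least shows the mismatch explicitly; the commented-out use_postln_in_modulation flag is only my guess at what would create those extra modulation.ln layers, I could not confirm it:

# Non-strict load so I can inspect what is missing / unexpected.
import torch

ckpt = torch.load("focalnet_base_iso_16.pth", map_location="cpu")
result = model.load_state_dict(ckpt['model'], strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

# My guess (unconfirmed): the checkpoint may have been trained with an extra
# LayerNorm inside the modulation block, so constructing the model with that
# option enabled might produce the matching "modulation.ln.*" parameters:
# model = FocalNet(depths=[12], patch_size=16, embed_dim=768,
#                  focal_levels=[3], focal_windows=[3],
#                  use_layerscale=True, use_postln=True,
#                  use_postln_in_modulation=True).cuda()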

And also a question about model choice:

The attention visualization in the Hugging Face demo you shared last time looks quite good. The model behind it should be 'focalnet_base_iso_16.pth', right? That is what I gathered from checking the files there, if I understand correctly.

If my goal is to generate attention scores for all pixels in an image, which model do you recommend? Is the pre-trained focalnet_base_iso_16.pth good (or best) for this, or would I be better off training one on my own data (mostly online advertisement images), starting from a pre-trained model? (It seems there is no way to evaluate the quality of the attention scores except by eye/intuition.)
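For reference, this is roughly how I am pulling per-pixel maps at the moment, with a forward hook on each block's modulation module. It is only a sketch: the attribute path comes from the key names in the error above, and the channel-last output shape plus the channel-mean reduction are my own assumptions, not anything confirmed in the repo.

import torch

captured = {}

def make_hook(name):
    def hook(module, inputs, output):
        # I assume the modulation output is (B, H, W, C) for the isotropic model.
        captured[name] = output.detach()
    return hook

handles = [
    blk.modulation.register_forward_hook(make_hook(f"block{i}"))
    for i, blk in enumerate(model.layers[0].blocks)
]

images = torch.randn(1, 3, 224, 224).cuda()  # placeholder batch; I use my own ad images here
with torch.no_grad():
    _ = model(images)

# Reduce channels to one map per spatial location (my own choice of reduction).
maps = {name: out.abs().mean(dim=-1) for name, out in captured.items()}  # (B, H, W)

for h in handles:
    h.remove()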

Sorry for the rather long list of questions... Thanks a lot!

Best wishes, Wen

WenY2020 commented 1 year ago

Got the answer, thanks.