hisfog / SfMNeXt-Impl

[AAAI 2024] Official implementation of "SQLdepth: Generalizable Self-Supervised Fine-Structured Monocular Depth Estimation", and more.

Loading pretrained weights for EfficientNetB5 - Missing key(s) in state_dict #28

Open ionut-grigore99 opened 11 months ago

ionut-grigore99 commented 11 months ago

Hi!

When I attempt to load the pretrained weights you provided for EfficientNetB5, there are mismatches between the keys in the checkpoint's state_dict and the model's keys. Loading the weights was straightforward for ResNet50 and ConvNeXt, but not for EfficientNetB5.

hisfog commented 11 months ago

The args_files/hisfog/kitti/effb5_320x1024.txt is for KITTI (Efficient-b5). So which args_file did you use?

Choi-YeongJoon commented 11 months ago

I have the same issue. I used args_files/hisfog/kitti/effb5_320x1024.txt, and below are my args.

--load_pretrained_model --load_pt_folder /SfMNeXt-Impl/models/pretrained/KITTI_effb5_320x1024 --image_path /SfMNeXt-Impl/images/231121_E100#3_KATRI/1m_10_start6.jpg --log_dir /SfMNeXt-Impl/logs --model_name effb5_320x1024 --dataset kitti --eval_split eigen --backbone tf_efficientnet_b5_ap --height 320 --width 1024 --batch_size 16 --num_epochs 25 --scheduler_step_size 15 --model_dim 32 --patch_size 32 --dim_out 128 --query_nums 128 --dec_channels 512 256 128 64 32 --min_depth 0.001 --max_depth 80.0 --diff_lr --use_stereo --eval_mono --post_process --save_pred_disps

Below is the error message:

    Traceback (most recent call last):
      File "test_simple_SQL_config.py", line 254, in <module>
        test_simple(opt)
      File "test_simple_SQL_config.py", line 86, in test_simple
        outputs = model(input_image)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/SfMNeXt-Impl/SQLdepth.py", line 50, in forward
        return self.depth_decoder(x)["disp", 0]
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/SfMNeXt-Impl/networks/depth_decoder_QTR.py", line 52, in forward
        y = self.bins_regressor(summarys.view(bs, Q*E))
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py", line 119, in forward
        input = module(input)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/linear.py", line 94, in forward
        return F.linear(input, self.weight, self.bias)
      File "/usr/local/lib/python3.6/dist-packages/torch/nn/functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
    RuntimeError: mat1 dim 1 must match mat2 dim 0

hisfog commented 11 months ago

RuntimeError: mat1 dim 1 must match mat2 dim 0

What's your input image size? Does H*W / patch_size^2 >= query_nums hold?
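As a minimal sketch of that check, using the height, width, patch_size, and query_nums values from the args above purely for illustration:

    # Sketch: the number of patches must be at least the number of queries.
    height, width = 320, 1024          # --height / --width from the args above
    patch_size, query_nums = 32, 128   # --patch_size / --query_nums from the args above

    num_patches = (height * width) // (patch_size ** 2)
    print(num_patches, num_patches >= query_nums)   # 320 True for these values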

Choi-YeongJoon commented 11 months ago

I tested it with images of several sizes (from 1920×1080 down to 1024×320), and all of them satisfy H*W / patch_size^2 >= query_nums. While analyzing this error, I found a layer whose expected input size differs from the size of the tensor it actually receives.

networks/depth_decoder_QTR.py, line 22:

    self.bins_regressor = nn.Sequential(nn.Linear(embedding_dim * query_nums, 16 * query_nums),
                                        nn.LeakyReLU(),
                                        nn.Linear(16 * query_nums, 16 * 16),
                                        nn.LeakyReLU(),
                                        nn.Linear(16 * 16, dim_out))

line 54:

    y = self.bins_regressor(summarys.view(bs, Q * E))

Q * E is not the same as embedding_dim * query_nums.
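A quick way to confirm this mismatch at runtime, as a minimal sketch assuming the variable names from the snippets above:

    # Sketch: compare the first Linear layer's expected input width
    # (embedding_dim * query_nums) with the flattened summary width actually passed in (Q * E).
    expected_in = self.bins_regressor[0].in_features
    actual_in = summarys.view(bs, -1).shape[1]
    assert expected_in == actual_in, f"bins_regressor expects {expected_in}, got {actual_in}"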

Thank you for your quick response.

class DecoderBN(nn.Module):
    def __init__(self, num_features=2048, num_classes=1, bottleneck_features=2048):
        super(DecoderBN, self).__init__()
        features = int(num_features)

        self.conv2 = nn.Conv2d(bottleneck_features, features, kernel_size=1, stride=1, padding=1)

        self.up1 = UpSampleBN(skip_input=features // 1 + 112 + 64, output_features=features // 2)
        self.up2 = UpSampleBN(skip_input=features // 2 + 40 + 24, output_features=features // 4)
        self.up3 = UpSampleBN(skip_input=features // 4 + 24 + 16, output_features=features // 8)
        self.up4 = UpSampleBN(skip_input=features // 8 + 16 + 8, output_features=features // 16)

        self.up5 = UpSampleBN(skip_input=features // 16 + 3, output_features=features // 16)
        self.conv3 = nn.Conv2d(features // 16, num_classes, kernel_size=3, stride=1, padding=1)
        # self.act_out = nn.Softmax(dim=1) if output_activation == 'softmax' else nn.Identity()

    def forward(self, features):
        x_block0, x_block1, x_block2, x_block3, x_block4 = features[4], features[5], features[6], features[8], features[11]

        x_d0 = self.conv2(x_block4)

        x_d1 = self.up1(x_d0, x_block3)
        x_d2 = self.up2(x_d1, x_block2)
        x_d3 = self.up3(x_d2, x_block1)
        x_d4 = self.up4(x_d3, x_block0)
        x_d5 = self.up5(x_d4, features[0])
        out = self.conv3(x_d5)
        # out = self.conv3(x_d4)
        # out = self.act_out(out)
        # if with_features:
        #     return out, features[-1]
        # elif with_intermediate:
        #     return out, [x_block0, x_block1, x_block2, x_block3, x_block4, x_d1, x_d2, x_d3, x_d4]
        return out

ionut-grigore99 commented 11 months ago

My problem is that when I simply try to load the pretrained weights provided in the repo, the keys expected by BaseEncoder don't match the keys in the provided weights. I just did this:

model = BaseEncoder.build(num_features=256, model_dim=32)
model.from_pretrained(weights_path='/home/Desktop/SQLdepth/src/pretrained/KITTI_EfficientNetB5_320x1024/encoder.pth', device='cpu')

where I have this:

def from_pretrained(self, weights_path, device='cpu'):
    # Load the checkpoint and keep only the keys that exist in this model's state_dict.
    loaded_dict_enc = torch.load(weights_path, map_location=device)
    filtered_dict_enc = {k: v for k, v in loaded_dict_enc.items() if k in self.state_dict()}
    self.load_state_dict(filtered_dict_enc)  # strict=True by default, so any key the model expects but the filtered dict lacks raises "Missing key(s) in state_dict"
    self.eval()
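For reference, a minimal sketch (assuming model and the same weights_path as in the snippets above) that lists which keys actually disagree between the checkpoint and the model:

    # Sketch: report keys the model expects but the checkpoint lacks, and vice versa.
    checkpoint = torch.load(weights_path, map_location='cpu')
    model_keys = set(model.state_dict().keys())
    ckpt_keys = set(checkpoint.keys())
    print("missing from checkpoint:", sorted(model_keys - ckpt_keys)[:10])
    print("unexpected in checkpoint:", sorted(ckpt_keys - model_keys)[:10])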
hisfog commented 11 months ago

My problem is that when I simply try to load the pretrained weights provided in the repo, the keys expected by BaseEncoder don't match the keys in the provided weights. I just did this:

The pretrained KITTI EfficientNet-b5 model does not use BaseEncoder as its backbone; it uses Unet (--backbone tf_efficientnet_b5_ap). So you should use args_files/hisfog/kitti/effb5_320x1024.txt for the KITTI EfficientNet-b5 model.

ionut-grigore99 commented 11 months ago

And in this case, will the shape of the feature map from the EfficientNet encoder be (c, h, w) or (c, h/2, w/2)? The first shape is what gets printed now.

hisfog commented 11 months ago

class DecoderBN(nn.Module): def __init__(self, num_features=2048, num_classes=1, bottleneck_features=2048):

The KITTI EfficientNet-b5 model does not use this DecoderBN. The encoder should be Unet, NOT BaseEncoder, and there is NO DecoderBN:

self.encoder = networks.Unet(pretrained=(not opt.load_pretrained_model), backbone=opt.backbone, in_channels=3, num_classes=opt.model_dim, decoder_channels=opt.dec_channels)
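As a minimal sketch of building that encoder with the values from the args file above and loading the provided encoder.pth (the checkpoint path and the key filtering are assumptions for illustration, not the repo's exact loading code):

    # Sketch: build the Unet encoder as above, then load the provided encoder checkpoint.
    import torch
    import networks

    encoder = networks.Unet(pretrained=False,
                            backbone='tf_efficientnet_b5_ap',          # --backbone
                            in_channels=3,
                            num_classes=32,                            # --model_dim
                            decoder_channels=[512, 256, 128, 64, 32])  # --dec_channels

    state = torch.load('KITTI_effb5_320x1024/encoder.pth', map_location='cpu')  # assumed path
    encoder.load_state_dict({k: v for k, v in state.items() if k in encoder.state_dict()})
    encoder.eval()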

@ionut-grigore99 @Choi-YeongJoon

ionut-grigore99 commented 11 months ago

It works fine now, but the resolution of the resulting feature map appears to be the same as the input resolution. In contrast, for ConvNeXt and ResNet the resolution is halved, as claimed in the paper. I just want to know whether this is the expected behaviour, thanks!

hisfog commented 11 months ago

It works fine now, but the resolution of the resulting feature map appears to be the same as the input resolution. In contrast, for ConvNeXt and ResNet the resolution is halved, as claimed in the paper. I just want to know whether this is the expected behaviour, thanks!

Yes, that's expected!
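For anyone who wants to verify this, a minimal sketch (assuming encoder is the Unet encoder built as in the comment above and that it returns a single feature map):

    # Sketch: compare the input resolution with the encoder's output feature-map resolution.
    dummy = torch.randn(1, 3, 320, 1024)       # --height 320 --width 1024
    with torch.no_grad():
        feat = encoder(dummy)
    # EfficientNet-b5 Unet keeps the full h x w; ResNet50/ConvNeXt halve it, per the discussion above.
    print(dummy.shape[-2:], feat.shape[-2:])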

Choi-YeongJoon commented 11 months ago

Now it works fine for me too. Thanks a lot! @hisfog