microsoft / Relation-Aware-Global-Attention-Networks

We design an effective Relation-Aware Global Attention (RGA) module for CNNs to globally infer the attention.
MIT License

KeyError: 'bn1.num_batches_tracked' #4

Open xiaopanchen opened 4 years ago

xiaopanchen commented 4 years ago

The line state_dict[i].copy_(param_dict[key]) raises an error (I am using torchvision 0.4.0): KeyError: 'bn1.num_batches_tracked'

ZhiZZhang commented 4 years ago

The line state_dict[i].copy_(param_dict[key]) raises an error (I am using torchvision 0.4.0): KeyError: 'bn1.num_batches_tracked'

This is a PyTorch version issue. With a newer version you can simply skip this key; it has little impact on the experimental results.

andreazuna89 commented 4 years ago

Hi. I am not able to reach your reported performance on CUHK (the mAP I get is around 40%). I ran into the same problem, and the same error occurs for similar layers (e.g. layer1.0.bn1.num_batches_tracked, layer1.0.bn2.num_batches_tracked). Are you sure we can skip these parameters without losing performance? In the requirements you suggest PyTorch version == 0.4.0, but that version is no longer available, and the code only runs with a newer PyTorch, which triggers the problem above. Can you help solve this? Thanks a lot.

ZhiZZhang commented 4 years ago

Hi. I am not able to reach your reported performance on CUHK (the mAP I get is around 40%). I ran into the same problem, and the same error occurs for similar layers (e.g. layer1.0.bn1.num_batches_tracked, layer1.0.bn2.num_batches_tracked). Are you sure we can skip these parameters without losing performance? In the requirements you suggest PyTorch version == 0.4.0, but that version is no longer available, and the code only runs with a newer PyTorch, which triggers the problem above. Can you help solve this? Thanks a lot.

What I suggest is to skip the parameters named "*.num_batches_tracked", not the layers themselves! These parameters were added in PyTorch versions after 0.4.0, but the pre-trained model from the PyTorch official website (as attached in this repo) does not include them. So if you have to use a newer PyTorch, skipping them when loading the pre-trained model is the only solution I can find at the moment. In theory this should not affect the subsequent re-id training much, but I am not sure about its practical effect.
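To illustrate the idea, here is a minimal sketch of key-skipping when loading old-format weights (the function name copy_pretrained_params and its structure are hypothetical, not the repo's actual loading code, which loads the backbone layer by layer as in the helpers posted further down):

    import torch

    def copy_pretrained_params(model, checkpoint_path):
        # Copy matching parameters from an old-format checkpoint into the model,
        # skipping keys such as '*.num_batches_tracked' that the model defines
        # (PyTorch > 0.4.0) but the old checkpoint does not contain.
        param_dict = torch.load(checkpoint_path, map_location='cpu')
        state_dict = model.state_dict()
        for name in state_dict:
            if name not in param_dict:
                continue  # e.g. bn1.num_batches_tracked is absent from the old checkpoint
            state_dict[name].copy_(param_dict[name])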

HongweiZhang97 commented 4 years ago

Hi, when loading with a newer PyTorch version I skipped this parameter. On CUHK03 my training result matches the paper's mAP, but rank-1 is 1.4 percentage points lower. Could this be caused by skipping it?

ZhiZZhang commented 4 years ago

Hi, when loading with a newer PyTorch version I skipped this parameter. On CUHK03 my training result matches the paper's mAP, but rank-1 is 1.4 percentage points lower. Could this be caused by skipping it?

Possibly. Another possibility is the difference between single-GPU and multi-GPU training.

HongweiZhang97 commented 4 years ago

Hi, thanks for the explanation! I did observe an mAP drop when testing on multiple GPUs, but that is a separate issue. I will try to work out the impact of this parameter later. Thanks for your help!


sky186 commented 4 years ago

@ZhiZZhang Hello, I would like to embed this module into training tasks the way the SE block is used, so that it directly improves other tasks as well, so I have a question about your hyperparameters: opt = adam, epoch = 300, batch = 64, lr_scheduler = LRScheduler(base_lr=0.0008, step=[80, 120, 160, 200, 240, 280, 320, 360], factor=0.5, warmup_epoch=20, warmup_begin_lr=0.000008). Are these the best settings from your experiments? The number of epochs is quite large, so after how many epochs did you reach the best accuracy? When these two modules are added to a network, are there training hyperparameters that are particularly harmful or particularly helpful? I am using fastreid with its project hyperparameters and embedded the RGA-SC structure directly into an MGN-IBN-a network, training for 60 epochs with an initial learning rate of 0.0035 and CosineAnnealingLR (not step decay). The results so far are clearly poor: on DukeMTMC I only get 71% mAP.

Is there a particular rationale behind your hyperparameter choices, or were they tuned empirically? After the RGA-SC module is embedded, other models will have their own optimal hyperparameters for training, and the two usually differ.
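(For reference, a rough sketch of the learning-rate curve those scheduler parameters describe; the linear warmup shape is an assumption here and may not match the repo's LRScheduler exactly:)

    def lr_at_epoch(epoch,
                    base_lr=0.0008,
                    steps=(80, 120, 160, 200, 240, 280, 320, 360),
                    factor=0.5,
                    warmup_epoch=20,
                    warmup_begin_lr=0.000008):
        # Linear warmup from warmup_begin_lr to base_lr over the first warmup_epoch
        # epochs, then multiply the learning rate by `factor` at each milestone in `steps`.
        if epoch < warmup_epoch:
            return warmup_begin_lr + (base_lr - warmup_begin_lr) * epoch / warmup_epoch
        n_decays = sum(1 for s in steps if epoch >= s)
        return base_lr * factor ** n_decays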

jpainam commented 4 years ago

Hi, I used the exact packages you defined in your requirements.txt, and yet I couldn't find bn1.num_batches_tracked in the loaded weights. I decided to skip the parameters as you said, but the results are really far from the ones reported in your paper. I got:

labeled cuhk03 dataset
Evaluated with "feat_" features and "cosine" metric:
Mean AP: 71.6%
CMC Scores
  top-1          77.1%
  top-5          89.5%
  top-10         94.1%
  top-20         96.4%
Evaluated with "feat" features and "cosine" metric:
Mean AP: 66.0%
CMC Scores
  top-1          69.6%
  top-5          85.6%
  top-10         91.4%
  top-20         95.1%

Meanwhile, your paper reports top-1: 81.1 and mAP: 77.4. This is a huge gap. Can you release your checkpoints so we can try them?

PhilChina commented 3 years ago

Hi, when loading with a newer PyTorch version I skipped this parameter. On CUHK03 my training result matches the paper's mAP, but rank-1 is 1.4 percentage points lower. Could this be caused by skipping it?

Hello, could you share how you did this part (skipping this parameter when loading with a newer PyTorch version)? It would be best if you could paste the relevant code.

jpainam commented 3 years ago

@PhilChina I think you can skip the params using this code

    # Skip parameters that are missing from the old pre-trained checkpoint
    # (e.g. *.num_batches_tracked) by catching the KeyError; requires `import torch`.
    def load_partial_param(self, state_dict, model_index, model_path):
        param_dict = torch.load(model_path)
        for i in state_dict:
            try:
                # checkpoint keys are prefixed with the layer name, e.g. 'layer1.'
                key = 'layer{}.'.format(model_index) + i
                state_dict[i].copy_(param_dict[key])
            except KeyError:
                continue
        del param_dict

    def load_specific_param(self, state_dict, param_name, model_path):
        param_dict = torch.load(model_path)
        for i in state_dict:
            try:
                key = param_name + '.' + i
                state_dict[i].copy_(param_dict[key])
            except KeyError:
                continue
        del param_dict

This is what I did; it should work.
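An alternative, assuming the model's parameter names line up exactly with the checkpoint's keys (which is not the case for this repo's layer-by-layer loading, hence the helpers above), is to load non-strictly so that keys absent from the checkpoint, such as the num_batches_tracked buffers, simply keep their default values. A minimal sketch, where model and PRETRAINED_PATH are placeholders:

    import torch

    # strict=False tolerates keys missing from the checkpoint (e.g. *.num_batches_tracked),
    # leaving those buffers at their defaults, and also ignores unexpected checkpoint keys.
    state = torch.load(PRETRAINED_PATH, map_location='cpu')
    missing, unexpected = model.load_state_dict(state, strict=False)
    print('not found in checkpoint:', missing)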

PhilChina commented 3 years ago

OK, thank you very much @jpainam