cpboost closed this issue 4 years ago
I've run into the same problem. Have you solved it yet? If so, could you share how you solved it? Thanks!
Don't use resume. Load the weights only through load_from, and set the checkpoint interval to 12 so the weights are saved once, then train straight through to the end. That way the weights get saved.
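For readers who want to try this workaround, here is a minimal sketch of what the relevant part of an MMDetection-style config might look like. The keys `load_from`, `resume_from`, `checkpoint_config` and `total_epochs` are standard config fields, but the checkpoint path and the choice of 12 total epochs are assumptions made for illustration, not values taken from this thread.

```python
# Sketch of a config fragment for the workaround (illustrative values only).

# Initialize the model from saved weights. Unlike resuming, this does not
# restore the optimizer state or the epoch counter.
load_from = "work_dirs/example/latest.pth"  # hypothetical path

# Leave resuming disabled.
resume_from = None

# Save a checkpoint once every 12 epochs instead of after every epoch.
checkpoint_config = dict(interval=12)

# Assumption: training runs for 12 epochs in total, so the single checkpoint
# is written when training finishes.
total_epochs = 12
```

With these settings no checkpoint is written mid-run, so an interrupted run has to start again from the weights in `load_from`.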
Thanks~ After some investigation I found that the root cause was the PyTorch version. Downgrading PyTorch from 1.6 to 1.5 solved the problem.
A solution for a similar issue on mmsegmentation works. https://github.com/open-mmlab/mmsegmentation/issues/127#issuecomment-692475646
We need to change
https://github.com/open-mmlab/mmdetection/blob/7a404a2c000620d52156774a5025070d9e00d918/mmdet/core/fp16/hooks.py#L46-L47
to runner.model = copy.deepcopy(runner.model)
2020-08-09 13:40:22,676 - mmdet - INFO - Saving checkpoint at 11 epochs
Traceback (most recent call last):
  File "tools/train.py", line 153, in <module>
    main()
  File "tools/train.py", line 149, in main
    meta=meta)
  File "/content/drive/My Drive/mdet2/mmdet/apis/train.py", line 143, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/epoch_based_runner.py", line 46, in train
    self.call_hook('after_train_epoch')
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/dist_utils.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/hooks/checkpoint.py", line 52, in after_train_epoch
    self.out_dir, save_optimizer=self.save_optimizer, **self.args)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/epoch_based_runner.py", line 156, in save_checkpoint
    save_checkpoint(self.model, filepath, optimizer=optimizer, meta=meta)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/checkpoint.py", line 349, in save_checkpoint
    checkpoint['optimizer'] = optimizer.state_dict()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 98, in state_dict
    for k, v in self.state.items()}
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 98, in <dictcomp>
    for k, v in self.state.items()}
KeyError: 140388814569120
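Everything had been running fine until now; this error only appeared today. The number reported in the KeyError makes it very hard for me to debug.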
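On the number in the KeyError: a value like 140388814569120 looks like a Python object `id()`. One plausible explanation, given that the failing frame iterates over `self.state.items()` inside `state_dict`, is that the optimizer's state is still keyed by parameter objects that are no longer the ones listed in its `param_groups`, for example after the groups were replaced by deep copies. The toy model below sketches that mechanism in plain Python. `ToyParam` and `ToyOptimizer` are hypothetical stand-ins written for this illustration. They are not PyTorch or mmcv classes, and the real `state_dict` logic differs in detail.

```python
import copy


class ToyParam:
    """Hypothetical stand-in for a model parameter (not a torch class)."""

    def __init__(self, name):
        self.name = name


class ToyOptimizer:
    """Toy optimizer whose per-parameter state is keyed by parameter objects."""

    def __init__(self, params):
        self.param_groups = [{"params": list(params)}]
        self.state = {}

    def step(self):
        # Create or update per-parameter state, keyed by the parameter object.
        for group in self.param_groups:
            for p in group["params"]:
                self.state.setdefault(p, {"step": 0})["step"] += 1

    def state_dict(self):
        # Build id(param) -> index for the parameters now in param_groups.
        index_of = {}
        for group in self.param_groups:
            for p in group["params"]:
                index_of.setdefault(id(p), len(index_of))
        # Raises KeyError(id(p)) when a state key is absent from param_groups.
        return {index_of[id(p)]: s for p, s in self.state.items()}


opt = ToyOptimizer([ToyParam("w"), ToyParam("b")])
opt.step()
ok = opt.state_dict()  # works: state keys are the params in param_groups

# Swap in deep copies of the groups; the state still refers to the originals.
opt.param_groups = copy.deepcopy(opt.param_groups)
try:
    opt.state_dict()
    error_key = None
except KeyError as exc:
    error_key = exc.args[0]  # an object id, i.e. a large opaque integer

print(type(error_key).__name__)
```

In this toy, the key in the error is the `id()` of one of the original parameter objects, which is why the number carries no readable information on its own.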