open-mmlab / mmdetection

OpenMMLab Detection Toolbox and Benchmark
https://mmdetection.readthedocs.io
Apache License 2.0

Failed to save weights #3517

Closed cpboost closed 4 years ago

cpboost commented 4 years ago

2020-08-09 13:40:22,676 - mmdet - INFO - Saving checkpoint at 11 epochs
Traceback (most recent call last):
  File "tools/train.py", line 153, in <module>
    main()
  File "tools/train.py", line 149, in main
    meta=meta)
  File "/content/drive/My Drive/mdet2/mmdet/apis/train.py", line 143, in train_detector
    runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/epoch_based_runner.py", line 122, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/epoch_based_runner.py", line 46, in train
    self.call_hook('after_train_epoch')
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/base_runner.py", line 282, in call_hook
    getattr(hook, fn_name)(self)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/dist_utils.py", line 93, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/hooks/checkpoint.py", line 52, in after_train_epoch
    self.out_dir, save_optimizer=self.save_optimizer, **self.args)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/epoch_based_runner.py", line 156, in save_checkpoint
    save_checkpoint(self.model, filepath, optimizer=optimizer, meta=meta)
  File "/usr/local/lib/python3.6/dist-packages/mmcv/runner/checkpoint.py", line 349, in save_checkpoint
    checkpoint['optimizer'] = optimizer.state_dict()
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 98, in state_dict
    for k, v in self.state.items()}
  File "/usr/local/lib/python3.6/dist-packages/torch/optim/optimizer.py", line 98, in <dictcomp>
    for k, v in self.state.items()}
KeyError: 140388814569120

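Training had been running fine until now; this error only appeared today. The KeyError reports nothing but a number, which makes it very hard for me to debug.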

JSH261663 commented 4 years ago

I ran into the same problem. Have you solved it yet? If so, could you share how you fixed it? Thanks!

cpboost commented 4 years ago

Don't use resume. Load the weights only through load_from, and set the checkpoint interval to 12 epochs so that training runs straight through to the end. That way the weights can be saved.
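For anyone trying the same workaround, it might look roughly like this in an mmdetection 2.x config file. This is only a sketch: the checkpoint path is a hypothetical placeholder, and the key names assume the config layout used by mmdetection at the time of this issue.

```python
# Workaround sketch: start from saved weights without restoring optimizer state.
# The path below is a placeholder, not a file from this issue.
load_from = 'work_dirs/my_model/epoch_11.pth'  # loads model weights only
resume_from = None                             # do not resume optimizer/epoch state

total_epochs = 12
# Save a checkpoint only once, after the final epoch.
checkpoint_config = dict(interval=12)
```

Note that load_from restarts the epoch counter and the optimizer state, so the learning-rate schedule starts over as well.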

JSH261663 commented 4 years ago

Thanks! After some investigation I found that the root cause is the PyTorch version. Downgrading PyTorch from 1.6 to 1.5 solved the problem for me.

shinya7y commented 4 years ago

A solution for a similar issue on mmsegmentation works. https://github.com/open-mmlab/mmsegmentation/issues/127#issuecomment-692475646

We need to change https://github.com/open-mmlab/mmdetection/blob/7a404a2c000620d52156774a5025070d9e00d918/mmdet/core/fp16/hooks.py#L46-L47 to runner.model = copy.deepcopy(runner.model)
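To illustrate why the KeyError shows only a large integer, here is a toy model in plain Python. It is not PyTorch or mmcv code: ToyParam and ToyOptimizer are hypothetical stand-ins. It assumes that the optimizer keys its per-parameter state by the parameter object and, when building state_dict(), looks up each key's id() among the parameters listed in param_groups. Under that assumption, deep-copying param_groups after the state has been filled (for example after resuming) leaves the state keyed by objects that param_groups no longer contains.

```python
import copy


class ToyParam:
    """Stand-in for a tensor parameter; only object identity matters here."""

    def __init__(self, value):
        self.value = value


class ToyOptimizer:
    """Toy model (not PyTorch code) of an optimizer with identity-keyed state."""

    def __init__(self, params):
        self.param_groups = [{"params": list(params), "lr": 0.1}]
        self.state = {}

    def step(self):
        # Per-parameter state is keyed by the parameter object itself.
        for group in self.param_groups:
            for p in group["params"]:
                self.state[p] = {"momentum_buffer": 0.0}

    def state_dict(self):
        # Map each parameter's id() to its position in param_groups,
        # then re-key the state by that position.
        index_of = {}
        for group in self.param_groups:
            for p in group["params"]:
                index_of.setdefault(id(p), len(index_of))
        return {index_of[id(p)]: s for p, s in self.state.items()}


opt = ToyOptimizer([ToyParam(1.0), ToyParam(2.0)])
opt.step()
keys_before = sorted(opt.state_dict())  # works while identities match

# Replace param_groups with deep copies: the state still holds the old objects.
opt.param_groups = copy.deepcopy(opt.param_groups)
try:
    opt.state_dict()
    raised = False
except KeyError as err:
    raised = True
    missing_key = err.args[0]  # an id() value, i.e. a bare integer
```

In this toy, the lookup fails with the id() of a stale parameter object, which is why the message carries no parameter name.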