Optimizer state_dict loaded error in stepwise training

jwyang / graph-rcnn.pytorch

[ECCV 2018] Official code for "Graph R-CNN for Scene Graph Generation"


Optimizer state_dict loaded error in stepwise training #85

Closed qrzou closed 4 years ago

qrzou commented 4 years ago

I pre-trained the object detector with the command python main.py --config-file configs/faster_rcnn_res101.yaml and modified the WEIGHT_DET path parameter in sgg_res101_step.yaml.

However, when I trained the model stepwise with the command python main.py --config-file configs/sgg_res101_step.yaml --algorithm $ALGORITHM, the optimizer's load_state_dict function raised ValueError: loaded state dict has a different number of parameter groups.

Does this pipeline support loading a pre-trained object detector in stepwise training? Thanks!
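For readers who hit the same message, here is a minimal sketch of how PyTorch produces it. It is not the repository's code: the two small modules and their parameter groups are made up for illustration, and the only assumption carried over from the issue is that the detector's optimizer and the stepwise model's optimizer were built with different numbers of parameter groups.

```python
import torch
from torch import nn

# Hypothetical stand-in for the pre-trained detector: one parameter group.
detector = nn.Linear(4, 2)
detector_opt = torch.optim.SGD(detector.parameters(), lr=0.01)
saved_optimizer_state = detector_opt.state_dict()

# Hypothetical stand-in for the stepwise model: two parameter groups.
backbone = nn.Linear(4, 2)
relation_head = nn.Linear(2, 3)
stepwise_opt = torch.optim.SGD(
    [
        {"params": backbone.parameters()},
        {"params": relation_head.parameters(), "lr": 0.02},
    ],
    lr=0.01,
)

# Loading a one-group state into a two-group optimizer is rejected.
message = None
try:
    stepwise_opt.load_state_dict(saved_optimizer_state)
except ValueError as err:
    message = str(err)

print(message)
```

The model weights themselves can still be transferred; it is only the optimizer state that cannot be reused across optimizers with mismatched groups.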

jwyang commented 4 years ago

Hi @qrzou, a quick fix is to comment out the lines starting at https://github.com/jwyang/graph-rcnn.pytorch/blob/d7ca37d1ac8825aa0950a92d063221a1a7042c16/lib/scene_parser/rcnn/utils/checkpoint.py#L67, which I have done in the newest commit.
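An alternative to commenting the lines out is to make the optimizer and scheduler restore optional. The sketch below is only an illustration of that idea, not the contents of checkpoint.py: the function name restore, the resume flag, the checkpoint keys, and the Recorder stand-in class are all hypothetical.

```python
def restore(checkpoint, model, optimizer=None, scheduler=None, resume=False):
    """Load model weights; restore training state only when resuming a run.

    Hypothetical helper: `checkpoint` is assumed to be a dict with a "model"
    entry and optional "optimizer", "scheduler", and "iteration" entries.
    """
    model.load_state_dict(checkpoint["model"])
    if not resume:
        # Starting a new training stage: keep the fresh optimizer and scheduler.
        return 0
    if optimizer is not None and "optimizer" in checkpoint:
        optimizer.load_state_dict(checkpoint["optimizer"])
    if scheduler is not None and "scheduler" in checkpoint:
        scheduler.load_state_dict(checkpoint["scheduler"])
    return checkpoint.get("iteration", 0)


class Recorder:
    """Tiny stand-in for any object exposing load_state_dict."""

    def __init__(self):
        self.loaded = None

    def load_state_dict(self, state):
        self.loaded = state


ckpt = {"model": {"w": 1}, "optimizer": {"lr": 0.01}, "iteration": 5000}
model, optimizer = Recorder(), Recorder()
start_iteration = restore(ckpt, model, optimizer, resume=False)
print(model.loaded, optimizer.loaded, start_iteration)
```

With resume=False only the weights are taken from the detector checkpoint, which is the behaviour stepwise training needs; resume=True keeps the ability to continue an interrupted run.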

qrzou commented 4 years ago

Thank you @jwyang! However, I found no difference between the newest commit and the original code. Should I disable loading the optimizer and scheduler during stepwise training? Also, the object detector checkpoint I trained doesn't include the model's state_dict; it contains only the optimizer, scheduler, and iteration, which confuses me a lot.

jwyang commented 4 years ago

@qrzou, the newest commit should already include the update. See here: https://github.com/jwyang/graph-rcnn.pytorch/blob/b3d6c4f01eb8e7566c28a3dd6a6f8fbc3b7f665f/lib/scene_parser/rcnn/utils/checkpoint.py#L67

To check whether you have successfully obtained the object detector checkpoint, can you first try evaluating the object detection performance? You can also download the checkpoint I shared in the README and try it out as a sanity check.

qrzou commented 4 years ago

@jwyang, the loading error was caused by the argument parser: https://github.com/jwyang/graph-rcnn.pytorch/blob/d7ca37d1ac8825aa0950a92d063221a1a7042c16/main.py#L95. With the README command python main.py --config-file configs/faster_rcnn_res101.yaml, cfg.MODEL.ALGORITHM is set to the default value "sg_baseline", even though the faster_rcnn_res101.yaml file sets ALGORITHM to faster_rcnn. As a result, the faster_rcnn checkpoint path becomes "sg_baseline_joint0" instead of "faster_rcnn", which makes the checkpointer load the optimizer and scheduler in stepwise training. A quick fix is to set the algorithm explicitly when training the object detector: python main.py --config-file configs/faster_rcnn_res101.yaml --algorithm faster_rcnn.
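The override described above can be reproduced without the repository. The following is a rough sketch, assuming the parser gives --algorithm a non-None default and the script then copies it into the config unconditionally; the function names and the plain dict standing in for the YAML config are hypothetical, and main.py itself may differ in detail.

```python
import argparse


def build_cfg_overriding(argv, yaml_cfg):
    """The flag has a string default, so it always replaces the YAML value."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--algorithm", type=str, default="sg_baseline")
    args = parser.parse_args(argv)
    cfg = dict(yaml_cfg)
    cfg["ALGORITHM"] = args.algorithm
    return cfg


def build_cfg_respecting_yaml(argv, yaml_cfg):
    """The flag defaults to None and only overrides when given explicitly."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--algorithm", type=str, default=None)
    args = parser.parse_args(argv)
    cfg = dict(yaml_cfg)
    if args.algorithm is not None:
        cfg["ALGORITHM"] = args.algorithm
    return cfg


# Stand-in for the value read from faster_rcnn_res101.yaml.
yaml_cfg = {"ALGORITHM": "faster_rcnn"}

print(build_cfg_overriding([], yaml_cfg)["ALGORITHM"])       # sg_baseline
print(build_cfg_respecting_yaml([], yaml_cfg)["ALGORITHM"])  # faster_rcnn
print(build_cfg_respecting_yaml(["--algorithm", "sg_imp"], yaml_cfg)["ALGORITHM"])  # sg_imp
```

Defaulting the flag to None keeps the YAML file authoritative unless the user passes the flag, so the README command would no longer need the explicit --algorithm faster_rcnn workaround.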