Issue on training retrieval task---- model resume problem and dataset password

2000222 commented 4 years ago

Dear authors, I successfully tested the first VGG model for in-shop retrieval and then tried to train it following the instructions in GETTING_STARTED.md, but I ran into the following issue:

`python tools/train_retriever.py --config configs/retriever_in_shop/roi_retriever_vgg.py --resume_from checkpoint/Retrieve/vgg/roi/with_attr/latest.pth`

2020-06-17 12:36:24,767 - INFO - Distributed training: False
2020-06-17 12:36:25,619 - INFO - load model from: checkpoint/vgg16.pth
pretrained model checkpoint/vgg16.pth
The model and loaded state dict do not match exactly

unexpected key in source state_dict: classifier.0.weight, classifier.0.bias, classifier.3.weight, classifier.3.bias, classifier.6.weight, classifier.6.bias

model built
dataset loaded
dataloader built
model paralleled
2020-06-17 12:36:31,942 - INFO - load checkpoint from checkpoint/Retrieve/vgg/roi/with_attr/latest.pth
2020-06-17 12:36:32,862 - WARNING - The model and loaded state dict do not match exactly

size mismatch for attr_predictor.linear_attr.weight: copying a param with shape torch.Size([1000, 4096]) from checkpoint, the shape in current model is torch.Size([463, 4096]).
size mismatch for attr_predictor.linear_attr.bias: copying a param with shape torch.Size([1000]) from checkpoint, the shape in current model is torch.Size([463]).
missing keys in source state_dict: embed_extractor.embed_linear.weight, embed_extractor.embed_linear.bias, embed_extractor.bn.weight, embed_extractor.bn.bias, embed_extractor.bn.running_mean, embed_extractor.bn.running_var, embed_extractor.id_linear.weight, embed_extractor.id_linear.bias

Traceback (most recent call last):
  File "tools/train_retriever.py", line 85, in <module>
    main()
  File "tools/train_retriever.py", line 81, in main
    logger=logger)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/mmfashion-0.4.0-py3.7.egg/mmfashion/apis/train_retriever.py", line 54, in train_retriever
    _non_dist_train(model, dataset, cfg, validate=validate)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/mmfashion-0.4.0-py3.7.egg/mmfashion/apis/train_retriever.py", line 85, in _non_dist_train
    runner.resume(cfg.resume_from)
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/mmcv-0.5.9-py3.7-linux-x86_64.egg/mmcv/runner/runner.py", line 334, in resume
    self.optimizer.load_state_dict(checkpoint['optimizer'])
  File "/home/ubuntu/miniconda3/lib/python3.7/site-packages/torch/optim/optimizer.py", line 116, in load_state_dict
    raise ValueError("loaded state dict contains a parameter group "
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
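Editorial note: the warnings above report two kinds of disagreement between the checkpoint and the model being built (size mismatches and missing keys). The mismatch likely also explains why restoring the optimizer state fails. A minimal sketch of how such a comparison could be made offline is shown here. The helper name `compare_shapes` is hypothetical and is not part of mmfashion or mmcv. Only shapes printed in the log are used, and keys whose shapes the log does not show are marked `None`.

```python
def compare_shapes(model_shapes, ckpt_shapes):
    """Compare two {param_name: shape} mappings and group the differences.

    Hypothetical helper for illustration only; not an mmfashion/mmcv API.
    """
    # Parameters the model expects but the checkpoint does not provide.
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    # Parameters the checkpoint provides but the model does not use.
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    # Parameters present on both sides with different shapes:
    # name -> (shape in checkpoint, shape in model).
    mismatched = {
        k: (ckpt_shapes[k], model_shapes[k])
        for k in model_shapes
        if k in ckpt_shapes and ckpt_shapes[k] != model_shapes[k]
    }
    return {"missing": missing, "unexpected": unexpected, "mismatched": mismatched}


# Shapes copied from the log above. With PyTorch, such mappings could be
# built as {k: tuple(v.shape) for k, v in state_dict.items()}.
model_shapes = {
    "attr_predictor.linear_attr.weight": (463, 4096),
    "attr_predictor.linear_attr.bias": (463,),
    "embed_extractor.embed_linear.weight": None,  # shape not shown in the log
}
ckpt_shapes = {
    "attr_predictor.linear_attr.weight": (1000, 4096),
    "attr_predictor.linear_attr.bias": (1000,),
}

report = compare_shapes(model_shapes, ckpt_shapes)
for name, (in_ckpt, in_model) in report["mismatched"].items():
    print(f"size mismatch for {name}: checkpoint {in_ckpt}, model {in_model}")
print("missing keys:", report["missing"])
```

A checkpoint whose attribute head has 1000 outputs while the configured model has 463 was trained with a different configuration, which is consistent with the reply below that the wrong checkpoint was loaded.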

So I suspect that either I used the wrong pretrained model (I used the second retrieval VGG model, "latest.pth", from your MODEL_ZOO.md) or I need to change some code in train_retriever.py. Another problem: it seems a password is needed to unzip the Custom_to_shop retrieval dataset. Could you share the password? Thanks!

I hope to find a solution for this amazing task. Thanks a lot!

veralauee commented 4 years ago

There are 2 problems in your implementation:

1. If you want to retrain a retrieval model, you only need to initialize the weights from "vgg16.pth". You should not load the pretrained checkpoint, so make sure that "resume_from" in the config file is "None".
2. If you want to fine-tune from our pretrained model, I think you loaded the wrong checkpoint.
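Editorial note: the two cases in the reply can be sketched as config settings. This is a hedged illustration, not the project's actual config file. `resume_from` is named in the reply and in the traceback (`runner.resume(cfg.resume_from)`). `load_from` is an assumption based on the usual mmcv-style config layout and may not be handled the same way in mmfashion. The fine-tuning path is elided because the thread does not say which checkpoint is the right one.

```python
# Sketch of the checkpoint-related settings in a config such as
# configs/retriever_in_shop/roi_retriever_vgg.py (layout assumed).

# Case 1: retrain from the VGG16 backbone weights only.
# Resuming restores both the model weights and the optimizer state,
# which is the step that raised the ValueError in the traceback.
resume_from = None  # named in the reply: must be None when retraining
load_from = None    # assumption: mmcv-style key for weights-only loading

# Case 2: fine-tune from a released retrieval checkpoint.
# Keep resume_from = None and point the weights-only key (if supported)
# at a checkpoint trained with the same attribute count as this config.
# load_from = "..."  # path intentionally left elided
```

With `resume_from` set to `None` in the config, the training command from the question would also be run without the `--resume_from` flag, i.e. `python tools/train_retriever.py --config configs/retriever_in_shop/roi_retriever_vgg.py`.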

veralauee commented 4 years ago

Closed since no active comments.

open-mmlab / mmfashion

Issue on training retrieval task---- model resume problem and dataset password #69