neptune-ai / open-solution-mapping-challenge

Open solution to the Mapping Challenge :earth_americas:
https://www.crowdai.org/challenges/mapping-challenge
MIT License

Error while loading best.torch! #192

Closed carbonox-infernox closed 5 years ago

carbonox-infernox commented 5 years ago

When resuming my training, I get the following error:

RuntimeError: Error(s) in loading state_dict for DataParallel: Missing key(s) in state_dict: "module.encoder.conv1.weight", "module.encoder.bn1.weight", "module.encoder.bn1.bias", "module.encoder.bn1.running_mean", "module.encoder.bn1.running_var", "module.encoder.layer1.0.conv1.weight", "module.encoder.layer1.0.bn1.weight", "module.encoder.layer1.0.bn1.bias", "module.encoder.layer1.0.bn1.running_mean", "module.encoder.layer1.0.bn1.running_var", "module.encoder.layer1.0.conv2.weight", "module.encoder.layer1.0.bn2.weight", "module.encoder.layer1.0.bn2.bias", "module.encoder.layer1.0.bn2.running_mean", "module.encoder.layer1.0.bn2.running_var", "module.encoder.layer1.0.conv3.weight", "module.encoder.layer1.0.bn3.weight", "module.encoder.layer1.0.bn3.bias", "module.encoder.layer1.0.bn3.running_mean", "module.encoder.layer1.0.bn3.running_var", "module.encoder.layer1.0.downsample.0.weight"...

And then it goes on like that for a long time. I've had this happen to me before, after changing the meta_train sample size, so I decided to see whether it would happen if I didn't change anything. I aborted training mid-epoch (so it wasn't writing to best.torch) and then started training again immediately on the same machine without changing anything.

It happened anyway. What can I do about this?

Also, while we're here, is it normal for the model to only be at 0.719 average precision and 0.794 average recall after 17 epochs? The best it ever did was after the 10th epoch, with 0.768 and 0.833 respectively. It tied that again after the 13th, but it's just not improving.

carbonox-infernox commented 5 years ago

I found the source of the problem. For every missing key, e.g. module.encoder.conv1.weight, there is an unexpected key module.module.encoder.conv1.weight. So the name of each weight in the checkpoint has been prefixed with an additional "module."
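For anyone who wants to confirm the same mismatch on their own checkpoint, here is a minimal sketch that compares the two key sets. The helper name `diff_keys` is made up for illustration, and the commented usage assumes best.torch contains a plain state_dict, which may not hold for every setup.

```python
def diff_keys(expected_keys, checkpoint_keys):
    """Return (missing, unexpected) key lists, both sorted.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys the checkpoint has but the model does not expect
    """
    expected, found = set(expected_keys), set(checkpoint_keys)
    return sorted(expected - found), sorted(found - expected)


# Hypothetical usage with PyTorch (assumes the file stores a state_dict):
#   import torch
#   state = torch.load("best.torch", map_location="cpu")
#   missing, unexpected = diff_keys(model.state_dict().keys(), state.keys())
#   print(missing[:5], unexpected[:5])
```

If every unexpected key is just a missing key with one more "module." in front, the checkpoint was most likely saved from a model that was wrapped one extra time.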

The solution is given here: https://github.com/pytorch/pytorch/issues/3805
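A minimal sketch of one common workaround for this kind of mismatch, not necessarily the exact code from the linked issue: rename the keys so the doubled "module.module." prefix becomes a single "module." before loading. The helper name `strip_extra_prefix` is hypothetical, and the commented usage again assumes best.torch holds a plain state_dict.

```python
from collections import OrderedDict


def strip_extra_prefix(state_dict, prefix="module."):
    """Return a copy of state_dict with one leading `prefix` removed
    from every key that starts with the prefix twice in a row."""
    doubled = prefix + prefix
    fixed = OrderedDict()
    for key, value in state_dict.items():
        if key.startswith(doubled):
            key = key[len(prefix):]  # drop only the first, extra prefix
        fixed[key] = value
    return fixed


# Hypothetical usage with PyTorch (assumes the file stores a state_dict):
#   import torch
#   state = torch.load("best.torch", map_location="cpu")
#   model.load_state_dict(strip_extra_prefix(state))
```

Keys that carry only a single "module." are left untouched, so the function is safe to run on a checkpoint that is already correct.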