mikeizbicki / cmc-csci181-deeplearning

deep learning course materials

Problems with implementing multiple warm starts #5

Open benfig1127 opened 4 years ago

benfig1127 commented 4 years ago

I am having trouble loading multiple models with the warm start portion of the code. I can warm start one model without problems, but if I try to warm start a second time, on the second model, it throws the error below. In this case I renamed my first model "model1" and then used the following command:

python3 names.py --warm_start /home/ben/log/model1 --train --learning_rate=.1

This works fine and produces the expected TensorBoard output. [TensorBoard screenshot from 2020-03-19 attached] The next warm start attempt is where I run into problems. I then renamed this new model "model2" and tried to warm start from model2 to lower the learning rate to .01 with the following command:

python3 names.py --warm_start /home/ben/log/model2 --train --learning_rate=.01

This is where the error occurs.

warm starting model from /home/ben/log/model2
Traceback (most recent call last):
  File "names.py", line 177, in <module>
    model.load_state_dict(model_dict['model_state_dict'])
  File "/home/ben/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 830, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model:
    Missing key(s) in state_dict: "rnn.weight_ih_l0", "rnn.weight_hh_l0", "rnn.bias_ih_l0", "rnn.bias_hh_l0", "output.weight", "output.bias". 
    Unexpected key(s) in state_dict: "cnn.weight", "cnn.bias", "fc.weight", "fc.bias". 

I tried inspecting the contents of the model2 folder, which I have also attached, but I do not understand what this error is telling me. If anyone has any thoughts on how I can fix this, please let me know. Thanks!

log.zip

benfig1127 commented 4 years ago

@mikeizbicki I am also not sure how to add the HW6 label to this question; if you can show me how to do that, I would appreciate it.

mikeizbicki commented 4 years ago

I have a short answer and a long answer for what I think your problem is.

The short answer: if you add to the command line all of the parameters that you used in the original call to names.py, I believe this will fix the problem. For example, if you used --model=cnn in the first round of training, then you should also pass --model=cnn on every subsequent round.
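Concretely, a second warm start might look something like this (every flag from the original run restated, with only the learning rate changed; this assumes the first run used --model=cnn and no other flags):

python3 names.py --warm_start /home/ben/log/model2 --train --model=cnn --learning_rate=.01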

The long answer: the saved parameters are stored in a file called model inside the folder for each training run. In this file, there is a variable called state_dict that holds the parameters of the model. The key lines that do this loading are:

if args.warm_start:
    print('warm starting model from',args.warm_start)
    # load the saved checkpoint and restore its parameters into the current model
    model_dict = torch.load(os.path.join(args.warm_start,'model'))
    model.load_state_dict(model_dict['model_state_dict'])

The error message says that there is a problem with your state_dict variable, and the traceback confirms that the error happens on the model.load_state_dict line above. To decipher the error message, look at the last two lines:

Missing key(s) in state_dict: "rnn.weight_ih_l0", "rnn.weight_hh_l0", "rnn.bias_ih_l0", "rnn.bias_hh_l0", "output.weight", "output.bias". 
Unexpected key(s) in state_dict: "cnn.weight", "cnn.bias", "fc.weight", "fc.bias". 

Notice that it says the rnn.X variables are missing from state_dict, but that some cnn.X variables were found instead. At this point in the code, the model has already been defined, and you are defining an rnn model because that is the default model type and you have not manually specified one. If you had manually specified the cnn model, then when load_state_dict is called, it would know to look for the variables named cnn.X.
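To see where these key names come from, here is a minimal sketch (the attribute names rnn/output and cnn/fc mirror the error message, but these classes and their layer sizes are made up for illustration):

import torch.nn as nn

# hypothetical stand-ins for the two model types; sizes are arbitrary
class RnnModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rnn = nn.RNN(16, 32)       # keys: rnn.weight_ih_l0, rnn.weight_hh_l0, rnn.bias_ih_l0, rnn.bias_hh_l0
        self.output = nn.Linear(32, 8)  # keys: output.weight, output.bias

class CnnModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Conv1d(16, 32, 3) # keys: cnn.weight, cnn.bias
        self.fc = nn.Linear(32, 8)      # keys: fc.weight, fc.bias

# loading a cnn checkpoint into an rnn model reproduces the error above:
# RuntimeError: Missing key(s) "rnn.*" ... Unexpected key(s) "cnn.*" ...
RnnModel().load_state_dict(CnnModel().state_dict())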

So why did it work on the first warm start but not the second?

Also inside the folder is a file named args, which stores all the command line arguments that were passed in at training time. When warm starting, the first thing the code does is load these command line flags into the args variable using the code:

# load args from file if warm starting
if args.warm_start is not None:
    import sys
    import os
    args_orig = args
    # the '@' prefix makes argparse read flags from the saved args file
    # (the parser must be created with fromfile_prefix_chars='@');
    # flags on the current command line come later and override saved ones
    args = parser.parse_args(['@'+os.path.join(args.warm_start,'args')]+sys.argv[1:])
    args.train = args_orig.train

Unfortunately, when you warm start the model a second time, the args file that you are warm starting from does not contain the parameters from the original run. When saving a model that was itself warm started, I should have copied the original args file into the new save directory, but I didn't do that. So when you warm start more than once, you need to fully specify all the parameters yourself.

If all that makes sense, then this would actually be a fairly easy bug to fix (probably just 2-3 lines of code to copy the file).
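A minimal sketch of that fix, assuming the new run's output directory is in a variable called log_dir (the actual variable name in names.py may differ) and that the args file stores one flag per line, which is the format argparse reads back via the '@' prefix:

import os
import sys

if args.warm_start is not None:
    # carry the original run's flags forward, then append this run's flags;
    # when argparse sees the same flag twice, the later value wins
    with open(os.path.join(args.warm_start, 'args')) as f:
        orig_flags = f.read()
    with open(os.path.join(log_dir, 'args'), 'w') as f:
        f.write(orig_flags)
        f.write('\n'.join(sys.argv[1:]) + '\n')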

benfig1127 commented 4 years ago

@mikeizbicki Awesome. It took me a while to understand what was going on, and I tried to fix the bug (rather unsuccessfully), but just restating all the parameters works fine. However, my model is stuck at roughly 65% accuracy after multiple long training runs; is this normal? I investigated the accuracy by running inference multiple times and found that the model was very good at predicting simple names, e.g. "Wong", "Dominguez", "Lewandoski", but for more ambiguous names it was very rarely right. Any suggestions or help would be greatly appreciated.

mikeizbicki commented 4 years ago

65% accuracy is pretty low for this problem. I get 78% with the command

$ python3 names.py --train --gradient_clipping --model=gru --learning_rate=1e-1 --batch_size=10 --hidden_layer_size=128 --num_layers=1 --samples=100000

And then by reducing the learning rate to 1e-2 and warm starting I get 92%, and then using learning rate 1e-3 I get 95%.
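Per the discussion above, those follow-up warm starts need to restate every flag from the first run; something like this (the log path here is illustrative):

$ python3 names.py --warm_start=/path/to/first/run --train --gradient_clipping --model=gru --learning_rate=1e-2 --batch_size=10 --hidden_layer_size=128 --num_layers=1 --samples=100000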

benfig1127 commented 4 years ago

Okay, I will try out the above command and see whether the problem was my hyperparameters or something wrong in my code.