asafarevich closed this issue 5 years ago
I changed the device from -1 to 0; I am not sure why it was set to -1.
I am trying to run the evaluation and I get the following error. It looks like there is a shape mismatch between the provided checkpoint and the model built by the evaluation script.
building model
Initializing weights from ./checkpoint/anet-2L-gt-mask/model_epoch_19.t7
main()
File "scripts/test.py", line 247, in main
model = get_model(text_proc, args)
File "scripts/test.py", line 130, in get_model
map_location=lambda storage, location: storage))
File "/home/hackerman/anaconda3/envs/densenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ActionPropDenseCap:
Unexpected key(s) in state_dict: "prop_out.16.0.weight", "prop_out.16.0.bias", "prop_out.16.0.running_mean", "prop_out.16.0.running_var", "prop_out.16.1.weight", "prop_out.16.2.weight", "prop_out.16.2.bias", "prop_out.16.2.running_mean", "prop_out.16.2.running_var", "prop_out.16.3.weight", "prop_out.16.4.weight", "prop_out.16.4.bias", "prop_out.16.4.running_mean", "prop_out.16.4.running_var", "prop_out.16.6.weight", "prop_out.16.6.bias", "prop_out.16.6.running_mean", "prop_out.16.6.running_var", "prop_out.16.7.weight", "prop_out.16.7.bias", "prop_out.17.0.weight", "prop_out.17.0.bias", "prop_out.17.0.running_mean", "prop_out.17.0.running_var", "prop_out.17.1.weight", "prop_out.17.2.weight", "prop_out.17.2.bias", "prop_out.17.2.running_mean", "prop_out.17.2.running_var", "prop_out.17.3.weight", "prop_out.17.4.weight", "prop_out.17.4.bias", "prop_out.17.4.running_mean", "prop_out.17.4.running_var", "prop_out.17.6.weight", "prop_out.17.6.bias", "prop_out.17.6.running_mean", "prop_out.17.6.running_var", "prop_out.17.7.weight", "prop_out.17.7.bias".
size mismatch for prop_out.1.1.weight: copying a param of torch.Size([1024, 1, 3]) from checkpoint, where the shape is torch.Size([1024, 1, 2]) in current model.
size mismatch for prop_out.2.1.weight: copying a param of torch.Size([1024, 1, 5]) from checkpoint, where the shape is torch.Size([1024, 1, 3]) in current model.
size mismatch for prop_out.3.1.weight: copying a param of torch.Size([1024, 1, 7]) from checkpoint, where the shape is torch.Size([1024, 1, 4]) in current model.
size mismatch for prop_out.4.1.weight: copying a param of torch.Size([1024, 1, 9]) from checkpoint, where the shape is torch.Size([1024, 1, 5]) in current model.
size mismatch for prop_out.5.1.weight: copying a param of torch.Size([1024, 1, 11]) from checkpoint, where the shape is torch.Size([1024, 1, 7]) in current model.
size mismatch for prop_out.6.1.weight: copying a param of torch.Size([1024, 1, 15]) from checkpoint, where the shape is torch.Size([1024, 1, 9]) in current model.
size mismatch for prop_out.7.1.weight: copying a param of torch.Size([1024, 1, 21]) from checkpoint, where the shape is torch.Size([1024, 1, 11]) in current model.
size mismatch for prop_out.8.1.weight: copying a param of torch.Size([1024, 1, 27]) from checkpoint, where the shape is torch.Size([1024, 1, 15]) in current model.
size mismatch for prop_out.9.1.weight: copying a param of torch.Size([1024, 1, 33]) from checkpoint, where the shape is torch.Size([1024, 1, 21]) in current model.
size mismatch for prop_out.10.1.weight: copying a param of torch.Size([1024, 1, 41]) from checkpoint, where the shape is torch.Size([1024, 1, 29]) in current model.
size mismatch for prop_out.11.1.weight: copying a param of torch.Size([1024, 1, 49]) from checkpoint, where the shape is torch.Size([1024, 1, 41]) in current model.
size mismatch for cap_model.decoder.out.weight: copying a param of torch.Size([1011, 1024]) from checkpoint, where the shape is torch.Size([4563, 1024]) in current model.
size mismatch for cap_model.decoder.out.bias: copying a param of torch.Size([1011]) from checkpoint, where the shape is torch.Size([4563]) in current model.
Here is the command I use to start the evaluation:
python3 scripts/test.py --cfgs_file ./cfgs/yc2.yml --densecap_eval_file ./tools/densevid_eval/evaluate.py --batch_size 1 \
--start_from ./checkpoint/anet-2L-gt-mask/model_epoch_19.t7 --n_layers 2 --d_model 1024 --d_hidden 2048 --id anet-2L-gt-mask-19 \
--stride_factor 50 --in_emb_dropout 0.1 --attn_dropout 0.2 --vis_emb_dropout 0.1 --cap_dropout 0.2 \
--val_data_folder 'validation' --cuda | tee log/eval-anet-2L-gt-mask-19
In the eval command, you need to set the config file to --cfgs_file ./cfgs/anet.yml because the rest of the command is for the ActivityNet dataset rather than the YouCook2 dataset.
@LuoweiZhou My apologies. After fixing the command to match yc2, I am able to get past that error. Here is the updated command:
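A mismatch like the one in the log above can be inspected before calling load_state_dict by comparing parameter names and shapes. The following is only a minimal sketch: diff_state_shapes is a hypothetical helper, not part of the repository, and it works on plain {name: shape} dictionaries. With PyTorch, such a dictionary can be built from a state dict with {k: tuple(v.shape) for k, v in state_dict.items()}. The two sides are labelled the way the error message labels them.

```python
def diff_state_shapes(ckpt_shapes, model_shapes):
    """Compare two {param_name: shape} mappings and report the differences."""
    # Keys present only in the checkpoint ("unexpected key(s)").
    unexpected = sorted(k for k in ckpt_shapes if k not in model_shapes)
    # Keys present only in the freshly built model ("missing key(s)").
    missing = sorted(k for k in model_shapes if k not in ckpt_shapes)
    # Keys present on both sides whose shapes differ ("size mismatch").
    mismatched = {
        k: (tuple(ckpt_shapes[k]), tuple(model_shapes[k]))
        for k in ckpt_shapes
        if k in model_shapes and tuple(ckpt_shapes[k]) != tuple(model_shapes[k])
    }
    return unexpected, missing, mismatched


# Shapes for the first two entries are copied from the error message.
# The shape for "prop_out.16.0.weight" is a placeholder: the log only names the key.
ckpt = {
    "prop_out.1.1.weight": (1024, 1, 3),
    "cap_model.decoder.out.weight": (1011, 1024),
    "prop_out.16.0.weight": (1024,),
}
model = {
    "prop_out.1.1.weight": (1024, 1, 2),
    "cap_model.decoder.out.weight": (4563, 1024),
}

unexpected, missing, mismatched = diff_state_shapes(ckpt, model)
print("unexpected:", unexpected)
print("missing:", missing)
for name, (a, b) in sorted(mismatched.items()):
    print("size mismatch for {}: {} vs {}".format(name, a, b))
```

A differing vocabulary-sized dimension (cap_model.decoder.out.weight) together with differing kernel sizes is a strong hint that the config file and the checkpoint belong to different datasets.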
python3 scripts/test.py \
--cfgs_file ./cfgs/yc2.yml \
--densecap_eval_file ./tools/densevid_eval/evaluate.py \
--batch_size 1 \
--start_from ./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7 \
--n_layers=2 \
--d_model 1024 \
--d_hidden 2048 \
--id yc2-2L-e2e-mask-19 \
--stride_factor 50 \
--in_emb_dropout 0.1 \
--attn_dropout 0.2 \
--vis_emb_dropout 0.1 \
--cap_dropout 0.2 \
--val_data_folder 'validation' \
--cuda | tee ~/data/densecap/log/eval-yc2-2L-e2e-mask-19
This is the new error I get:
Namespace(attn_dropout=0.2, batch_size=1, cap_dropout=0.2, cfgs_file='./cfgs/yc2_new.yml', cuda=True, d_hidden=2048, d_model=1024, dataset='yc2', dataset_file='./data/yc2/yc2_annotations_trainval.json', densecap_eval_file='./tools/densevid_eval/evaluate.py', densecap_references=['./data/yc2/val_yc2.json'], dur_file='./data/yc2/yc2_duration_frame.csv', feature_root='$HOME/Github/densecap/data', gated_mask=False, id='yc2-2L-e2e-mask-19', image_feat_size=3072, in_emb_dropout=0.1, kernel_list=[1, 3, 5, 7, 9, 11, 15, 21, 27, 33, 41, 49, 57, 71, 111, 161], learn_mask=False, max_prop_num=500, max_sentence_len=20, min_prop_before_nms=200, min_prop_num=50, n_heads=8, n_layers=2, num_workers=2, pos_thresh=0.7, sampling_sec=0.5, slide_window_size=480, slide_window_stride=20, start_from='./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7', stride_factor=50, val_data_folder='validation', vis_emb_dropout=0.1)
loading dataset
# of words in the vocab: 1011
# of sentences in training: 10337, # of sentences in validation: 3492
# of training videos: 1333
total number of samples (unique videos): 0
total number of sentences: 0
building model
Initializing weights from ./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7
Traceback (most recent call last):
File "scripts/test.py", line 255, in <module>
main()
File "scripts/test.py", line 249, in main
recall_area = validate(model, test_loader, args)
File "scripts/test.py", line 201, in validate
print("average proposal number: {}".format(avg_prop_num/len(loader.dataset)))
ZeroDivisionError: division by zero
My understanding is that this error occurs because the loader cannot find the feature files for the dataset. I updated feature_root in the config file to point to densecap/data, which contains a validation folder, but this does not fix the error. I also tried densecap/data/validation, and that did not work either.
I am not sure where I am making the mistake.
No worries. It says your feature root is feature_root='$HOME/Github/densecap/data'. Can you check whether $feature_root/validation contains the .npy feature files for YouCook2? Furthermore, sample_list stores the val samples, so you might want to double-check that it is not empty.
BTW, since you're running the e2e model, make sure you set --learn_mask True --gated_mask True
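The check suggested above can be sketched in a few lines. This is an illustration only: count_feature_files and safe_average are hypothetical helpers, and the layout feature_root/validation/*.npy is assumed from this thread rather than taken from the repository code.

```python
import tempfile
from pathlib import Path


def count_feature_files(feature_root, split="validation", suffix=".npy"):
    """Count files ending in `suffix` directly inside feature_root/split."""
    split_dir = Path(feature_root) / split
    if not split_dir.is_dir():
        # A path such as a literal "$HOME/..." ends up here.
        return 0
    return sum(1 for p in split_dir.iterdir() if p.is_file() and p.suffix == suffix)


def safe_average(total, count):
    """Return total / count, or None when there is nothing to average."""
    return total / count if count else None


# Demo on a temporary directory holding two dummy (empty) feature files.
with tempfile.TemporaryDirectory() as root:
    val_dir = Path(root) / "validation"
    val_dir.mkdir()
    (val_dir / "video_a.npy").touch()
    (val_dir / "video_b.npy").touch()
    print(count_feature_files(root))

# A missing directory yields 0 instead of an exception.
print(count_feature_files("$HOME/this/path/does/not/exist"))
print(safe_average(10, 0))
```

A count of zero here would explain both the "total number of samples (unique videos): 0" line in the log and the later ZeroDivisionError.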
Thank you. I mistakenly thought $HOME would get expanded, whereas it was not, so the path was taken literally as $HOME/..., which is not a valid directory.
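Shell variables are only expanded by the shell; a string read from a config file keeps them verbatim. As a minimal sketch of how a path could be resolved explicitly in Python, the standard library offers os.path.expandvars and os.path.expanduser. The helper name resolve_path is hypothetical and the repository may handle paths differently.

```python
import os


def resolve_path(raw):
    """Expand $VARS and a leading ~ in `raw`, then return an absolute path."""
    expanded = os.path.expandvars(raw)      # "$HOME/x" -> "/home/user/x" when HOME is set
    expanded = os.path.expanduser(expanded)  # "~/x" -> "/home/user/x"
    return os.path.abspath(expanded)


raw = "$HOME/Github/densecap/data"
print("as written:", raw)
print("resolved:  ", resolve_path(raw))

# Variables that are not set are left untouched by os.path.expandvars,
# so an unresolved "$" in the result is worth flagging.
leftover = os.path.expandvars("$SOME_UNSET_VARIABLE_12345/data")
print("unresolved variable remains:", "$" in leftover)
```

Writing the absolute path directly in the config file avoids the problem entirely.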
Also, I added --learn_mask --gated_mask. Thank you for catching that (the flags did not need True after them).
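The behaviour described here (the flags take no value) is what argparse gives for options declared with action="store_true". The sketch below assumes that is how the two flags are defined; the actual definitions in scripts/test.py are not shown in this thread.

```python
import argparse


def build_parser():
    """Build a parser with two boolean switches (assumed definitions)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--learn_mask", action="store_true")
    parser.add_argument("--gated_mask", action="store_true")
    return parser


parser = build_parser()

# Passing the bare flag turns the option on.
print(parser.parse_args(["--learn_mask", "--gated_mask"]))

# Omitting the flags leaves both options off.
print(parser.parse_args([]))

# A trailing "True" is not consumed by a store_true flag; parse_known_args
# returns it as a leftover argument (plain parse_args would exit with an error).
namespace, leftovers = parser.parse_known_args(["--learn_mask", "True"])
print(namespace.learn_mask, leftovers)
```

This also matches the Namespace printed in the earlier log, where learn_mask and gated_mask are False because the flags were not passed.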
Thank you for all the help.
What am I going to see from the evaluation script? Is it just a printout of the scores?
I am getting the same error: "RuntimeError: Device index must not be negative". Is changing the device from -1 to 0 a solution?
I found the variables that were CUDA tensors and added .cpu() to them, and it worked.
I tried running just the evaluation and I get this error. My GPU is a 1080. CUDA and PyTorch are installed, and I verified that the installation works using the answer from https://stackoverflow.com/questions/48152674/how-to-check-if-pytorch-is-using-the-gpu. I am still learning PyTorch. I switched the device to 0 instead of -1 and was able to get past that error. Why is the device set to -1?