salesforce / densecap

BSD 3-Clause "New" or "Revised" License
188 stars 61 forks

When I try running evaluation command, I get error with #4

Closed. asafarevich closed this issue 5 years ago

asafarevich commented 5 years ago

I tried running just the evaluation and got the error below. My GPU is a 1080. CUDA and PyTorch are installed, and I verified that the installation works using the answer at https://stackoverflow.com/questions/48152674/how-to-check-if-pytorch-is-using-the-gpu. I am still learning PyTorch. I switched the device to 0 instead of -1 and was able to get past that error. Why is the device set to -1?

    main()
  File "scripts/test.py", line 244, in main
    test_loader, text_proc = get_dataset(args)
  File "scripts/test.py", line 100, in get_dataset
    learn_mask=args.learn_mask)
  File "/home/hackerman/Github/densecap/data/anet_test_dataset.py", line 38, in __init__
    device=-1)  # put in memory
  File "/home/hackerman/anaconda3/envs/densenet/lib/python3.7/site-packages/torchtext/data/field.py", line 323, in numericalize
    var = torch.tensor(arr, dtype=self.dtype, device=device)
RuntimeError: Device index must not be negative

asafarevich commented 5 years ago

I changed the device from -1 to 0. I am not sure why it was set to -1.
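Some context on the question above: older torchtext releases used `device=-1` to mean "keep the data on the CPU", while newer PyTorch versions reject negative device indices, which is the likely source of the traceback. Note that `device=0` selects the first GPU, not the CPU. Below is a minimal sketch of a translation helper under that assumption; `normalize_device` is a hypothetical name and is not part of densecap or torchtext.

```python
def normalize_device(device):
    """Translate a legacy integer device index into a device string.

    Assumption: -1 follows the old torchtext convention and means CPU.
    The returned string can be passed to torch.device(...).
    """
    if not isinstance(device, int):
        raise TypeError(f"expected an int device index, got {type(device).__name__}")
    if device == -1:
        return "cpu"
    if device >= 0:
        return f"cuda:{device}"
    raise ValueError(f"unsupported device index: {device}")


print(normalize_device(-1))  # legacy "CPU" value
print(normalize_device(0))   # first GPU
```

With a helper like this, the dataset code could keep the vocabulary tensors on the CPU (the `# put in memory` comment suggests that was the intent) instead of moving them to GPU 0.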

asafarevich commented 5 years ago

I am trying to run the evaluation and get the following error. It looks like there is a shape mismatch between the provided checkpoint and the model built by the evaluation script.

building model
Initializing weights from ./checkpoint/anet-2L-gt-mask/model_epoch_19.t7
    main()
  File "scripts/test.py", line 247, in main
    model = get_model(text_proc, args)
  File "scripts/test.py", line 130, in get_model
    map_location=lambda storage, location: storage))
  File "/home/hackerman/anaconda3/envs/densenet/lib/python3.7/site-packages/torch/nn/modules/module.py", line 719, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for ActionPropDenseCap:
    Unexpected key(s) in state_dict: "prop_out.16.0.weight", "prop_out.16.0.bias", "prop_out.16.0.running_mean", "prop_out.16.0.running_var", "prop_out.16.1.weight", "prop_out.16.2.weight", "prop_out.16.2.bias", "prop_out.16.2.running_mean", "prop_out.16.2.running_var", "prop_out.16.3.weight", "prop_out.16.4.weight", "prop_out.16.4.bias", "prop_out.16.4.running_mean", "prop_out.16.4.running_var", "prop_out.16.6.weight", "prop_out.16.6.bias", "prop_out.16.6.running_mean", "prop_out.16.6.running_var", "prop_out.16.7.weight", "prop_out.16.7.bias", "prop_out.17.0.weight", "prop_out.17.0.bias", "prop_out.17.0.running_mean", "prop_out.17.0.running_var", "prop_out.17.1.weight", "prop_out.17.2.weight", "prop_out.17.2.bias", "prop_out.17.2.running_mean", "prop_out.17.2.running_var", "prop_out.17.3.weight", "prop_out.17.4.weight", "prop_out.17.4.bias", "prop_out.17.4.running_mean", "prop_out.17.4.running_var", "prop_out.17.6.weight", "prop_out.17.6.bias", "prop_out.17.6.running_mean", "prop_out.17.6.running_var", "prop_out.17.7.weight", "prop_out.17.7.bias". 
    size mismatch for prop_out.1.1.weight: copying a param of torch.Size([1024, 1, 3]) from checkpoint, where the shape is torch.Size([1024, 1, 2]) in current model.
    size mismatch for prop_out.2.1.weight: copying a param of torch.Size([1024, 1, 5]) from checkpoint, where the shape is torch.Size([1024, 1, 3]) in current model.
    size mismatch for prop_out.3.1.weight: copying a param of torch.Size([1024, 1, 7]) from checkpoint, where the shape is torch.Size([1024, 1, 4]) in current model.
    size mismatch for prop_out.4.1.weight: copying a param of torch.Size([1024, 1, 9]) from checkpoint, where the shape is torch.Size([1024, 1, 5]) in current model.
    size mismatch for prop_out.5.1.weight: copying a param of torch.Size([1024, 1, 11]) from checkpoint, where the shape is torch.Size([1024, 1, 7]) in current model.
    size mismatch for prop_out.6.1.weight: copying a param of torch.Size([1024, 1, 15]) from checkpoint, where the shape is torch.Size([1024, 1, 9]) in current model.
    size mismatch for prop_out.7.1.weight: copying a param of torch.Size([1024, 1, 21]) from checkpoint, where the shape is torch.Size([1024, 1, 11]) in current model.
    size mismatch for prop_out.8.1.weight: copying a param of torch.Size([1024, 1, 27]) from checkpoint, where the shape is torch.Size([1024, 1, 15]) in current model.
    size mismatch for prop_out.9.1.weight: copying a param of torch.Size([1024, 1, 33]) from checkpoint, where the shape is torch.Size([1024, 1, 21]) in current model.
    size mismatch for prop_out.10.1.weight: copying a param of torch.Size([1024, 1, 41]) from checkpoint, where the shape is torch.Size([1024, 1, 29]) in current model.
    size mismatch for prop_out.11.1.weight: copying a param of torch.Size([1024, 1, 49]) from checkpoint, where the shape is torch.Size([1024, 1, 41]) in current model.
    size mismatch for cap_model.decoder.out.weight: copying a param of torch.Size([1011, 1024]) from checkpoint, where the shape is torch.Size([4563, 1024]) in current model.
    size mismatch for cap_model.decoder.out.bias: copying a param of torch.Size([1011]) from checkpoint, where the shape is torch.Size([4563]) in current model.

Here is my command to start the evaluation:

python3 scripts/test.py --cfgs_file ./cfgs/yc2.yml --densecap_eval_file ./tools/densevid_eval/evaluate.py --batch_size 1 \
    --start_from ./checkpoint/anet-2L-gt-mask/model_epoch_19.t7 --n_layers 2 --d_model 1024 --d_hidden 2048 --id anet-2L-gt-mask-19 \
    --stride_factor 50 --in_emb_dropout 0.1 --attn_dropout 0.2 --vis_emb_dropout 0.1 --cap_dropout 0.2 \
    --val_data_folder 'validation' --cuda | tee log/eval-anet-2L-gt-mask-19

LuoweiZhou commented 5 years ago

In the eval script, you need to set the config file to --cfgs_file ./cfgs/anet.yml because the rest of the command is for the ActivityNet dataset rather than the YouCook2 dataset.
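A config/checkpoint mismatch like this can be spotted before calling `load_state_dict` by comparing parameter shapes. The following is a minimal sketch; `diff_state_shapes` is a hypothetical helper, and the sample shapes are copied from the error message above (the shape of the unexpected key is not shown there, so it is left as `None`).

```python
def diff_state_shapes(checkpoint_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns (unexpected, missing, mismatched), where mismatched maps a
    name to (checkpoint_shape, model_shape).
    """
    unexpected = sorted(set(checkpoint_shapes) - set(model_shapes))
    missing = sorted(set(model_shapes) - set(checkpoint_shapes))
    mismatched = {
        name: (checkpoint_shapes[name], model_shapes[name])
        for name in checkpoint_shapes
        if name in model_shapes and checkpoint_shapes[name] != model_shapes[name]
    }
    return unexpected, missing, mismatched


# With PyTorch, and assuming the checkpoint file is a plain state dict,
# the mappings could be built like this:
#   {k: tuple(v.shape) for k, v in torch.load(path, map_location="cpu").items()}
#   {k: tuple(v.shape) for k, v in model.state_dict().items()}
checkpoint_shapes = {
    "prop_out.1.1.weight": (1024, 1, 3),
    "cap_model.decoder.out.bias": (1011,),
    "prop_out.16.0.weight": None,  # shape not shown in the error message
}
model_shapes = {
    "prop_out.1.1.weight": (1024, 1, 2),
    "cap_model.decoder.out.bias": (4563,),
}

unexpected, missing, mismatched = diff_state_shapes(checkpoint_shapes, model_shapes)
print("unexpected keys:", unexpected)
print("missing keys:", missing)
for name, (ckpt, cur) in sorted(mismatched.items()):
    print(f"{name}: checkpoint {ckpt} vs current model {cur}")
```

Differences in both the kernel sizes (`prop_out.*`) and the vocabulary size (`cap_model.decoder.out.*`) usually mean the checkpoint and the config file belong to different datasets.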

asafarevich commented 5 years ago

@LuoweiZhou My apologies. After fixing the command to match yc2, I am able to get past that error. Here is the updated command:

python3 scripts/test.py \
--cfgs_file ./cfgs/yc2.yml \
--densecap_eval_file ./tools/densevid_eval/evaluate.py \
--batch_size 1     \
--start_from ./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7 \
--n_layers=2 \
--d_model 1024 \
--d_hidden 2048 \
--id yc2-2L-e2e-mask-19 \
--stride_factor 50 \
--in_emb_dropout 0.1 \
--attn_dropout 0.2 \
--vis_emb_dropout 0.1 \
--cap_dropout 0.2 \
--val_data_folder 'validation' \
--cuda | tee ~/data/densecap/log/eval-yc2-2L-e2e-mask-19

This is the new error I get:

Namespace(attn_dropout=0.2, batch_size=1, cap_dropout=0.2, cfgs_file='./cfgs/yc2_new.yml', cuda=True, d_hidden=2048, d_model=1024, dataset='yc2', dataset_file='./data/yc2/yc2_annotations_trainval.json', densecap_eval_file='./tools/densevid_eval/evaluate.py', densecap_references=['./data/yc2/val_yc2.json'], dur_file='./data/yc2/yc2_duration_frame.csv', feature_root='$HOME/Github/densecap/data', gated_mask=False, id='yc2-2L-e2e-mask-19', image_feat_size=3072, in_emb_dropout=0.1, kernel_list=[1, 3, 5, 7, 9, 11, 15, 21, 27, 33, 41, 49, 57, 71, 111, 161], learn_mask=False, max_prop_num=500, max_sentence_len=20, min_prop_before_nms=200, min_prop_num=50, n_heads=8, n_layers=2, num_workers=2, pos_thresh=0.7, sampling_sec=0.5, slide_window_size=480, slide_window_stride=20, start_from='./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7', stride_factor=50, val_data_folder='validation', vis_emb_dropout=0.1)
loading dataset
# of words in the vocab: 1011
# of sentences in training: 10337, # of sentences in validation: 3492
# of training videos: 1333
total number of samples (unique videos): 0
total number of sentences: 0
building model
Initializing weights from ./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7
Traceback (most recent call last):
  File "scripts/test.py", line 255, in <module>
    main()
  File "scripts/test.py", line 249, in main
    recall_area = validate(model, test_loader, args)
  File "scripts/test.py", line 201, in validate
    print("average proposal number: {}".format(avg_prop_num/len(loader.dataset)))
ZeroDivisionError: division by zero

My understanding is that this error occurs because the loader cannot find the feature files for the dataset. I updated feature_root in the config file to point to densecap/data, which contains a validation folder, but this does not fix the error.
I also tried densecap/data/validation, and that did not work either.

I am not sure where I am making the mistake.

LuoweiZhou commented 5 years ago

No worries. The log says your feature root is feature_root='$HOME/Github/densecap/data'. Can you check whether $feature_root/validation contains the .npy feature files for YouCook2? Also, sample_list stores the val samples, so you might want to double-check that it is not empty.
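One quick way to run that check is to count the `.npy` files in the split folder. This is a minimal sketch, assuming the features sit directly inside `<feature_root>/validation`; `count_feature_files` is a hypothetical helper, not a function from the repository.

```python
import os
from glob import glob


def count_feature_files(feature_root, split="validation"):
    """Return the number of .npy files found directly under feature_root/split."""
    split_dir = os.path.join(feature_root, split)
    if not os.path.isdir(split_dir):
        # A literal "$HOME" or a typo in the path ends up here.
        raise FileNotFoundError(f"feature folder does not exist: {split_dir}")
    return len(glob(os.path.join(split_dir, "*.npy")))
```

Calling it with the exact string from the config file shows whether the path is usable. A missing folder or a count of zero would be consistent with the "total number of samples (unique videos): 0" line in the log.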

BTW, you're running the e2e model, make sure you set --learn_mask True --gated_mask True
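How these flags are passed depends on how the script declares them. If they are declared as argparse `store_true` flags (an assumption about scripts/test.py, not verified here), the bare flag is enough and a trailing `True` would be rejected as an unrecognized argument. A small self-contained sketch of that behavior:

```python
import argparse

# Stand-in parser: the flag names match the command above, but the
# store_true declaration is an assumption about the real script.
parser = argparse.ArgumentParser()
parser.add_argument("--learn_mask", action="store_true")
parser.add_argument("--gated_mask", action="store_true")

args = parser.parse_args(["--learn_mask", "--gated_mask"])
default_args = parser.parse_args([])

print(args.learn_mask, args.gated_mask)                  # flags given
print(default_args.learn_mask, default_args.gated_mask)  # flags omitted
```

The `learn_mask=False` and `gated_mask=False` entries in the Namespace printout above match what a parser like this produces when the flags are omitted.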

asafarevich commented 5 years ago

Thank you. I mistakenly thought $HOME would get expanded, whereas it did not, so the literal path started with $HOME, which is not a valid directory. I also added --learn_mask --gated_mask; thank you for catching that (neither flag needed True). Thank you for all the help. What am I going to see from the evaluation script? Is it just a printout of the scores?
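The shell expands `$HOME` only in text it parses itself, so a value read from a YAML config file reaches Python as the literal string. A minimal sketch of expanding it explicitly with the standard library; `resolve_feature_root` is a hypothetical helper name.

```python
import os


def resolve_feature_root(raw_path):
    """Expand environment variables and a leading ~ in a configured path."""
    return os.path.expanduser(os.path.expandvars(raw_path))


print(resolve_feature_root("$HOME/Github/densecap/data"))
```

Writing an absolute path in the config file avoids the problem entirely.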

AnukritiSinghh commented 4 years ago

I am getting the same error, "RuntimeError: Device index must not be negative". Is changing the device from -1 to 0 a solution?

AnukritiSinghh commented 4 years ago

I figured out which variables were CUDA tensors and added .cpu() to them, and it worked.