salesforce / densecap


When running densecap with a single video which has an empty list of annotations. #6

Closed asafarevich closed 5 years ago

asafarevich commented 5 years ago

I get the following error

RuntimeError: Error(s) in loading state_dict for ActionPropDenseCap:
    While copying the parameter named "cap_model.decoder.out.weight", whose dimensions in the model are torch.Size([4, 1024]) and whose dimensions in the checkpoint are torch.Size([1011, 1024]).
    While copying the parameter named "cap_model.decoder.out.bias", whose dimensions in the model are torch.Size([4]) and whose dimensions in the checkpoint are torch.Size([1011]).

I traced it back to the vocab size of text_proc being 4 (I'm not sure why the items in the dataset determine the shape of the model), which happens when I give it an empty list of annotations. Adding some annotations increases the vocab size. Should I add 1007 (1011 - 4) vocab words to the annotations, or is there a way to bypass the 1011 limitation? Is the vocab really only 1011 different words?
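To illustrate what I think is going on, here is a minimal sketch (not the repo's actual code; build_decoder_out is a hypothetical stand-in for cap_model.decoder.out): the decoder's output layer is sized by the vocab built from the annotations, so an empty annotation list leaves only the special tokens (~4 entries) and the weights no longer match the pre-trained checkpoint.

import torch.nn as nn

d_model = 1024

def build_decoder_out(vocab_size):
    # hypothetical stand-in for cap_model.decoder.out
    return nn.Linear(d_model, vocab_size)

trained = build_decoder_out(1011)      # shape used by the yc2 checkpoint
checkpoint = {"weight": trained.weight.data, "bias": trained.bias.data}

fresh = build_decoder_out(4)           # vocab built from an empty annotation list
try:
    fresh.load_state_dict(checkpoint)  # reproduces the size-mismatch error above
except RuntimeError as e:
    print(e)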

LuoweiZhou commented 5 years ago

@Anton-Velikodnyy can you post your full command here?

asafarevich commented 5 years ago
python scripts/make_captions.py \
  --cfgs_file ./cfgs/yc2.yml \
  --densecap_eval_file ./tool/densevid_eval/evaluate.py \
  --batch_size 1 \
  --start_from ./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7 \
  --n_layer=2 \
  --d_model 1024 \
  --d_hidden 2048 \
  --id yc2-2L-e2e-mask-19 \
  --stride_factor 50 \
  --in_emb_dropout 0.1 \
  --attn_dropout 0.2 \
  --vis_emb_dropout 0.1 \
  --cap_dropout 0.2 \
  --val_data_folder validation \
  --learn_mask \
  --gated_mask \
  --cuda | tee ./data/log/eval-yc2-2L-e2e-mask-19

yc2 config file

# dataset specific settings for yc2
dataset: "yc2"
dataset_file: "video_feature/plastering.json"
feature_root: "video_feature"
densecap_references: ["./data/yc2/val_yc2.json"]
dur_file: "./data/yc2/yc2_duration_frame.csv"
kernel_list: [1, 3, 5, 7, 9, 11, 15, 21, 27, 33, 41, 49, 57, 71, 111, 161]

I did not see a reason to change densecap_references or dur_file. Here is video_feature/plastering.json (rearranged for readability):

{"database":
 {"plastering": 
   {"duration": "53.987267",
    "subset": "validation", 
    "annotations": [ ], 
    "video_url": "/app/data/plastering.avi"}}}                                    

Actually, now that I am reviewing this, should I be using the masked transformer model instead of the end-to-end one? If so, what's the difference between the two?

LuoweiZhou commented 5 years ago

If you're working with a new dataset while using the model pre-trained on yc2, you will need to revise the code to encode/numericalize your sentences with the yc2 vocab, see here. The words in your dataset should also appear in the yc2 vocab (otherwise they will be converted to "UNK"). If you decide to augment the yc2 vocab with your own vocab, you need to fine-tune the model on your data.
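For example, here is a rough sketch of how the yc2 vocab could be reused (assuming text_proc is the torchtext field the repo builds on the yc2 train+val annotations; yc2_vocab.pkl and the helper names are illustrative, not part of the repo):

import pickle

def save_vocab(text_proc, path="yc2_vocab.pkl"):
    # run once, right after text_proc is built on the yc2 annotations
    with open(path, "wb") as f:
        pickle.dump(text_proc.vocab, f)

def load_vocab(path="yc2_vocab.pkl"):
    with open(path, "rb") as f:
        return pickle.load(f)

def numericalize(sentence, vocab):
    # tokens missing from the yc2 vocab fall back to the <unk> index,
    # which is what turns unseen words into "UNK" at decode time
    unk = vocab.stoi.get("<unk>", 0)
    return [vocab.stoi.get(tok, unk) for tok in sentence.lower().split()]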

asafarevich commented 5 years ago
  1. If I want to make a prediction using the model, would the part that parses the video captions just use the entire vocabulary corpus?
  2. Also, why is the vocab so small? 1011 words isn't a lot. Is that all the words that were in the yc2 captions?
LuoweiZhou commented 5 years ago
  1. For now, every time we run train/eval, the vocab is generated from the current train+val data. But you can revise it so that a pre-specified vocab is loaded instead: if a vocab is provided, do not generate a new one in the dataset code (see the sketch below).
  2. Yes. In terms of the number of sentences, YouCook2 is about 1/4 the size of ANet, so the vocab is much smaller. The vocab is defined as words with >= 5 appearances.
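A rough sketch of that change (get_vocab, vocab_file, and sentences are hypothetical names, not existing options in the repo): only build a new vocab from the current train+val sentences when no pre-built vocab is supplied, and keep the >= 5 frequency cutoff when building one.

import os
import pickle
from collections import Counter

def get_vocab(sentences, vocab_file=None, min_freq=5):
    if vocab_file and os.path.exists(vocab_file):
        # reuse the vocab the checkpoint was trained with
        with open(vocab_file, "rb") as f:
            return pickle.load(f)
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    # special tokens plus words appearing at least min_freq times
    itos = ["<pad>", "<unk>", "<sos>", "<eos>"] + \
           sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate(itos)}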