asafarevich closed this issue 5 years ago
@Anton-Velikodnyy can you post your full command here?
python scripts/make_captions.py \
--cfgs_file ./cfgs/yc2.yml \
--densecap_eval_file ./tool/densevid_eval/evaluate.py \
--batch_size 1 \
--start_from ./checkpoint/yc2-2L-e2e-mask/model_epoch_19.t7 \
--n_layer=2 \
--d_model 1024 \
--d_hidden 2048 \
--id yc2-2L-e2e-mask-19 \
--stride_factor 50 \
--in_emb_dropout 0.1 \
--attn_dropout 0.2 \
--vis_emb_dropout 0.1 \
--cap_dropout 0.2 \
--val_data_folder validation \
--learn_mask \
--gated_mask \
--cuda | tee ./data/log/eval-yc2-2L-e2e-mask-19
yc2 config file
# dataset specific settings for yc2
dataset: "yc2"
dataset_file: "video_feature/plastering.json"
feature_root: "video_feature"
densecap_references: ["./data/yc2/val_yc2.json"]
dur_file: "./data/yc2/yc2_duration_frame.csv"
kernel_list: [1, 3, 5, 7, 9, 11, 15, 21, 27, 33, 41, 49, 57, 71, 111, 161]
I did not see a reason to change the densecap_references or the dur_file.
video_feature/plastering.json
(rearranged for readability)
{"database":
{"plastering":
{"duration": "53.987267",
"subset": "validation",
"annotations": [ ],
"video_url": "/app/data/plastering.avi"}}}
Actually, now that I am reviewing this, should I be using the masked transformer model instead of the end-to-end one? If so, what's the difference between the two?
If you're working with a new dataset while using the model pre-trained on yc2, you will need to revise the code to encode/numericalize your sentences with the yc2 vocab, see here. The words in your dataset should also appear in yc2, otherwise they will be converted to "UNK". If you decide to augment the yc2 vocab with your own vocabulary, you need to fine-tune the model on your data.
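For what it's worth, here is a minimal plain-Python sketch of what "numericalize with the yc2 vocab" means; the names (numericalize, yc2_stoi) and the toy vocabulary are made up for illustration, and the repo actually does this through its torchtext Field (text_proc):

# Illustrative only: map caption tokens to indices using an existing vocab,
# sending any out-of-vocabulary word to the <unk> index (the "UNK" mentioned above).
def numericalize(sentence, stoi, unk_token='<unk>'):
    unk_idx = stoi[unk_token]
    return [stoi.get(tok, unk_idx) for tok in sentence.lower().split()]

# toy stand-in for the pre-trained yc2 vocabulary
yc2_stoi = {'<unk>': 0, '<pad>': 1, 'add': 2, 'the': 3, 'plaster': 4, 'to': 5, 'wall': 6}

print(numericalize('add the plaster to the wall', yc2_stoi))  # [2, 3, 4, 5, 3, 6]
print(numericalize('mix the cement thoroughly', yc2_stoi))    # [0, 3, 0, 0] -> three UNKs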
I get the following error
I traced it back to the vocab size of text_proc being 4 for some reason (to be more specific, I'm not sure why the items in the dataset determine the shape of the model); it happens when I give it an empty list of annotations. Adding some annotations increases the vocab size. Should I add the missing 1007 (1011 - 4) vocab words to the annotations, or is there a way to bypass the 1011 limitation? Is the vocab size really only 1011 different words?
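In case it helps anyone finding this later, here is a rough sketch (not the repo's code) of why the vocab collapses to 4 with an empty annotation list, assuming a torchtext-style Field like text_proc whose specials are <unk>, <pad>, <init> and <eos> (the exact special tokens may differ):

from collections import Counter

SPECIALS = ['<unk>', '<pad>', '<init>', '<eos>']  # assumed special tokens

def build_vocab(caption_sentences):
    # vocab = special tokens + every distinct word seen in the annotations
    counter = Counter(tok for sent in caption_sentences for tok in sent.lower().split())
    return SPECIALS + sorted(counter)

print(len(build_vocab([])))                      # 4 -> empty annotations: only the specials
print(len(build_vocab(['spread the plaster'])))  # 7 -> grows with each new word

Since the embedding and output layers are sized from this vocab, a 4-token vocab cannot load a checkpoint trained with the 1011-word yc2 vocab, which is presumably why the reply above suggests numericalizing with the yc2 vocab instead of rebuilding it from your own annotations.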