microsoft / UniVL

An official implementation for "UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation"
https://arxiv.org/abs/2002.06353
MIT License

caption my own video with provided pretrained model #10

Closed: dawnlh closed this issue 3 years ago

dawnlh commented 3 years ago

Hi, thanks for the wonderful work. I want to caption my own videos given only the video frames (without transcripts). Can I directly use the pretrained weight (univl.pretrained.bin) provided in the repository for this task? I evaluated the pretrained weight univl.pretrained.bin directly on MSRVTT with the following code:

DATATYPE="msrvtt"
TRAIN_CSV="data/msrvtt/MSRVTT_train.9k.csv"
VAL_CSV="data/msrvtt/MSRVTT_JSFUSION_test.csv"
DATA_PATH="data/msrvtt/MSRVTT_data.json"
FEATURES_PATH="data/msrvtt/msrvtt_videos_features.pickle"
INIT_MODEL="weight/univl.pretrained.bin"
OUTPUT_ROOT="ckpts"

python -m torch.distributed.launch --nproc_per_node=1 \
main_task_caption.py \
--do_eval --num_thread_reader=4 \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}

but got very low metric values:

BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
 METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045

I'm new to this field, so I would appreciate it a lot if you could provide some suggestions, instructions, or code on using the provided pretrained model for video captioning in real-world cases. (Perhaps the main points lie in the pretrained model, feature extraction, and result visualization?)

ArrowLuo commented 3 years ago

Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.

dawnlh commented 3 years ago

> Hi @dawnlh, would you provide your log.txt here? I cannot locate the problem from the command alone.

Thanks a lot! Here is the log file:

2021-05-25 11:15:57,643:INFO: Effective parameters:
2021-05-25 11:15:57,644:INFO:   <<< batch_size: 256
2021-05-25 11:15:57,644:INFO:   <<< batch_size_val: 32
2021-05-25 11:15:57,644:INFO:   <<< bert_model: bert-base-uncased
2021-05-25 11:15:57,644:INFO:   <<< cache_dir: 
2021-05-25 11:15:57,644:INFO:   <<< coef_lr: 0.1
2021-05-25 11:15:57,644:INFO:   <<< cross_model: cross-base
2021-05-25 11:15:57,644:INFO:   <<< cross_num_hidden_layers: 2
2021-05-25 11:15:57,644:INFO:   <<< data_path: data/msrvtt/MSRVTT_data.json
2021-05-25 11:15:57,644:INFO:   <<< datatype: msrvtt
2021-05-25 11:15:57,644:INFO:   <<< decoder_model: decoder-base
2021-05-25 11:15:57,644:INFO:   <<< decoder_num_hidden_layers: 3
2021-05-25 11:15:57,644:INFO:   <<< do_eval: True
2021-05-25 11:15:57,644:INFO:   <<< do_lower_case: True
2021-05-25 11:15:57,644:INFO:   <<< do_pretrain: False
2021-05-25 11:15:57,644:INFO:   <<< do_train: False
2021-05-25 11:15:57,644:INFO:   <<< epochs: 20
2021-05-25 11:15:57,644:INFO:   <<< feature_framerate: 1
2021-05-25 11:15:57,644:INFO:   <<< features_path: data/msrvtt/msrvtt_videos_features.pickle
2021-05-25 11:15:57,644:INFO:   <<< fp16: False
2021-05-25 11:15:57,644:INFO:   <<< fp16_opt_level: O1
2021-05-25 11:15:57,644:INFO:   <<< gradient_accumulation_steps: 1
2021-05-25 11:15:57,644:INFO:   <<< hard_negative_rate: 0.5
2021-05-25 11:15:57,644:INFO:   <<< init_model: weight/univl.pretrained.bin
2021-05-25 11:15:57,644:INFO:   <<< local_rank: 0
2021-05-25 11:15:57,644:INFO:   <<< lr: 0.0001
2021-05-25 11:15:57,644:INFO:   <<< lr_decay: 0.9
2021-05-25 11:15:57,644:INFO:   <<< margin: 0.1
2021-05-25 11:15:57,644:INFO:   <<< max_frames: 100
2021-05-25 11:15:57,644:INFO:   <<< max_words: 20
2021-05-25 11:15:57,644:INFO:   <<< min_time: 5.0
2021-05-25 11:15:57,645:INFO:   <<< n_display: 100
2021-05-25 11:15:57,645:INFO:   <<< n_gpu: 1
2021-05-25 11:15:57,645:INFO:   <<< n_pair: 1
2021-05-25 11:15:57,645:INFO:   <<< negative_weighting: 1
2021-05-25 11:15:57,645:INFO:   <<< num_thread_reader: 4
2021-05-25 11:15:57,645:INFO:   <<< output_dir: ckpts/ckpt_msrvtt_caption
2021-05-25 11:15:57,645:INFO:   <<< sampled_use_mil: False
2021-05-25 11:15:57,645:INFO:   <<< seed: 42
2021-05-25 11:15:57,645:INFO:   <<< stage_two: True
2021-05-25 11:15:57,645:INFO:   <<< task_type: caption
2021-05-25 11:15:57,645:INFO:   <<< text_num_hidden_layers: 12
2021-05-25 11:15:57,645:INFO:   <<< train_csv: data/youcookii_singlef_train.csv
2021-05-25 11:15:57,645:INFO:   <<< use_mil: False
2021-05-25 11:15:57,645:INFO:   <<< val_csv: data/msrvtt/MSRVTT_JSFUSION_test.csv
2021-05-25 11:15:57,645:INFO:   <<< video_dim: 1024
2021-05-25 11:15:57,645:INFO:   <<< visual_model: visual-base
2021-05-25 11:15:57,645:INFO:   <<< visual_num_hidden_layers: 6
2021-05-25 11:15:57,645:INFO:   <<< warmup_proportion: 0.1
2021-05-25 11:15:57,645:INFO:   <<< world_size: 1
2021-05-25 11:15:57,646:INFO: device: cuda:0 n_gpu: 1
2021-05-25 11:15:57,646:INFO: loading vocabulary file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased/vocab.txt
2021-05-25 11:15:58,017:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/bert-base-uncased
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/visual-base
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 1,
  "type_vocab_size": 2,
  "vocab_size": 1024
}

2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/visual-base/visual_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/cross-base
2021-05-25 11:15:58,018:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 1024,
  "num_attention_heads": 12,
  "num_hidden_layers": 2,
  "type_vocab_size": 2,
  "vocab_size": 768
}

2021-05-25 11:15:58,018:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/cross-base/cross_pytorch_model.bin
2021-05-25 11:15:58,018:INFO: loading archive file /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base
2021-05-25 11:15:58,019:INFO: Model config {
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_target_embeddings": 512,
  "num_attention_heads": 12,
  "num_decoder_layers": 1,
  "num_hidden_layers": 12,
  "type_vocab_size": 2,
  "vocab_size": 30522
}

2021-05-25 11:15:58,019:INFO: Weight doesn't exsits. /data2/zzh/project/SCI_caption/UniVL/modules/decoder-base/decoder_pytorch_model.bin
2021-05-25 11:15:58,019:WARNING: Stage-One:False, Stage-Two:True
2021-05-25 11:15:58,019:WARNING: Set bert_config.num_hidden_layers: 12.
2021-05-25 11:15:59,122:WARNING: Set visual_config.num_hidden_layers: 6.
2021-05-25 11:15:59,591:WARNING: Set cross_config.num_hidden_layers: 2.
2021-05-25 11:15:59,763:WARNING: Set decoder_config.num_decoder_layers: 3.
2021-05-25 11:16:02,843:INFO: --------------------
2021-05-25 11:16:02,843:INFO: Weights from pretrained model not used in UniVL: 
   cls.predictions.bias
   cls.predictions.transform.dense.weight
   cls.predictions.transform.dense.bias
   cls.predictions.transform.LayerNorm.weight
   cls.predictions.transform.LayerNorm.bias
   cls.predictions.decoder.weight
   cls_visual.predictions.weight
   cls_visual.predictions.bias
   cls_visual.predictions.transform.dense.weight
   cls_visual.predictions.transform.dense.bias
   cls_visual.predictions.transform.LayerNorm.weight
   cls_visual.predictions.transform.LayerNorm.bias
   similarity_pooler.dense.weight
   similarity_pooler.dense.bias
2021-05-25 11:16:10,136:INFO: ***** Running test *****
2021-05-25 11:16:10,136:INFO:   Num examples = 2990
2021-05-25 11:16:10,136:INFO:   Batch size = 32
2021-05-25 11:16:10,136:INFO:   Num steps = 94
2021-05-25 11:23:31,867:INFO: >>>  BLEU_1: 0.1410, BLEU_2: 0.0450, BLEU_3: 0.0142, BLEU_4: 0.0052
2021-05-25 11:23:31,877:INFO: >>>  METEOR: 0.0684, ROUGE_L: 0.1229, CIDEr: 0.0045
ArrowLuo commented 3 years ago

Hi @dawnlh, I suppose you evaluated the pretrained weight directly (zero-shot) instead of finetuning it. You should finetune with --do_train first.
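
For reference, a finetuning run would look roughly like the evaluation command above with --do_train and the training CSV added (reusing the variables defined there). The hyperparameter values below are illustrative rather than confirmed recommendations; please follow the README for the exact settings:

# Sketch only: finetune on MSR-VTT captioning before evaluating.
# Hyperparameter values here are illustrative; see the README for the
# recommended settings and number of GPUs.
python -m torch.distributed.launch --nproc_per_node=4 \
main_task_caption.py \
--do_train --num_thread_reader=4 \
--epochs=5 --batch_size=128 --n_display=100 \
--train_csv ${TRAIN_CSV} \
--val_csv ${VAL_CSV} \
--data_path ${DATA_PATH} \
--features_path ${FEATURES_PATH} \
--output_dir ${OUTPUT_ROOT}/ckpt_msrvtt_caption --bert_model bert-base-uncased \
--do_lower_case --lr 3e-5 --coef_lr 0.1 \
--batch_size_val 32 --visual_num_hidden_layers 6 \
--decoder_num_hidden_layers 3 --datatype ${DATATYPE} --stage_two \
--init_model ${INIT_MODEL}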

dawnlh commented 3 years ago

> Hi @dawnlh, I suppose you evaluated the pretrained weight directly (zero-shot) instead of finetuning it. You should finetune with --do_train first.

Yes, I evaluated the pretrained weight (zero-shot) directly. I tried to finetune the model, but failed due to limited GPU memory (even with batch_size set to 1). Can you give an estimate of how much GPU memory is needed to finetune the model? Or would it be convenient for you to share the finetuned weights for the captioning task (no transcript)?

ArrowLuo commented 3 years ago

Hi @dawnlh. We finetuned the model with 4 Tesla V100 GPUs. I am sorry that we cannot provide the finetuned weights.
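
If memory is the blocker, the usual levers are all exposed as command-line arguments (each appears in the parameter dump above); the values below are illustrative, not settings validated by the authors, and may not be enough for every GPU:

# Possible memory-reducing changes to the finetuning command. All of these
# flags appear in the "Effective parameters" dump above; the values are
# illustrative, not settings validated by the authors.
--batch_size 8 --gradient_accumulation_steps 16   # smaller per-step batch, similar effective batch
--fp16 --fp16_opt_level O1                        # mixed precision (requires NVIDIA apex)
--max_frames 48 --max_words 20                    # shorter video/text sequences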

dawnlh commented 3 years ago

Okay, thanks anyway~ I'll try to work around the GPU memory limitation. Another question: could you provide some instructions or code on using the finetuned model for video captioning on self-captured videos? I mean the input video processing (how to extract the same features as the training set to serve as the model input) and output visualization.

ArrowLuo commented 3 years ago

More information about the feature extractor can be found in the README. The caption results are saved in --output_dir.
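
For self-captured videos, the extracted features then need to be packed into the same format as --features_path. A minimal sketch, assuming (as the MSR-VTT pickle suggests) a dict mapping each video id to a float32 array of shape (num_segments, 1024), matching video_dim in the log; the file and directory names are hypothetical:

# Hypothetical sketch: bundle per-video features (video_dim=1024 per the log above)
# into a single pickle usable as --features_path. Paths and names are illustrative.
python - <<'EOF'
import glob, os, pickle
import numpy as np

features = {}
# Assumption: one .npy file per video, shaped (num_segments, 1024),
# produced by the feature extractor referenced in the README.
for path in glob.glob("my_video_features/*.npy"):
    video_id = os.path.splitext(os.path.basename(path))[0]
    features[video_id] = np.load(path).astype("float32")

with open("my_videos_features.pickle", "wb") as f:
    pickle.dump(features, f)
EOF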

dawnlh commented 3 years ago

> More information about the feature extractor can be found in the README. The caption results are saved in --output_dir.

Got it! Thanks a lot for your patient replies.