linjieli222 / HERO

Research code for EMNLP 2020 paper "HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training"
https://arxiv.org/abs/2005.00200
MIT License

Some questions about model details #22

Closed JinmingZhao closed 3 years ago

JinmingZhao commented 3 years ago

Hi, thank you for the released code. HERO is a very interesting and impressive work. I have some questions about the model and hope for your response.

  1. For the cross-modal transformer initialized from 6 layers of the pretrained RoBERTa, why not use all 12 layers? As far as I know, most current models are based on the pretrained BERT-base or RoBERTa-base model.
  2. Why did you use the order [img, txt] instead of [txt, img]?
  3. Why did you pre-train the MLM task on the cross-modal transformer and MFM on the temporal transformer?
  4. Since alignment data is hard to collect, how much influence would it have if the local alignment is removed?

Thank you very much and looking forward to your reply!

linjieli222 commented 3 years ago

Thank you for your interest in our project and sorry about the late reply.

Please check my answers to your questions below:

  1. For the cross-modal transformer initialized from 6 layers of the pretrained RoBERTa, why not use all 12 layers? As far as I know, most current models are based on the pretrained BERT-base or RoBERTa-base model.

We designed the model to be 6-layer to reduce GPU memory consumption during pre-training. You can modify our architecture to 12 layers for your experiments.
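
For illustration, here is a minimal sketch with HuggingFace `transformers` (not the exact initialization code in this repo) of building a 6-layer encoder from the first 6 layers of `roberta-base`:

```python
from transformers import RobertaConfig, RobertaModel

# Sketch only: build a 6-layer RoBERTa and copy over the embeddings
# plus the first 6 encoder layers of the pretrained 12-layer model.
full = RobertaModel.from_pretrained("roberta-base")
cfg6 = RobertaConfig.from_pretrained("roberta-base", num_hidden_layers=6)
small = RobertaModel(cfg6)

small.embeddings.load_state_dict(full.embeddings.state_dict())
for i in range(6):
    small.encoder.layer[i].load_state_dict(full.encoder.layer[i].state_dict())
```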

  2. Why did you use the order [img, txt] instead of [txt, img]?

We designed it to be [img, txt] to make it easier to gather the image representations used as Temporal Transformer inputs. Besides, Transformers are known to be non-directional, meaning that the order of the input tokens does not matter.
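
A toy example of what "easier to gather" means (illustrative tensor names, not the repo's actual code): with [img, txt] ordering the frame representations are a contiguous prefix slice.

```python
import torch

batch, n_img, n_txt, dim = 2, 8, 20, 768
cross_out = torch.randn(batch, n_img + n_txt, dim)  # cross-modal transformer output

# [img, txt]: the frame embeddings for the Temporal Transformer are a plain slice.
frame_repr = cross_out[:, :n_img]  # shape (batch, n_img, dim)

# With [txt, img], frames would start at a different offset per example
# whenever text lengths vary, requiring an index_select/gather with
# per-example indices instead of a simple slice.
```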

  3. Why did you pre-train the MLM task on the cross-modal transformer and MFM on the temporal transformer?

As we only use the contextualized image representations as inputs to the Temporal Transformer, MLM cannot be directly applied there. Therefore, we apply MLM to the Cross-modal Transformer.
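
As a rough sketch of why (names here are hypothetical, not the repo's modules): word positions only exist in the cross-modal transformer's output, so the masked-word prediction head has to attach there, while the temporal transformer only ever sees frame embeddings.

```python
import torch
import torch.nn as nn

vocab_size, dim, n_img, n_txt = 50265, 768, 8, 20
cross_out = torch.randn(2, n_img + n_txt, dim)  # [img, txt] ordering

mlm_head = nn.Linear(dim, vocab_size)
txt_out = cross_out[:, n_img:]      # word positions -> MLM logits computed here
mlm_logits = mlm_head(txt_out)

frame_out = cross_out[:, :n_img]    # only these feed the temporal transformer,
                                    # where frame-level objectives (MFM, FOM) apply
```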

  4. Since alignment data is hard to collect, how much influence would it have if the local alignment is removed?

We did not perform ablation studies on the local alignment part. My conjecture is that it depends on the downstream task you want to apply the model to. For the two VCMR tasks considered in HERO, the local alignment learned during pre-training is beneficial for the moment retrieval sub-task in fine-tuning, and also for the question localization sub-task in the two video QA tasks.

Let me know if you have any further questions.

linjieli222 commented 3 years ago

Closed due to inactivity.

JinmingZhao commented 3 years ago

Hi Linjie,

Another question:

I am interested in the FOM task. In your code, "binary_targets" does not seem to be used in this work?

Thanks, Jinming

JinmingZhao commented 3 years ago

@linjieli222