Thank you for your interest in our project, and sorry about the late reply.
Please check my answers to your questions below:
- For the cross-modal transformer initialized with 6 layers of the pretrained RoBERTa, why not use all 12 layers? As far as I know, most current models are based on the pretrained BERT-base or RoBERTa-base model.
We designed the model with 6 layers to reduce GPU memory consumption during pre-training. You can modify our architecture to use 12 layers in your experiments.
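As a rough sketch (not our actual loading code; `num_hidden_layers` is just the standard HuggingFace config field, and the copy loop here is only for illustration), a 6-layer encoder can be initialized from the first 6 layers of roberta-base like this:

```python
# Hypothetical sketch: build a 6-layer RoBERTa encoder and copy in the first
# 6 pretrained layers. Not the exact initialization code used in this repo.
from transformers import RobertaConfig, RobertaModel

pretrained = RobertaModel.from_pretrained("roberta-base")            # 12 layers
config = RobertaConfig.from_pretrained("roberta-base", num_hidden_layers=6)
model = RobertaModel(config)                                         # 6 layers, randomly initialized

# Reuse the pretrained embeddings and the first 6 encoder layers.
model.embeddings.load_state_dict(pretrained.embeddings.state_dict())
for i in range(config.num_hidden_layers):
    model.encoder.layer[i].load_state_dict(pretrained.encoder.layer[i].state_dict())
```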
- Why did you use the order of [img, txt] instead of [txt, img]?
We designed it as [img, txt] to make it easier to gather the img representations used as Temporal Transformer inputs. Besides, Transformers are non-directional, meaning that the order of the input tokens does not matter.
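As a minimal sketch (not our exact implementation; the function name and shapes are only illustrative), with the [img, txt] ordering the contextualized image representations are simply the first `n_img` positions of the cross-modal transformer output, so a plain slice suffices before the Temporal Transformer:

```python
# Hypothetical sketch: gather image representations by slicing, which works
# because the image tokens come first in the [img, txt] sequence.
import torch

def gather_img_reps(cross_out: torch.Tensor, n_img: int) -> torch.Tensor:
    # cross_out: (batch, n_img + n_txt, hidden) cross-modal transformer output
    return cross_out[:, :n_img, :]                 # (batch, n_img, hidden)

# Toy example: 2 clips, 5 frames each, 20 subtitle tokens, hidden size 768.
cross_out = torch.randn(2, 5 + 20, 768)
img_reps = gather_img_reps(cross_out, n_img=5)     # inputs to the Temporal Transformer
```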
- Why did you pretrain the MLM task on the cross-modal transformer and MFM on the temporal transformer?
As we only use contextualized image representations as inputs to the temporal transformer, MLM cannot be directly applied there. Therefore, we apply MLM to the cross-modal transformer.
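For illustration only (not our exact objective; the shapes are simplified, the function name is made up, and the projection head back to the visual-feature dimension is omitted), an MFM-style regression loss on the temporal transformer outputs could look like the sketch below; note there are simply no word tokens at this stage to apply MLM to.

```python
# Hypothetical sketch: masked frame modelling as feature regression at the
# masked positions of the temporal transformer output.
import torch
import torch.nn.functional as F

def mfm_regression_loss(temporal_out: torch.Tensor,
                        frame_feats: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    # temporal_out: (batch, n_frames, hidden) temporal transformer outputs
    # frame_feats:  (batch, n_frames, hidden) original (unmasked) frame features
    # mask:         (batch, n_frames) bool, True at masked frame positions
    return F.mse_loss(temporal_out[mask], frame_feats[mask])

# Toy usage:
temporal_out = torch.randn(2, 5, 768)
frame_feats = torch.randn(2, 5, 768)
mask = torch.zeros(2, 5, dtype=torch.bool)
mask[:, 1] = True                        # pretend frame 1 was masked in each clip
loss = mfm_regression_loss(temporal_out, frame_feats, mask)
```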
- Since the alignment data is hard to collect, how much influence would it have if local alignment were removed?
We did not perform ablation studies on the local alignment part. My conjecture is that it depends on the downstream task you want to apply the model to. For the two VCMR tasks considered in HERO, local alignment learned during pre-training is beneficial for the moment retrieval sub-task in finetuning, and also for the question localization sub-task in the two videoQA tasks.
Let me know if you have any further questions.
Closed due to inactivity.
Hi linjie,
Another question:
I am interested in the FOM task. In your code, "binary_targets" does not seem to be used in this work?
Thanks, Jinming
@linjieli222
Hi, thank you for your released code. “HERO” is a very interesting and amazing work. I have some questions about the model and hope for your response.
Thank you very much and looking forward to your reply!