MERLOT: Multimodal Neural Script Knowledge Models
MERLOT (NeurIPS 2021) is a model for learning what we are calling "neural script knowledge" -- representations about what is going on in videos, spanning multiple video frames with associated captions.
Visit our project page at rowanzellers.com/merlot, or read the full paper to learn more.
We are releasing the following:
We plan to release:
This is ongoing -- we hope to make it easier to adapt MERLOT to other tasks. Please follow if interested!
There are two different ways of running MERLOT right now: pretraining from scratch, and finetuning from our released checkpoint.
conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas
# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5
pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0
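After installing, it can help to confirm that the pinned versions are the ones actually importable. The following is a minimal sketch and not part of the MERLOT repository: the helper names `find_version_mismatches` and `collect_installed` are hypothetical, and the check relies only on the `__version__` attribute that both `tensorflow` and `numpy` expose.

```python
import importlib

# Versions pinned by the install commands above.
PINNED = {"tensorflow": "1.15.5", "numpy": "1.17.0"}


def find_version_mismatches(installed, pinned=PINNED):
    """Return {package: (installed, expected)} for every pin that is not met.

    `installed` maps package names to version strings; a package that is
    absent from the mapping is reported with an installed version of None.
    """
    mismatches = {}
    for name, expected in pinned.items():
        found = installed.get(name)
        if found != expected:
            mismatches[name] = (found, expected)
    return mismatches


def collect_installed(names):
    """Import each package and read its __version__, skipping missing ones."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # not installed; the caller will see it as missing
        versions[name] = getattr(module, "__version__", None)
    return versions
```

Inside the `merlot` environment, `find_version_mismatches(collect_installed(PINNED))` should return an empty dict when both pins are satisfied.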
This requires a large TPU pod for data-parallelism.
From the model directory, run python train.py configs/merlot.yaml
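The launch step can also be scripted. The sketch below assumes, as the instruction above does, that `train.py` and `configs/merlot.yaml` live under the `model` directory. The function names and the `repo_root` parameter are hypothetical and not part of the MERLOT codebase.

```python
import subprocess
import sys
from pathlib import Path


def build_train_command(config="configs/merlot.yaml", python=sys.executable):
    """Return the argv list equivalent to `python train.py <config>`."""
    return [python, "train.py", config]


def launch_pretraining(repo_root=".", config="configs/merlot.yaml"):
    """Run train.py from inside the model directory and return its exit code."""
    model_dir = Path(repo_root) / "model"
    config_path = model_dir / config
    # Fail early with a clear message instead of letting train.py crash.
    if not config_path.is_file():
        raise FileNotFoundError(f"Config not found: {config_path}")
    completed = subprocess.run(build_train_command(config), cwd=str(model_dir))
    return completed.returncode
```

Checking for the config file first keeps a mistyped path from surfacing only after a TPU job has been scheduled.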
You can download our checkpoint using download_checkpoint.py. There are two options -- we used a checkpoint with 4 frame-caption segments for general-purpose pretraining, and then we trained it for longer (using 5 frame-caption segments) to adapt it to the story ordering task.
We suggest using the 4-segment checkpoint because that's what we used for all of our finetuning experiments. This corresponds to the configuration model/merlot.yaml.
Use the MerlotModel in model/modeling.py, set up your finetuning task (usually involving an additional output layer), and finetune.
@inproceedings{zellersluhessel2021merlot,
title={MERLOT: Multimodal Neural Script Knowledge Models},
author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
booktitle={Advances in Neural Information Processing Systems 34},
year={2021}
}