rowanz / merlot

MERLOT: Multimodal Neural Script Knowledge Models
MIT License
223 stars 25 forks source link

merlot

MERLOT: Multimodal Neural Script Knowledge Models

MERLOT (NeurIPS 2021) is a model for learning what we are calling "neural script knowledge" -- representations about what is going on in videos, spanning multiple video frames with associated captions.

Visit our project page at rowanzellers.com/merlot, or read the full paper to learn more.

teaser

What's here

We are releasing the following:

We plan to release:

This is somewhat ongoing -- we hope to make it somewhat easier to adapt MERLOT to other tasks, please follow if interested!

Enviroment and setup

There are two different ways of running MERLOT right now

conda create --name merlot python=3.7 && conda activate merlot
conda install -y python=3.7 tqdm numpy pyyaml scipy ipython cython typing h5py pandas

# If running on GPU
pip install tensorflow-gpu==1.15.5
# If running on TPU
pip install tensorflow==1.15.5

pip install --upgrade google-api-python-client oauth2client boto3 cloud-tpu-profiler regex opencv-python-headless Pillow seaborn
pip install numpy==1.17.0

Pretraining from scratch

This requires a large TPU pod for data-parallelism.

Finetuning on downstream tasks

Bibtex

@inproceedings{zellersluhessel2021merlot,
  title={MERLOT: Multimodal Neural Script Knowledge Models},
  author={Zellers, Rowan and Lu, Ximing and Hessel, Jack and Yu, Youngjae and Park, Jae Sung and Cao, Jize and Farhadi, Ali and Choi, Yejin},
  booktitle={Advances in Neural Information Processing Systems 34},
  year={2021}
}