maximek3 / e-ViL



This repository contains the e-SNLI-VE dataset, the HTML files for the e-ViL human evaluation framework, and the e-UG model from our ICCV 2021 paper:

e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks (ICCV 2021).


The train, dev, and test splits are in the data folder. The .csv files reference images by their Flickr30k image IDs. Flickr30k can be downloaded here.
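
As a minimal sketch of reading one of the split files, the snippet below loads a .csv into a list of row dictionaries with Python's standard `csv` module. The column names are not listed in this README, so the helper (`load_split`, a name introduced here for illustration) returns whatever header it finds rather than assuming one.

```python
import csv
from pathlib import Path


def load_split(csv_path):
    """Read a split .csv and return (column_names, rows).

    Each row is a dict keyed by the column names found in the header line.
    No particular column name is assumed.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        columns = list(reader.fieldnames or [])
    return columns, rows
```

Calling `load_split` on a file in the data folder and printing the returned column names is a quick way to see which column holds the Flickr30k image ID.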

e-ViL MTurk Questionnaires

The e-ViL_MTurk folder contains the MTurk questionnaires for e-SNLI-VE, VQA-X, and VCR. These HTML files can be uploaded to the Amazon Mechanical Turk platform to run a crowd-sourced human evaluation.


e-UG uses UNITER as the vision-language model and GPT-2 to generate explanations. The UNITER implementation is based on the code of the Transformers-VQA repo, and the GPT-2 implementation is based on Marasovic et al. 2020.

The entry point for training and testing the models is in


The environment file is in eUG.yml.

Create the environment by running conda env create -f eUG.yml.

COCOcaption package for automatic NLG metrics

To run the automatic NLG evaluation, download the package from this Google Drive link and place it in the root directory of this project.

Downloading the data


e-SNLI-VE

  1. Run this script to download the Faster-RCNN features for Flickr30k and store them in data/esnlive/img_db/flickr30k/.
  2. Download the .json files, ready to be used with e-UG, from this Google Drive link and store them in data/esnlive/.


VQA-X

  1. Download the Faster-RCNN features for MS COCO train2014 (17 GB) and val2014 (8 GB) images:

    wget -P data/fasterRCNN_features
    unzip data/img/ -d data/fasterRCNN_features && rm data/fasterRCNN_features/
    wget -P data/fasterRCNN_features
    unzip data/fasterRCNN_features/ -d data && rm data/fasterRCNN_features/
    wget -P data/fasterRCNN_features
    unzip data/fasterRCNN_features/ -d data && rm data/fasterRCNN_features/
  2. Download the VQA-X dataset from this Google Drive link and store the splits in data/vqax/.


VCR

  1. Download the Faster R-CNN features using this script and store them in data/vcr/vcr_{split}/.
  2. Download the VCR .json files from this Google Drive link and store them in data/vcr/.
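
Once the downloads are finished, a quick check of the directory layout can catch misplaced files before training. The following is a sketch only: the directory names are taken from the steps above, while the VCR split names and the helper `missing_dirs` are assumptions introduced for illustration.

```python
from pathlib import Path

# Assumption: the text only gives the pattern data/vcr/vcr_{split}/ and does
# not list the split names, so these three are illustrative.
ASSUMED_VCR_SPLITS = ("train", "val", "test")

# Directories named in the download steps above.
EXPECTED_DIRS = [
    "data/esnlive",
    "data/esnlive/img_db/flickr30k",
    "data/fasterRCNN_features",
    "data/vqax",
    "data/vcr",
] + [f"data/vcr/vcr_{split}" for split in ASSUMED_VCR_SPLITS]


def missing_dirs(root="."):
    """Return the expected directories that do not exist under `root`."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    for d in missing_dirs():
        print(f"missing: {d}")
```

Running it from the project root prints one line per missing directory and nothing if the layout matches.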

Pre-trained weights

Download the general pre-trained UNITER-base using this link. The pre-trained UNITER-base for VCR is available from this link. We use the general pre-trained model for VQA-X and e-SNLI-VE, and the VCR pre-trained one for VCR.
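
The pairing of tasks and pre-trained weights described above can be written down as a small lookup. This is an illustrative sketch, not code from the repository: `esnlive` is the task name used in the example commands of this README, whereas `vqax` and `vcr` are assumed spellings, and the labels `"general"` and `"vcr"` simply name the two UNITER-base checkpoints.

```python
# Which pre-trained UNITER-base checkpoint each task starts from.
# "esnlive" appears in this README's example commands; "vqax" and "vcr"
# are assumed task identifiers used here for illustration only.
PRETRAINED_FOR_TASK = {
    "esnlive": "general",
    "vqax": "general",
    "vcr": "vcr",
}


def pretrained_variant(task):
    """Return "general" or "vcr" for a known task, else raise ValueError."""
    try:
        return PRETRAINED_FOR_TASK[task]
    except KeyError:
        known = ", ".join(sorted(PRETRAINED_FOR_TASK))
        raise ValueError(f"unknown task {task!r}; expected one of: {known}") from None
```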


Training

Check the command line arguments in

Here is an example to train the model on e-SNLI-VE:

python --task esnlive --train data/esnlive/esnlive_train.json --val data/esnlive/esnlive_dev.json --save_steps 5000 --output experiments/esnlive_run1/train

The model weights, Tensorboard logs, and a text log will be saved in the given output directory.
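
When launching several runs, it can help to generate the argument list instead of typing it. The sketch below reproduces the arguments of the example training command; the helper `build_train_args` is introduced here for illustration, and the assumption that other tasks follow the same `data/{task}/{task}_train.json` naming is not confirmed by this README. The name of the entry-point script is deliberately left out.

```python
import shlex


def build_train_args(run_name, task="esnlive", save_steps=5000):
    """Return the training arguments from the example above as a list.

    Assumption: the data files follow the data/{task}/{task}_{split}.json
    pattern seen in the e-SNLI-VE example.
    """
    data_dir = f"data/{task}"
    return [
        "--task", task,
        "--train", f"{data_dir}/{task}_train.json",
        "--val", f"{data_dir}/{task}_dev.json",
        "--save_steps", str(save_steps),
        "--output", f"experiments/{run_name}/train",
    ]


if __name__ == "__main__":
    # Prints the arguments as a shell-safe string to append to the entry point.
    print(shlex.join(build_train_args("esnlive_run1")))
```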


Testing

Check the command line arguments in

Here is an example to test a trained model on the e-SNLI-VE test set:

python --task esnlive --test data/esnlive/esnlive_test.json --load_trained experiments/esnlive_run1/train/best_global.pth --output experiments/esnlive_run1/eval 

All generated explanations, automatic NLG scores, and a text log will be saved in the given output directory.
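
Before starting an evaluation, it is worth confirming that the checkpoint from training exists. The sketch below does that and then assembles the arguments of the example test command. `build_test_args` is a hypothetical helper, the `train`/`eval` subdirectory layout and the `best_global.pth` file name are taken from the examples above, and the naming of test files for other tasks is an assumption.

```python
from pathlib import Path


def build_test_args(run_dir, task="esnlive", checkpoint_name="best_global.pth"):
    """Return the test arguments for a finished run, as a list.

    Raises FileNotFoundError if <run_dir>/train/<checkpoint_name> is missing.
    """
    run_dir = Path(run_dir)
    checkpoint = run_dir / "train" / checkpoint_name
    if not checkpoint.is_file():
        raise FileNotFoundError(f"no checkpoint at {checkpoint}")
    return [
        "--task", task,
        # Assumption: test files follow data/{task}/{task}_test.json.
        "--test", f"data/{task}/{task}_test.json",
        "--load_trained", checkpoint.as_posix(),
        "--output", (run_dir / "eval").as_posix(),
    ]
```

For the run from the training example, `build_test_args("experiments/esnlive_run1")` yields the arguments shown in the test command above, provided the checkpoint file is present.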


If you use e-SNLI-VE, e-UG, or the e-ViL benchmark in your work, please cite our paper:

    author    = {Kayser, Maxime and Camburu, Oana-Maria and Salewski, Leonard and Emde, Cornelius and Do, Virginie and Akata, Zeynep and Lukasiewicz, Thomas},
    title     = {E-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1244-1254}