Transformer-MM-Explainability
=============================


[ICCV 2021 Oral] Official PyTorch implementation of `Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers <https://arxiv.org/abs/2103.15679>`_, a novel method to visualize any Transformer-based network, with examples for DETR and VQA.

|youtube|

.. |youtube| image:: https://img.shields.io/static/v1?label=ICCV2021&message=12MinuteVideo&color=red
   :target: https://www.youtube.com/watch?v=bQTL34Dln-M

Notebooks for LXMERT + DETR:

|DETR_LXMERT|

.. |DETR_LXMERT| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/Transformer_MM_Explainability.ipynb

Notebook for CLIP:

|CLIP|

.. |CLIP| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/CLIP_explainability.ipynb

Demo: You can check out a demo on `Hugging Face Spaces <https://huggingface.co/spaces/PaulHilders/CLIPGroundingExplainability>`_ or scan the following QR code.

.. image:: https://user-images.githubusercontent.com/19412343/176676771-d26f2146-9901-49e7-99be-b030f3d790de.png
   :width: 100

Notebook for ViT:

|ViT|

.. |ViT| image:: https://colab.research.google.com/assets/colab-badge.svg
   :target: https://colab.research.google.com/github/hila-chefer/Transformer-MM-Explainability/blob/main/Transformer_MM_explainability_ViT.ipynb

.. sectnum::

Using Colab
-----------

Running Examples
----------------

Note that there are two Jupyter notebooks for running the examples presented in the paper.
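
If you prefer to run the notebooks locally instead of on Colab, a minimal setup sketch follows. It assumes a standard ``requirements.txt`` at the repository root; if your environment differs, install PyTorch and Jupyter manually.

.. code-block:: bash

   # Clone the repository and install its dependencies (assumes requirements.txt exists).
   git clone https://github.com/hila-chefer/Transformer-MM-Explainability.git
   cd Transformer-MM-Explainability
   pip install -r requirements.txt

   # Launch Jupyter and open one of the example notebooks.
   jupyter notebook Transformer_MM_Explainability.ipynb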

Reproduction of results
-----------------------

VisualBERT
^^^^^^^^^^

Run the ``run.py`` script as follows:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd` python VisualBERT/run.py --method=<method_name> --is-text-pert=<true/false> --is-positive-pert=<true/false> --num-samples=10000 config=projects/visual_bert/configs/vqa2/defaults.yaml model=visual_bert dataset=vqa2 run_type=val checkpoint.resume_zoo=visual_bert.finetuned.vqa2.from_coco_train env.data_dir=/path/to/data_dir training.num_workers=0 training.batch_size=1 training.trainer=mmf_pert training.seed=1234

.. note::

   If the datasets aren't already in ``env.data_dir``, then the script will download the data automatically to the path in ``env.data_dir``.
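
For reference, a filled-in invocation might look like the sketch below. ``example_method`` is a hypothetical placeholder (substitute one of the method names supported by ``VisualBERT/run.py``), and the flag values shown correspond to a negative perturbation run on the image modality.

.. code-block:: bash

   # Hypothetical example: negative perturbation on the image modality.
   # "example_method" is a placeholder -- replace it with a method name supported by run.py.
   CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd` python VisualBERT/run.py \
       --method=example_method --is-text-pert=false --is-positive-pert=false \
       --num-samples=10000 config=projects/visual_bert/configs/vqa2/defaults.yaml \
       model=visual_bert dataset=vqa2 run_type=val \
       checkpoint.resume_zoo=visual_bert.finetuned.vqa2.from_coco_train \
       env.data_dir=/path/to/data_dir training.num_workers=0 training.batch_size=1 \
       training.trainer=mmf_pert training.seed=1234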

LXMERT
^^^^^^

1. Download `valid.json <https://nlp.cs.unc.edu/data/lxmert_data/vqa/valid.json>`_:

.. code-block:: bash

  pushd data/vqa
  wget https://nlp.cs.unc.edu/data/lxmert_data/vqa/valid.json
  popd

2. Download the COCO_val2014 set to your local machine (a download sketch is given at the end of this section).

.. note::

  If you already downloaded `COCO_val2014` for the `VisualBERT`_ tests, you can simply use the same path you used for `VisualBERT`_.

3. Run the ``perturbation.py`` script as follows:

.. code-block:: bash

  CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd` python lxmert/lxmert/perturbation.py  --COCO_path /path/to/COCO_val2014 --method <method_name> --is-text-pert <true/false> --is-positive-pert <true/false>
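
As referenced in step 2, a minimal download sketch for ``COCO_val2014`` follows. It assumes the standard COCO download mirror; pass the resulting image directory (or its parent, depending on how ``perturbation.py`` resolves image paths) as ``--COCO_path``.

.. code-block:: bash

   # Download and unpack the COCO val2014 images (~6 GB) from the official COCO mirror.
   wget http://images.cocodataset.org/zips/val2014.zip
   unzip val2014.zip -d /path/to/COCO_val2014   # creates /path/to/COCO_val2014/val2014/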

DETR
^^^^

1. Download the COCO dataset as described in the `DETR repository <https://github.com/facebookresearch/detr#data-preparation>`_. Notice you only need the validation set (a download sketch is given after the command below).

2. Lower the IoU minimum threshold from 0.5 to 0.2 using the following steps:

3. Run the segmentation experiment using the following command:

.. code-block:: bash

   CUDA_VISIBLE_DEVICES=0 PYTHONPATH=`pwd`  python DETR/main.py --coco_path /path/to/coco/dataset  --eval --masks --resume https://dl.fbaipublicfiles.com/detr/detr-r50-e632da11.pth --batch_size 1 --method <method_name>
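
As referenced in step 1, a minimal sketch for fetching just the validation data follows. It assumes the standard COCO 2017 download links and the directory layout DETR expects (``val2017/`` and ``annotations/`` under a single root):

.. code-block:: bash

   # Fetch the COCO 2017 validation images and annotations from the official COCO mirrors
   # and unpack them into the single root directory passed as --coco_path.
   mkdir -p /path/to/coco && cd /path/to/coco
   wget http://images.cocodataset.org/zips/val2017.zip
   wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip
   unzip val2017.zip                     # creates val2017/
   unzip annotations_trainval2017.zip    # creates annotations/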

Citing
------

If you make use of our work, please cite our paper:

.. code-block:: latex

   @InProceedings{Chefer_2021_ICCV,
      author    = {Chefer, Hila and Gur, Shir and Wolf, Lior},
      title     = {Generic Attention-Model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers},
      booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
      month     = {October},
      year      = {2021},
      pages     = {397-406}
   }

Credits
-------