hila-chefer / Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Includes examples for DETR and VQA.

COCO 2014 or 2017? #7

Closed. anguyen8 closed this issue 3 years ago.

anguyen8 commented 3 years ago

Dear @hila-chefer ,

Thank you for releasing this repo of your fascinating work!!

Would you mind clarifying these two questions about your results for me? :)

  1. Were you doing Detection or Segmentation? (i.e., evaluated on bounding boxes or polygons?) I see these two terms used interchangeably in Table 1 and Fig. 6.
  2. Were your COCO results evaluated on COCO 2014 or COCO 2017? (In the README I see "COCO_val2014", but coco.py reads "val2017.json".) I could not find this detail in the paper.

Thank you so much!

Anh

hila-chefer commented 3 years ago

Hi @anguyen8, thanks for your interest in our work!

  1. There’s a fine detail here: we use a model that is trained for object detection only, i.e., the model outputs only bounding boxes. The point is to show that the explanation our method produces for each bounding box can serve as a segmentation map. We apply our method to the model to produce segmentation maps, and then evaluate those maps on the segmentation task. This is what we refer to as “weakly supervised object segmentation”, since the model was trained for detection and our method + the model together provide a segmentation (see the sketch after this list).
  2. We used COCO 2014 for VisualBERT, following the MMF library, and COCO 2017 for DETR, following DETR’s official repo.
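To make that evaluation concrete, here is a minimal sketch (not the repo's actual code) of how a per-box relevance map could be binarized into a segmentation mask and scored against COCO ground truth with pycocotools. The threshold, function names, and file paths are illustrative assumptions.

```python
# Sketch: turn per-box relevance maps into binary masks and score them
# against COCO instance masks. Only the pycocotools calls are real API;
# everything else (names, threshold, paths) is illustrative.
import numpy as np
from pycocotools.coco import COCO


def relevance_to_mask(relevance, threshold=0.5):
    """Min-max normalize a relevance map and binarize it at `threshold`."""
    rel = relevance.astype(np.float64)
    rel = (rel - rel.min()) / (rel.max() - rel.min() + 1e-8)
    return rel > threshold


def mask_iou(pred, gt):
    """Intersection-over-union between two binary masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + 1e-8)


def score_image(coco, image_id, relevance_maps):
    """Best-IoU match between each ground-truth mask and the predicted masks."""
    anns = coco.loadAnns(coco.getAnnIds(imgIds=[image_id]))
    gt_masks = [coco.annToMask(a).astype(bool) for a in anns]
    pred_masks = [relevance_to_mask(r) for r in relevance_maps]
    return [max((mask_iou(p, g) for p in pred_masks), default=0.0) for g in gt_masks]


# Usage (paths/ids illustrative): COCO 2017 annotations, matching DETR's setup.
# coco = COCO("annotations/instances_val2017.json")
# ious = score_image(coco, image_id=397133, relevance_maps=my_relevance_maps)
```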

I hope this helps but if not, please feel free to ask for clarifications!

anguyen8 commented 3 years ago

Thank you @hila-chefer very much for your reply!

I see. So Table 1 and Fig. 6 actually show the results of combining DETR with a weakly-supervised segmentation method, i.e., DETR + GradCAM, DETR + LRP, DETR + your method (as opposed to using, e.g., GradCAM alone to perform segmentation).

hila-chefer commented 3 years ago

@anguyen8 happy to help! Exactly as you said: we run different explainability algorithms on a detection model, use the resulting maps as segmentation maps, and evaluate them on the segmentation task. Also, the first column of Table 1 shows the results the detection model itself achieves; since we use explainability, our success is bounded by the success of the model we’re explaining.
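For the detection side of that pipeline, here is a hedged sketch of loading a pretrained DETR (detection-only) model via its public torch.hub entry point and keeping its confident queries; each kept query is a bounding box for which a relevance map would then be produced. The image path and the 0.9 confidence threshold are illustrative choices, not necessarily what the paper used.

```python
# Sketch: run pretrained DETR and keep confident queries (boxes only, no masks).
import torch
import torchvision.transforms as T
from PIL import Image

# DETR's public torch.hub entry point.
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("example.jpg").convert("RGB")  # illustrative path
with torch.no_grad():
    outputs = model(transform(img).unsqueeze(0))

# Drop the trailing "no object" class and keep queries above a score threshold.
probs = outputs["pred_logits"].softmax(-1)[0, :, :-1]
keep = probs.max(-1).values > 0.9
boxes = outputs["pred_boxes"][0, keep]  # (cx, cy, w, h), normalized to [0, 1]
```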

anguyen8 commented 3 years ago

You're amazing! Thank you for the kind clarification.