hila-chefer / Transformer-MM-Explainability

[ICCV 2021 Oral] Official PyTorch implementation for Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers, a novel method to visualize any Transformer-based network. Including examples for DETR, VQA.

Question about Vit #6

scott870430 closed this issue 2 years ago

scott870430 commented 3 years ago

Thanks for the great work. I want to apply your ViT work (both the CVPR 2021 and the ICCV 2021 methods) to vit_base_patch16_384 instead of the base 224 model, because I think it will produce a better relevancy map. Can I simply change the config here to the 384 x 384 configuration and download the pre-trained weights for the 384 version, or do I need to make other changes?

Thank you in advance for your help.

hila-chefer commented 3 years ago

Hi @scott870430, thanks for your interest in our work! Yes, I think a few configuration modifications should make it work; there's no reason why it shouldn't. You may also need to change some additional code to accommodate the new attention-map shapes and the bilinear interpolation, but it should definitely work fine :)
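A minimal sketch of the kind of change this refers to, assuming the visualization follows the same reshape-and-interpolate pattern as the repo's ViT notebook; the helper name upsample_relevance and its arguments are illustrative, not part of the repo:

import torch

def upsample_relevance(transformer_attribution, img_size=384, patch_size=16):
    # per-patch relevance has one entry per image token, i.e. (img_size / patch_size) ** 2 = 24 * 24 for 384/16
    grid = img_size // patch_size
    rel = transformer_attribution.reshape(1, 1, grid, grid)
    # upsample each patch back to pixel resolution with bilinear interpolation
    rel = torch.nn.functional.interpolate(rel, scale_factor=patch_size, mode='bilinear')
    rel = rel.reshape(img_size, img_size)
    # min-max normalize to [0, 1] for visualization
    return (rel - rel.min()) / (rel.max() - rel.min())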

scott870430 commented 3 years ago

Thank you for your help.

vit_base_patch16_224 needs _conv_filter. Where can I check which models need _conv_filter? I'm referring to timm, but I can't tell which models need it... This is my current config:

def vit_base_patch16_384(pretrained=False, **kwargs):
    model = VisionTransformer(
        img_size=384,  # VisionTransformer defaults to img_size=224, so 384 must be set explicitly
        patch_size=16, embed_dim=768, depth=12, num_heads=12, mlp_ratio=4, qkv_bias=True, **kwargs)
    model.default_cfg = default_cfgs['vit_base_patch16_384']
    if pretrained:
        load_pretrained(model, num_classes=model.num_classes, in_chans=kwargs.get('in_chans', 3))
    return model
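For context on _conv_filter: in the timm version this code is based on, it reshapes the pretrained patch-embedding weight from the flattened linear-projection layout to the Conv2d layout, roughly as sketched below. Whether a given factory passes filter_fn=_conv_filter to load_pretrained depends on how that model's checkpoint was exported, so the most reliable check is the corresponding factory function in timm's vision_transformer.py for your timm version.

def _conv_filter(state_dict, patch_size=16):
    # Roughly what timm's _conv_filter does (check your timm version):
    # reshape patch_embed.proj.weight from (embed_dim, 3 * patch_size ** 2)
    # to the Conv2d shape (embed_dim, 3, patch_size, patch_size).
    out_dict = {}
    for k, v in state_dict.items():
        if 'patch_embed.proj.weight' in k:
            v = v.reshape((v.shape[0], 3, patch_size, patch_size))
        out_dict[k] = v
    return out_dict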

I also want to make sure of one thing: can weights fine-tuned with the ViT_LRP architecture also be loaded into ViT_new? I think it should work, since both implementations only read the attention maps from the model and don't modify the forward pass?

Thanks!

hila-chefer commented 3 years ago

Hi @scott870430! We used the implementation from timm, so as long as you follow the code and the config there, it should be equivalent to what we did. Does this help?
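Regarding the ViT_LRP / ViT_new weight question, a hedged sanity check (not part of the repo) is to load the fine-tuned state dict with strict=False and inspect the reported key mismatches; the import path and checkpoint filename below are assumptions based on the repo layout, so adjust them to your setup:

import torch
from baselines.ViT.ViT_new import vit_base_patch16_224 as vit_new

new_model = vit_new(pretrained=False)
state_dict = torch.load('vit_lrp_finetuned.pth', map_location='cpu')  # hypothetical checkpoint path

# strict=False reports mismatches instead of raising; empty lists mean the
# parameter names used by ViT_LRP and ViT_new line up exactly.
missing, unexpected = new_model.load_state_dict(state_dict, strict=False)
print('missing keys:', missing)
print('unexpected keys:', unexpected)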

scott870430 commented 3 years ago

Hi @hila-chefer! I have a question about the gtsegs_ijcv.mat dataset. Following your command, I can reproduce the LRP results. In your code, the class with the highest probability is selected for each image, but I would like to know the category of each image. However, I can't reach the official website of the dataset... Is there any way to find the image categories and the training/validation split of the dataset?

Thank you in advance for your help.

hila-chefer commented 2 years ago

Hi @scott870430 :) This is the link to the official download and yes, for some reason the explanations about this dataset have unfortunately been removed from their site. I haven't looked into these details since my code for the segmentation tests is adapted from other repositories, but I think it only contains the original images and the ground-truth segmentations. The train/val/test distinction is not too critical here, since the model was trained for classification and all explainability methods benefit from having the model predict the correct class. I'm sorry I don't have a more informative answer, does this help?
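Since the .mat file only ships images and ground-truth masks, one hedged workaround for recovering a per-image category (not the repo's evaluation script; predict_categories is an illustrative helper) is to record the classifier's top-1 ImageNet prediction while iterating over the segmentation loader:

import torch

@torch.no_grad()
def predict_categories(model, data_loader, device='cuda'):
    # Return the top-1 predicted ImageNet class index for every image in the loader,
    # assuming the loader yields (image, ground-truth mask) pairs.
    model.eval().to(device)
    predictions = []
    for images, _masks in data_loader:
        logits = model(images.to(device))
        predictions.extend(logits.argmax(dim=1).cpu().tolist())
    return predictions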

hila-chefer commented 2 years ago

@scott870430 closing due to inactivity, but feel free to reopen if you have additional questions.