@AmbiTyga that adds a significant amount of non-trivial code to the base model for a fairly specific feature. Considering that there are now vit/deit, pit, tnt, swin, soon cait, and others as well, it's not a scalable or maintainable approach.
If someone came up with a flexible hook-based wrapper/adapter approach that could support each of the vision transformers here without major additions to the base model (just some metadata), I'd accept that.
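
A minimal sketch of what such a hook-based extractor might look like, assuming only that the model keeps its transformer blocks in an `nn.ModuleList` attribute (as timm's `VisionTransformer` does with `model.blocks`); the class name and the `block_attr` parameter are hypothetical, and per-model metadata could supply the attribute name for other architectures:

```python
import timm
import torch


class BlockActivationExtractor:
    """Collect per-block outputs from a vision transformer via forward hooks,
    without modifying the model class itself."""

    def __init__(self, model, block_attr='blocks'):
        self.model = model
        self.activations = {}
        self._handles = []
        # Assumption: the model stores its transformer blocks in an
        # nn.ModuleList attribute (true for timm's VisionTransformer).
        for i, block in enumerate(getattr(model, block_attr)):
            handle = block.register_forward_hook(self._make_hook(f'{block_attr}.{i}'))
            self._handles.append(handle)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            self.activations[name] = output
        return hook

    def __call__(self, x):
        self.activations.clear()
        out = self.model(x)
        return out, self.activations

    def remove(self):
        # Detach all hooks so the wrapped model is left untouched.
        for h in self._handles:
            h.remove()


# Usage: grab token activations from every block of a pretrained ViT.
model = timm.create_model('vit_base_patch16_224', pretrained=True).eval()
extractor = BlockActivationExtractor(model)
with torch.no_grad():
    logits, acts = extractor(torch.randn(1, 3, 224, 224))
print({k: tuple(v.shape) for k, v in acts.items()})
```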
I should also add that I do have plans to add feature extraction for the vit networks, as I have for the convnets, so that activations of internal transformer blocks can be extracted. It isn't at the top of my priority list right now.
I am working on a utility method as well as a module that could cover this for all image models (even non-attention-based ones); please allow me to create a PR for this.
Recently the Facebook Research team developed a method called DINO. As I was going through the repository, I found that it includes a way to visualize what the network attends to (similar to Grad-CAM). To implement this we need to add some methods to the VisionTransformer class in timm.models.vision_transformer. I would like you to allow me to make these changes. For reference, see https://github.com/facebookresearch/dino/blob/main/vision_transformer.py. Methods to append from this file (a sketch of the first one follows the list):
interpolate_pos_encoding
forward_selfattention
forward_return_n_last_blocks
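
For readers following along, here is a rough sketch (adapted from the logic in the DINO repository, not copied verbatim) of what the first of these methods does: it bicubically resizes the learned patch position embeddings so a ViT trained at one resolution can accept inputs at another. The standalone function signature and the square-patch-grid assumption are simplifications for illustration:

```python
import math
import torch
import torch.nn.functional as F


def interpolate_pos_encoding(pos_embed, npatch, w, h, patch_size):
    """Resize learned position embeddings to match a new image size.

    pos_embed:  (1, 1 + N, dim) tensor; index 0 holds the class token slot.
    npatch:     number of patches in the incoming image, (w//ps) * (h//ps).
    w, h:       input image width and height in pixels.
    patch_size: side length of a square patch in pixels.
    """
    N = pos_embed.shape[1] - 1
    if npatch == N and w == h:
        return pos_embed  # already at the training resolution
    class_pos = pos_embed[:, :1]
    patch_pos = pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    w0, h0 = w // patch_size, h // patch_size
    side = int(math.sqrt(N))  # assumes the original patch grid was square
    # Reshape to a 2D grid, resize bicubically, then flatten back.
    patch_pos = F.interpolate(
        patch_pos.reshape(1, side, side, dim).permute(0, 3, 1, 2),
        size=(h0, w0),
        mode='bicubic',
        align_corners=False,
    )
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat((class_pos, patch_pos), dim=1)
```

The other two methods follow the same pattern of reusing the existing blocks: one returns the attention map of the final block for visualization, the other returns the outputs of the last n blocks; see the linked file for the exact implementations.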