huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0

[FEATURE] Visualize gradient maps for attention based network #607

Closed: AmbiTyga closed this issue 1 year ago

AmbiTyga commented 3 years ago

Recently the Facebook Research team released a method called DINO. While going through the repository, I found that it provides a way to visualize what the network attends to (similar to Grad-CAM). To implement this, we would need to add some methods to the VisionTransformer class in timm.models.vision_transformer. I would like your permission to make these changes. For reference, see: https://github.com/facebookresearch/dino/blob/main/vision_transformer.py

Methods to append from this file:
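The gist of such methods is to expose the self-attention weights of the final transformer block so they can be visualized per patch (the Grad-CAM-like maps mentioned above). Below is a minimal sketch of that computation, written as a free function against a timm-style VisionTransformer rather than as a class method; the attribute layout it assumes (patch_embed, cls_token, pos_embed, pos_drop, blocks, and an Attention module with qkv, num_heads and scale) matches timm's vision_transformer.py of that era, and the function name is purely illustrative:

```python
import torch
import timm

def last_block_attention(model, x):
    """Illustrative sketch: recompute the final block's self-attention map
    for a timm VisionTransformer. Assumes the standard (older) timm layout:
    patch_embed -> cls_token/pos_embed -> blocks, and an Attention module
    exposing .qkv, .num_heads and .scale."""
    # Embed patches and prepend the class token, as in forward_features
    x = model.patch_embed(x)
    cls_token = model.cls_token.expand(x.shape[0], -1, -1)
    x = torch.cat((cls_token, x), dim=1)
    x = model.pos_drop(x + model.pos_embed)

    # Run all but the last block normally
    for blk in model.blocks[:-1]:
        x = blk(x)

    # Recompute the attention matrix inside the last block instead of calling it
    blk = model.blocks[-1]
    attn_mod = blk.attn
    y = blk.norm1(x)
    B, N, C = y.shape
    qkv = attn_mod.qkv(y).reshape(B, N, 3, attn_mod.num_heads, C // attn_mod.num_heads)
    qkv = qkv.permute(2, 0, 3, 1, 4)            # 3 x B x heads x N x head_dim
    q, k, _ = qkv[0], qkv[1], qkv[2]
    attn = (q @ k.transpose(-2, -1)) * attn_mod.scale
    return attn.softmax(dim=-1)                 # B x heads x N x N

model = timm.create_model('vit_base_patch16_224', pretrained=False).eval()
with torch.no_grad():
    attn = last_block_attention(model, torch.randn(1, 3, 224, 224))
print(attn.shape)  # e.g. torch.Size([1, 12, 197, 197])
```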

rwightman commented 3 years ago

@AmbiTyga that adds a significant amount of non-trivial code to the base model for a fairly specific feature. Considering that there are now ViT/DeiT, PiT, TNT, Swin, soon CaiT, and others as well, it's not a scalable or maintainable approach.

If someone came up with a flexible hook-based wrapper/adapter approach that could support each of the vision transformers here without major additions to the base models (just some metadata), I'd accept that.
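A rough sketch of what such a hook-based wrapper might look like: the wrapper only needs per-model metadata naming the submodules that see the post-softmax attention matrix (the `attn_drop` dropout inside each attention block in older timm ViTs, which is an assumption about module naming rather than an established contract):

```python
import torch

class AttentionExtractor(torch.nn.Module):
    """Illustrative wrapper: collect attention maps from a transformer via
    forward hooks, given only the names of the modules whose input is the
    post-softmax attention matrix."""

    def __init__(self, model, attn_module_names):
        super().__init__()
        self.model = model
        self.attention_maps = {}
        modules = dict(model.named_modules())
        for name in attn_module_names:
            modules[name].register_forward_hook(self._make_hook(name))

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # For a Dropout applied to the attention matrix, inputs[0] is B x heads x N x N
            self.attention_maps[name] = inputs[0].detach()
        return hook

    def forward(self, x):
        self.attention_maps.clear()
        out = self.model(x)
        return out, self.attention_maps


# Hypothetical usage with a timm ViT; module names depend on architecture and timm version:
# model = timm.create_model('vit_base_patch16_224', pretrained=True)
# wrapper = AttentionExtractor(model, [f'blocks.{i}.attn.attn_drop' for i in range(12)])
# logits, attn_maps = wrapper(torch.randn(1, 3, 224, 224))
```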

rwightman commented 3 years ago

I should also add that I do have plans to add feature extraction for the ViT networks, like I have for the convnets, so that activations of internal transformer blocks can be extracted. It isn't at the top of my priority list right now.
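For context, the convnet feature extraction referred to here is timm's existing `features_only` interface, which returns intermediate stage activations instead of logits; the plan is an analogous mechanism for transformer blocks. A short usage example of the existing convnet API:

```python
import torch
import timm

# Existing convnet behaviour: request intermediate feature maps instead of logits
model = timm.create_model('resnet50', pretrained=False, features_only=True,
                          out_indices=(1, 2, 3, 4))
print(model.feature_info.channels())   # channel count of each returned stage
print(model.feature_info.reduction())  # stride of each stage relative to the input

features = model(torch.randn(1, 3, 224, 224))
for f in features:
    print(f.shape)  # one tensor per requested stage, e.g. [1, 256, 56, 56] ... [1, 2048, 7, 7]
```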

AmbiTyga commented 3 years ago

I am working on a utility method as well as a module that could cover this for every image model (even non-attention-based ones); please allow me to create a PR for this.
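No code is attached to the thread, but the general shape of a model-agnostic gradient-map utility is a Grad-CAM-style pass built from a forward hook and a backward hook on a caller-chosen layer. A minimal sketch under that assumption (the target layer is expected to produce a B x C x H x W feature map, as in a convnet; token-based models would need an extra reshape):

```python
import torch

def grad_cam(model, x, target_layer, class_idx=None):
    """Minimal Grad-CAM-style sketch: weight a layer's activations by the
    gradient of the chosen class score and sum over channels."""
    acts, grads = {}, {}

    def fwd_hook(module, inputs, output):
        acts['value'] = output

    def bwd_hook(module, grad_input, grad_output):
        grads['value'] = grad_output[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        logits = model(x)
        if class_idx is None:
            class_idx = logits.argmax(dim=1)   # tensor of shape [B]
        score = logits.gather(1, class_idx.view(-1, 1)).sum()
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    # Global-average-pool gradients to per-channel weights, then weight activations
    weights = grads['value'].mean(dim=(2, 3), keepdim=True)  # B x C x 1 x 1
    cam = (weights * acts['value']).sum(dim=1)               # B x H x W
    return torch.relu(cam)

# Hypothetical usage with a timm convnet; the target layer choice is up to the caller:
# model = timm.create_model('resnet50', pretrained=True).eval()
# cam = grad_cam(model, torch.randn(1, 3, 224, 224), model.layer4)
```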