@AmbiTyga that adds a significant amount of non-trivial code to the base model for a fairly specific feature. Considering that there are now vit/deit, pit, tnt, swin, soon cait, and others as well, it's not a scalable or maintainable approach.
If someone came up with a flexible hook-based wrapper/adapter approach that could support each of the vision transformers here without major additions to the base model (just some metadata), I'd accept that.
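
A minimal sketch of what such a hook-based extractor might look like, assuming only that the model keeps its transformer blocks in an `nn.ModuleList` attribute (as timm's `VisionTransformer` does with `model.blocks`); the class name and the `block_attr` parameter are hypothetical, and per-model metadata could supply the attribute name for other architectures:

```python
import timm
import torch


class BlockActivationExtractor:
    """Collect per-block outputs from a vision transformer via forward hooks,
    without modifying the model class itself."""

    def __init__(self, model, block_attr='blocks'):
        self.model = model
        self.activations = {}
        self._handles = []
        # Assumption: the model stores its transformer blocks in an
        # nn.ModuleList attribute (true for timm's VisionTransformer).
        for i, block in enumerate(getattr(model, block_attr)):
            handle = block.register_forward_hook(self._make_hook(f'{block_attr}.{i}'))
            self._handles.append(handle)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            self.activations[name] = output
        return hook

    def __call__(self, x):
        self.activations.clear()
        out = self.model(x)
        return out, self.activations

    def remove(self):
        # Detach all hooks so the wrapped model is left untouched.
        for h in self._handles:
            h.remove()


# Usage: grab token activations from every block of a pretrained ViT.
model = timm.create_model('vit_base_patch16_224', pretrained=True).eval()
extractor = BlockActivationExtractor(model)
with torch.no_grad():
    logits, acts = extractor(torch.randn(1, 3, 224, 224))
print({k: tuple(v.shape) for k, v in acts.items()})
```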
I should also add that I do have plans to add feature extraction for the vit networks, as I have for the convnets, so that activations of internal transformer blocks can be extracted. It isn't at the top of my priority list right now.
I am working on a utility method as well as a module that could cover this for all image models (even non-attention-based ones); please allow me to create a PR for this.
Recently the Facebook Research team developed a method called DINO. As I was going through the repository, I found that it includes a way to visualize what the network attends to (similar to Grad-CAM). To implement this we need to add some methods to the VisionTransformer class in timm.models.vision_transformer. I would like you to allow me to make these changes. For reference, see https://github.com/facebookresearch/dino/blob/main/vision_transformer.py. Methods to append from this file (a sketch of the first one follows the list):
interpolate_pos_encoding
forward_selfattention
forward_return_n_last_blocks
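
For readers following along, here is a rough sketch (adapted from the logic in the DINO repository, not copied verbatim) of what the first of these methods does: it bicubically resizes the learned patch position embeddings so a ViT trained at one resolution can accept inputs at another. The standalone function signature and the square-patch-grid assumption are simplifications for illustration:

```python
import math
import torch
import torch.nn.functional as F


def interpolate_pos_encoding(pos_embed, npatch, w, h, patch_size):
    """Resize learned position embeddings to match a new image size.

    pos_embed:  (1, 1 + N, dim) tensor; index 0 holds the class token slot.
    npatch:     number of patches in the incoming image, (w//ps) * (h//ps).
    w, h:       input image width and height in pixels.
    patch_size: side length of a square patch in pixels.
    """
    N = pos_embed.shape[1] - 1
    if npatch == N and w == h:
        return pos_embed  # already at the training resolution
    class_pos = pos_embed[:, :1]
    patch_pos = pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    w0, h0 = w // patch_size, h // patch_size
    side = int(math.sqrt(N))  # assumes the original patch grid was square
    # Reshape to a 2D grid, resize bicubically, then flatten back.
    patch_pos = F.interpolate(
        patch_pos.reshape(1, side, side, dim).permute(0, 3, 1, 2),
        size=(h0, w0),
        mode='bicubic',
        align_corners=False,
    )
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat((class_pos, patch_pos), dim=1)
```

The other two methods follow the same pattern of reusing the existing blocks: one returns the attention map of the final block for visualization, the other returns the outputs of the last n blocks; see the linked file for the exact implementations.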