huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0

[FEATURE] `features_only` method for ViT networks #2131

Closed ioangatop closed 4 months ago

ioangatop commented 5 months ago

Hi! I'm trying to use a ViT backbone, for example vit_small_patch16_224, with transformers.Mask2FormerModel, like dinov2.

However, this is not possible, as a RuntimeError is raised:

RuntimeError: features_only not implemented for Vision Transformer models.

As per this comment, there is now, like in dinov2, a concept of get_intermediate_layers in ViT networks. I was wondering if we could just create an alias (WIP) method features_only which basically uses get_intermediate_layers and allows us to use the backbone with Mask2Former-like segmentation models.
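
Roughly the kind of dinov2-style usage I have in mind (a sketch only; the exact get_intermediate_layers signature shown here is a guess):

    import timm
    import torch

    # sketch: get_intermediate_layers arguments are assumed, mirroring the dinov2-style API
    model = timm.create_model("vit_small_patch16_224", pretrained=False)
    x = torch.randn(1, 3, 224, 224)

    # grab the outputs of the last 4 blocks, reshaped to (B, C, H, W), so they
    # could feed a Mask2Former-style pixel decoder
    feats = model.get_intermediate_layers(x, n=4, reshape=True, norm=True)
    for f in feats:
        print(f.shape)  # expected: torch.Size([1, 384, 14, 14]) for vit_small_patch16_224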

Thank you!

rwightman commented 5 months ago

@ioangatop check out the PR linked above, exploring this idea...

rwightman commented 4 months ago

So this turned into a thing; #2136 finally finishes this one. It shifted a bit away from get_intermediate_layers but uses the underpinning idea. The current design allows output of the main features as well, so they can still be fed through pooling and classifier if desired. Ran tests with beit, vit, vit_sam, eva, mvitv2, twins, deit and they all appear to converge on object detection training in the first epoch to .152 - .2 mAP, so it's working decently.
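
Rough usage sketch of the two surfaces this adds (names and argument spellings approximate; see the PR for the final API):

    import timm
    import torch

    x = torch.randn(1, 3, 224, 224)

    # features_only wrapper, same interface as the CNN backbones; out_indices
    # selects which transformer blocks to tap
    feat_model = timm.create_model(
        "vit_small_patch16_224",
        pretrained=False,
        features_only=True,
        out_indices=(-3, -2, -1),
    )
    for f in feat_model(x):
        print(f.shape)

    # forward_intermediates() on the plain model: returns the final features as
    # well, so pooling / classifier can still be applied afterwards
    model = timm.create_model("vit_small_patch16_224", pretrained=False)
    final, intermediates = model.forward_intermediates(x, indices=(-2, -1))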

rwightman commented 4 months ago

If any comments on the design, please speak now. Will probably merge in next few days if I'm happy after a few more fixups...

ioangatop commented 4 months ago

Hi @rwightman and thanks for the quick response and PR! Unfortunately I'm on vacation right now so I can't look at it in detail, but skimming through it, it looks great. So please feel free to merge it

I'll give it a try in one week and I'll let you know 👌

Thanks again 🙏

rwightman commented 4 months ago

@ioangatop merged, enjoy the rest of vacay. Seems to be working well. Thanks for the nudge, this has been something I've been meaning to tackle for some time now. Will try to find some time to eventually expand to the last remaining nets (cait, xcit, volo, etc.).

ioangatop commented 4 months ago

Hi @rwightman thanks again for the PR 🎉

So I'm trying to use the ViT backbone as follows:

import transformers

model = transformers.Mask2FormerModel(
    config=transformers.Mask2FormerConfig(
        use_timm_backbone=True,
        backbone="vit_small_patch16_224",
    ),
)

but got the error:

AttributeError: 'FeatureGetterNet' object has no attribute 'return_layers'

Basically it comes from this line in transformers: https://github.com/huggingface/transformers/blob/c15aad0939e691d2ffdbac7ae71921b51fe04e3f/src/transformers/models/timm_backbone/modeling_timm_backbone.py#L80-L83

I think it would be great if it could work like that if possible - not sure if it's better to update the transformers lib or this one to make it work

Also curious to see how you initialised the object detection training

rwightman commented 4 months ago

@ioangatop hrmm, didn't realize they (transformers folk) were directly modifying that attribute. It's not relevant for the intermediate-layers-based approach, and it'd break FX- or hooks-based feature extraction too.

EDIT: I thought I might be able to hack something in but after a closer look it probably makes sense to change the code in transformers. I believe the transformers timm backbone adapter should have a different codepath for feature extraction based on this forward_intermediates() approach.
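
Roughly the kind of branch I mean (illustrative only, not the actual transformers TimmBackbone code):

    def extract_backbone_features(backbone, pixel_values):
        # hypothetical adapter logic: prefer the new intermediates API when the
        # timm model exposes it, fall back to the classic features_only wrapper
        # (which carries return_layers / feature_info) otherwise
        if hasattr(backbone, "forward_intermediates"):
            return backbone.forward_intermediates(pixel_values, intermediates_only=True)
        return backbone(pixel_values)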

ioangatop commented 4 months ago

@rwightman sounds good, I'm thinking the same. I'll open an issue there, let's see if they are also on the same page - thanks 🙏

ioangatop commented 4 months ago

Ran tests with beit, vit, vit_sam, eva, mvitv2, twins, deit and they all appear to converge on object detection training in first epoch to .152 - .2 mAP, so working decent.

@rwightman what library did you use? If you could provide a snippet it would be great!

rwightman commented 4 months ago

@ioangatop I used my efficientdet library https://github.com/rwightman/efficientdet-pytorch, it's a fairly minimal but self-contained reproduction of EfficientDet that works with the COCO dataset and is familiar (to me). It only supports object detection though, so it wouldn't cover MaskFormer-like functionality.

Example config I added to https://github.com/rwightman/efficientdet-pytorch/blob/master/effdet/config/model_config.py

    vit=dict(
        name='vit',
        backbone_name='samvit_base_patch16',
        #backbone_name='eva02_base_patch14_448',
        #backbone_name='vit_medium_patch16_gap_256',
        image_size=(512, 512),
        backbone_indices=(-2,),
        fpn_channels=88,
        fpn_cell_repeats=4,
        box_class_repeats=3,
        pad_type='',
        act_type='gelu',
        redundant_bias=False,
        separable_conv=False,
        downsample_type='bilinear',
        upsample_type='bilinear',
        min_level=4,
        max_level=6,
        backbone_args=dict(drop_path_rate=0.2, img_size=512, patch_size=16),
        url='',
    ),
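
And roughly how that config entry gets instantiated (a sketch; train.py in the repo is the real entry point and the exact factory kwargs may differ):

    from effdet import create_model

    # 'vit' refers to the config entry above; bench_task='train' wraps the
    # detector in the training bench used by the repo's train script
    model = create_model('vit', bench_task='train', num_classes=80)  # 80 = COCO classes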