huggingface / pytorch-image-models

The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXt, EfficientNet, NFNet, Vision Transformer (ViT), MobileNetV4, MobileNet-V3 & V2, RegNet, DPN, CSPNet, Swin Transformer, MaxViT, CoAtNet, ConvNeXt, and more
https://huggingface.co/docs/timm
Apache License 2.0
31.66k stars 4.71k forks

[FEATURE] Adding a column to CSVs to sort by architecture type #1162

Closed · rasbt closed this issue 1 year ago

rasbt commented 2 years ago

For certain contexts, it is interesting to look at only convolution-based models (and/or compare to attention-based architectures). To facilitate this, there could be an optional column in the CSV files.

E.g., this column could be named "Model-family" and have values such as convolution-based, attention-based, hybrid
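A minimal sketch of what that could look like, assuming a pandas workflow over the results CSVs. The model names, accuracy numbers, column name, and family assignments below are illustrative only, not part of the actual timm CSVs:

```python
import pandas as pd

# Hypothetical rows standing in for a timm results CSV.
df = pd.DataFrame({
    'model': ['resnet50', 'vit_base_patch16_224', 'coatnet_0_rw_224'],
    'top1': [80.0, 84.0, 82.0],  # placeholder values
})

# Manually curated mapping (illustrative) from model name to family.
family_map = {
    'resnet50': 'convolution-based',
    'vit_base_patch16_224': 'attention-based',
    'coatnet_0_rw_224': 'hybrid',
}
df['model_family'] = df['model'].map(family_map)

# The column then makes filtering by architecture type trivial.
df_conv = df[df['model_family'] == 'convolution-based']
```

The exact column name and value set would be a design decision; the point is just that a single extra column enables the filtering described above.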

rwightman commented 2 years ago

@rasbt this is out of date now, but for the Neurips 2021 presentation I hacked this together to allow some plots:

import pandas as pd

# Model-name prefixes used to bucket models by architecture type.
vit_names = [
    'vit_', 'tnt_', 'pit_', 'swin_', 'coat_', 'cait_', 'twins_', 'convit_', 'levit',
    'visformer', 'deit_', 'jx_nest_', 'nest_', 'xcit_', 'crossvit_', 'beit_']
vit_pat = '|'.join(vit_names)

mlp_names = ['gmlp_', 'resmlp_', 'mixer_', 'gmixer_', 'convmixer_']
mlp_pat = '|'.join(mlp_names)

# Benchmark results: convert per-step time to per-sample time.
bench_df = pd.read_csv('./results/model_benchmark_amp_nchw_rtx3090.csv').set_index('model')
bench_df['infer_step_time'] /= bench_df['infer_batch_size']

# Align with the ImageNet-1k metadata, then tag each model by name pattern
# (str.match anchors at the start of the model name).
pretrain_df = pd.read_csv('./results/model_metadata_in1k.csv').set_index('model')
bench_df = bench_df.reindex(pretrain_df.index).dropna()
bench_df['pretrain'] = pretrain_df['pretrain']
bench_df['arch_type'] = 'cnn'
bench_df.loc[bench_df.index.str.match(vit_pat), 'arch_type'] = 'vit'
bench_df.loc[bench_df.index.str.match(mlp_pat), 'arch_type'] = 'mlp'

I didn't formalize it because such a simplistic breakdown isn't ideal and I didn't come up with a better one. Any ideas?

Hybrid is extremely vague; there is a wide range of hybrids between pure CNNs and pure attention-based vision transformers (e.g., just look at hybrid ViT, HaloNet, Bottleneck Transformers, MobileViT, and CoAtNet (the Google paper variant)).

Attention? Transformer? MLP? Something else? Where does PoolFormer sit? It follows the transformer template (that paper called them metaformers: stacks of residual 'mixing' + MLP FFN blocks) but doesn't technically use attention. The MLP-Mixer models fit that template, and technically ConvNeXt does as well, even though it's all convs (and in many ways it behaves more like those models than like a typical convnet).

lucasb-eyer commented 2 years ago

Agreed with Ross, and it depends on the use-case/plot what you want to look at. I would probably have done something very similar to Ross' code above, manually categorizing families by name-pattern-matching them.

If you do want to formalize, I would suggest using tags/multilabel instead of category/classification, that would naturally work for hybrids as they would have "cnn" and "attention" tags. But then... will you put a "cnn" tag on mixer!? :smile:
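The tag/multilabel idea could be sketched roughly like this, again with pandas; the model names and tag assignments are illustrative, and the boolean-column-per-tag layout is just one possible encoding:

```python
import pandas as pd

# Illustrative tag sets: hybrids naturally carry both 'cnn' and 'attention'.
tags = {
    'resnet50': {'cnn'},
    'vit_base_patch16_224': {'attention'},
    'coatnet_0_rw_224': {'cnn', 'attention'},
    'mixer_b16_224': {'mlp'},
}

df = pd.DataFrame({'model': list(tags)})

# One boolean column per tag keeps the CSV flat and makes filtering trivial.
for tag in ['cnn', 'attention', 'mlp']:
    df[tag] = df['model'].map(lambda m, t=tag: t in tags[m])

# Any model touching attention, including hybrids:
attention_models = df[df['attention']]['model'].tolist()
```

Filtering on combinations (e.g., `df[df['cnn'] & df['attention']]` for hybrids) falls out for free, which is the main advantage over a single-category column.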

rasbt commented 2 years ago

Good points. I think if this is something that should be added, then multi-label is probably the way to go. And yeah, there would be a substantial amount of manual curation needed (although I think that for most models it's a "top of my head" sort of thing, and it will be easy to tell by flipping through the papers and source code). And then there are the judgement calls ...

But yeah, this is probably going to take a few hours.