Vision transformers

Why: We need memory-efficient vision transformers (both vanilla ViT and Swin v2) for LAION projects. These models are also generic enough to be reused in future work.

- A simple but working version of ViT can be found here: https://raw.githubusercontent.com/learning-at-home/clip_hivemind/clip_demo/clip.py
- We need to make it compatible with the Hugging Face API (e.g. feature_extractor, masks, etc.)
  - reference ViT: https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/vit/modeling_vit.py
  - reference Swin: https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/swin/modeling_swin.py
- It would be great to also support the model variant from SimMIM pretraining
- Add a test that these models can be instantiated, can run forward and backward passes, and that all parameters receive gradients
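
The requested test could look roughly like the sketch below. It is only an illustration: `TinyViT` is a hypothetical stand-in built from stock PyTorch layers, not the lean_transformer or Hugging Face implementation, and `parameters_without_gradients` is a helper name invented here. The check treats "receives a gradient" as "`.grad` is not `None` after one backward pass", which catches parameters that are never used in the forward pass.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Hypothetical stand-in model; swap in the real ViT / Swin class under test."""

    def __init__(self, image_size=32, patch_size=8, dim=32, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Non-overlapping patches are embedded with a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, dropout=0.0, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, pixel_values):
        x = self.patch_embed(pixel_values).flatten(2).transpose(1, 2)  # [batch, patches, dim]
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(self.norm(x[:, 0]))  # classify from the [CLS] position


def parameters_without_gradients(model, pixel_values, labels):
    """Run one forward and backward pass; return names of trainable parameters with no gradient."""
    model.train()
    model.zero_grad(set_to_none=True)
    logits = model(pixel_values)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


def test_all_parameters_receive_gradients():
    torch.manual_seed(0)
    model = TinyViT()
    pixel_values = torch.randn(2, 3, 32, 32)
    labels = torch.randint(0, 10, (2,))
    assert parameters_without_gradients(model, pixel_values, labels) == []
```

The same helper should apply to any model whose forward pass returns logits; a model that returns a Hugging Face style output object would need the loss computed from the relevant field instead.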

learning-at-home / lean_transformer

Vision transformers #2