huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add an efficient vision transformer backbone in ICLR 2022: CrossFormer #22852

Open cheerss opened 1 year ago

cheerss commented 1 year ago

Model description

CrossFormer introduces three components that do not exist in other ViTs (such as Swin):

  1. A cross-scale embedding layer (CEL), which generates cross-scale embeddings as the ViT's input.
  2. A long-short distance attention (LSDA) mechanism, an efficient replacement for vanilla self-attention that shows better performance than Swin's.
  3. A dynamic relative position bias, a kind of relative position bias that supports dynamic group sizes.
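To make the first component concrete, here is a minimal, hypothetical PyTorch sketch of a cross-scale embedding layer: several convolutions with different kernel sizes but the same stride sample each patch at multiple scales, and their outputs are concatenated along the channel dimension. The class name, kernel sizes, and dimensions below are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Sketch of a cross-scale embedding layer (CEL): parallel convs
    with different kernel sizes (same stride) produce the same spatial
    grid, so each token mixes features from multiple patch scales."""

    def __init__(self, in_chans=3, embed_dim=64, kernel_sizes=(4, 8), stride=4):
        super().__init__()
        dim_each = embed_dim // len(kernel_sizes)
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, dim_each, kernel_size=k, stride=stride,
                      padding=(k - stride) // 2)  # keep output grids aligned
            for k in kernel_sizes
        )

    def forward(self, x):
        # Each conv yields the same HxW grid; concatenate along channels.
        return torch.cat([proj(x) for proj in self.projs], dim=1)

x = torch.randn(1, 3, 32, 32)
emb = CrossScaleEmbedding()(x)
print(emb.shape)  # (1, 64, 8, 8): an 8x8 token grid with 64 channels
```

With stride 4, both the 4x4 and the (padded) 8x8 convolutions produce an 8x8 grid from a 32x32 input, so the concatenation is well-defined.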

Open source status

Provide useful links for the implementation

The open source implementation: https://github.com/cheerss/CrossFormer

The paper was accepted at ICLR 2022: https://openreview.net/forum?id=_PHymLIxuI

raghavanone commented 1 year ago

I can pick this up.

amyeroberts commented 1 year ago

@raghavanone Are you still working on this? If so, do you have an estimate of when it will be ready for review? This would be a great addition to the library; if you don't have enough bandwidth, we can open it up for someone else in the community to pick up :)

cc @rafaelpadilla

raghavanone commented 1 year ago

I had paused this for a while. I have bandwidth now and will continue to work on it.

cheerss commented 1 year ago

@raghavanone Thanks for your great work. I checked the merge workflow and found that the error is due to the model not being added to _import_structure like this. Hope that helps.