huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Add an efficient vision transformer backbone in ICLR 2022: CrossFormer #22852

Open cheerss opened 1 year ago

cheerss commented 1 year ago

Model description

CrossFormer introduces three components that do not exist in other ViTs (such as Swin):

  1. A cross-scale embedding layer (CEL), which generates cross-scale embeddings as the ViT's input.
  2. A long-short distance attention (LSDA) mechanism, an efficient replacement for vanilla self-attention that shows better performance than Swin's.
  3. A dynamic relative position bias, a kind of relative position bias that supports dynamic group sizes.
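To make the first component concrete, here is a minimal, hypothetical PyTorch sketch of a cross-scale embedding layer: several convolutions with different kernel sizes but the same stride sample each patch at multiple scales, and their outputs are concatenated along the channel dimension. The class name, kernel sizes, and dimensions below are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class CrossScaleEmbedding(nn.Module):
    """Sketch of a cross-scale embedding layer (CEL): parallel convs
    with different kernel sizes (same stride) produce the same spatial
    grid, so each token mixes features from multiple patch scales."""

    def __init__(self, in_chans=3, embed_dim=64, kernel_sizes=(4, 8), stride=4):
        super().__init__()
        dim_each = embed_dim // len(kernel_sizes)
        self.projs = nn.ModuleList(
            nn.Conv2d(in_chans, dim_each, kernel_size=k, stride=stride,
                      padding=(k - stride) // 2)  # keep output grids aligned
            for k in kernel_sizes
        )

    def forward(self, x):
        # Each conv yields the same HxW grid; concatenate along channels.
        return torch.cat([proj(x) for proj in self.projs], dim=1)

x = torch.randn(1, 3, 32, 32)
emb = CrossScaleEmbedding()(x)
print(emb.shape)  # (1, 64, 8, 8): an 8x8 token grid with 64 channels
```

With stride 4, both the 4x4 and the (padded) 8x8 convolutions produce an 8x8 grid from a 32x32 input, so the concatenation is well-defined.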

Open source status

Provide useful links for the implementation

The open source implementation: https://github.com/cheerss/CrossFormer

The paper was accepted at ICLR 2022: https://openreview.net/forum?id=_PHymLIxuI

raghavanone commented 1 year ago

I can pick this up.

amyeroberts commented 1 year ago

@raghavanone Are you still working on this? If so, do you have an estimate of when it will be ready for review? This would be a great addition to the library; if you don't have enough bandwidth, we can open it up for someone else in the community to pick up :)

cc @rafaelpadilla

raghavanone commented 1 year ago

I had paused this for a while. I have bandwidth now and will continue to work on it.

cheerss commented 1 year ago

@raghavanone Thanks for your great work. I checked the merge workflow and found that the error is due to the model not being added to _import_structure like this. Hope that helps.