Model description
VMamba is a visual foundation model proposed in https://arxiv.org/pdf/2401.10166.pdf.
It is inspired by recent advances in state space models, in particular Mamba. The proposed architecture is computationally more efficient than vision transformer architectures because its cost scales linearly with input resolution. It introduces a Cross-Scan Module (CSM) that gathers context from all directions: four scan paths, each starting in a different corner of the feature map and traversing it horizontally or vertically. Evaluation on visual perception tasks shows promising capabilities.
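For intuition, here is a minimal sketch of the cross-scan idea in plain PyTorch. This is not the authors' implementation (which uses a fused CUDA kernel); the function name and shapes are illustrative only. It just shows how a 2D feature map can be unfolded into four 1D sequences (row-major, column-major, and their reverses) so that a 1D selective scan sees every patch from all four traversal directions.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Toy illustration of the cross-scan idea, not the authors' kernel.

    x: (B, C, H, W) feature map
    returns: (B, 4, C, H*W) stacked directional token sequences
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                           # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)           # top-to-bottom, left-to-right
    seqs = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, H*W)
    reversed_seqs = seqs.flip(-1)                      # same paths, started from the opposite corners
    return torch.cat([seqs, reversed_seqs], dim=1)     # (B, 4, C, H*W)

feats = torch.randn(1, 96, 14, 14)
print(cross_scan(feats).shape)  # torch.Size([1, 4, 96, 196])
```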
Model weights will become available in a few days, according to the authors' repository.
[x] (Optional) Understood theoretical aspects
[x] Prepared transformers dev environment
[x] Set up debugging environment of the original repository
[x] Created script that successfully runs forward pass using original repository and checkpoint
[x] Successfully opened a PR and added the model skeleton to Transformers
[x] Successfully converted original checkpoint to Transformers checkpoint
[x] Successfully ran forward pass in Transformers that gives identical output to original checkpoint (see the conversion and parity-check sketch after this checklist)
[x] Finished model tests in Transformers
[ ] Successfully added Tokenizer in Transformers
[x] Run end-to-end integration tests
[x] Finished docs
[ ] Uploaded model weights to the hub
[x] Submitted the pull request for review
[ ] (Optional) Added a demo notebook
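The conversion and parity-check items above boil down to the workflow below. This is a minimal sketch under the assumption that the ported model exposes the same parameter shapes as the original one; the `key_map` example and the dummy input size are illustrative, not taken from the actual conversion script.

```python
import torch

@torch.no_grad()
def convert_and_check(original_model, hf_model, key_map=None, atol=1e-4):
    """Sketch of the convert-then-verify workflow used when porting a checkpoint.

    `original_model` is built from the authors' repository with its checkpoint
    loaded; `hf_model` is the in-progress (not yet merged) Transformers port.
    `key_map` renames original parameter names to the names used by the port,
    e.g. {"backbone.": "vmamba."} -- the real mapping depends on how the two
    module hierarchies differ.
    """
    state_dict = original_model.state_dict()
    if key_map:
        for old, new in key_map.items():
            state_dict = {k.replace(old, new): v for k, v in state_dict.items()}
    hf_model.load_state_dict(state_dict, strict=True)

    # Forward the same dummy image through both models and compare outputs.
    pixel_values = torch.randn(1, 3, 224, 224)
    original_out = original_model(pixel_values)
    hf_out = hf_model(pixel_values)
    hf_logits = hf_out.logits if hasattr(hf_out, "logits") else hf_out
    max_diff = (original_out - hf_logits).abs().max()
    assert torch.allclose(original_out, hf_logits, atol=atol), f"max diff {max_diff:.2e}"
    print(f"outputs match (max abs diff {max_diff:.2e})")
```

Once the outputs match, the converted model can be saved with `save_pretrained()` and later pushed to the hub; the remaining test items are typically run with `pytest` on the model's test file (with `RUN_SLOW=1` for the integration tests).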
I am opening the issue to avoid duplicate work. My main motivation for porting this model is to learn a bit more about it (and about the internals of 🤗 Transformers). Some of you probably know this library much better than me, so feel free to write your own implementation if you can do it better or quicker. Otherwise, don’t hesitate to build on top of my fork.
Thank you for your attention. I am one of the authors of VMamba. We have just updated the repo with code that is easier to transplant. I hope this helps you in your splendid work!