Model description
VMamba is a visual foundation model proposed in https://arxiv.org/pdf/2401.10166.pdf.
It is inspired by recent advances in state space models, in particular Mamba. The proposed architecture is computationally more efficient than vision transformer architectures because its cost scales linearly with input resolution. It introduces a Cross-Scan Module (CSM) that gathers context from all directions: four scan paths, each starting in a different corner of the feature map and traversing it horizontally or vertically. Evaluation on visual perception tasks shows promising capabilities.
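For intuition, here is a minimal sketch of the cross-scan idea in plain PyTorch. This is not the authors' implementation (which uses a fused CUDA kernel); the function name and shapes are illustrative only. It just shows how a 2D feature map can be unfolded into four 1D sequences (row-major, column-major, and their reverses) so that a 1D selective scan sees every patch from all four traversal directions.

```python
import torch

def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """Toy illustration of the cross-scan idea, not the authors' kernel.

    x: (B, C, H, W) feature map
    returns: (B, 4, C, H*W) stacked directional token sequences
    """
    B, C, H, W = x.shape
    row_major = x.flatten(2)                           # left-to-right, top-to-bottom
    col_major = x.transpose(2, 3).flatten(2)           # top-to-bottom, left-to-right
    seqs = torch.stack([row_major, col_major], dim=1)  # (B, 2, C, H*W)
    reversed_seqs = seqs.flip(-1)                      # same paths, started from the opposite corners
    return torch.cat([seqs, reversed_seqs], dim=1)     # (B, 4, C, H*W)

feats = torch.randn(1, 96, 14, 14)
print(cross_scan(feats).shape)  # torch.Size([1, 4, 96, 196])
```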
Model weights will become available in a few days, according to the authors' repository.
[x] (Optional) Understood theoretical aspects
[x] Prepared transformers dev environment
[x] Set up debugging environment of the original repository
[x] Created script that successfully runs forward pass using original repository and checkpoint
[x] Successfully opened a PR and added the model skeleton to Transformers
[x] Successfully converted original checkpoint to Transformers checkpoint
[x] Successfully ran forward pass in Transformers that gives identical output to original checkpoint (see the conversion and parity-check sketch after this checklist)
[x] Finished model tests in Transformers
[ ] Successfully added Tokenizer in Transformers
[x] Run end-to-end integration tests
[x] Finished docs
[ ] Uploaded model weights to the hub
[x] Submitted the pull request for review
[ ] (Optional) Added a demo notebook
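The conversion and parity-check items above boil down to the workflow below. This is a minimal sketch under the assumption that the ported model exposes the same parameter shapes as the original one; the `key_map` example and the dummy input size are illustrative, not taken from the actual conversion script.

```python
import torch

@torch.no_grad()
def convert_and_check(original_model, hf_model, key_map=None, atol=1e-4):
    """Sketch of the convert-then-verify workflow used when porting a checkpoint.

    `original_model` is built from the authors' repository with its checkpoint
    loaded; `hf_model` is the in-progress (not yet merged) Transformers port.
    `key_map` renames original parameter names to the names used by the port,
    e.g. {"backbone.": "vmamba."} -- the real mapping depends on how the two
    module hierarchies differ.
    """
    state_dict = original_model.state_dict()
    if key_map:
        for old, new in key_map.items():
            state_dict = {k.replace(old, new): v for k, v in state_dict.items()}
    hf_model.load_state_dict(state_dict, strict=True)

    # Forward the same dummy image through both models and compare outputs.
    pixel_values = torch.randn(1, 3, 224, 224)
    original_out = original_model(pixel_values)
    hf_out = hf_model(pixel_values)
    hf_logits = hf_out.logits if hasattr(hf_out, "logits") else hf_out
    max_diff = (original_out - hf_logits).abs().max()
    assert torch.allclose(original_out, hf_logits, atol=atol), f"max diff {max_diff:.2e}"
    print(f"outputs match (max abs diff {max_diff:.2e})")
```

Once the outputs match, the converted model can be saved with `save_pretrained()` and later pushed to the hub; the remaining test items are typically run with `pytest` on the model's test file (with `RUN_SLOW=1` for the integration tests).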
I am opening the issue to avoid duplicate work. My main motivation for porting this model is to learn a bit more about it (and about the internals of 🤗 Transformers). Some of you probably know this library much better than me, so feel free to write your own implementation if you can do it better or quicker. Otherwise, don’t hesitate to build on top of my fork.
Thank you for your attention. I am one of the authors of VMamba. We have just updated the repo with code that is easier to transplant. I hope this helps you in your splendid work!