Model description
AI21 Labs has published Jamba, a hybrid Mamba (SSM) / Transformer model at scale, with a mixture-of-experts (MoE) architecture: 12B active parameters out of 52B total. It is claimed to be on par with Mixtral on several evaluation tasks. Because of the SSM-based design it can process a 256K-token context window, and AI21 claims a 140K-token context fits on a single GPU (without specifying which kind of GPU). While it is currently only a foundation model, it might be interesting to look at how it could be implemented for efficient inference. The model is available on HF.
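For reference, a minimal loading sketch along the lines of the HF model card (untested here; it assumes the custom modeling code ships in the repo and that `trust_remote_code=True` is required until native support is merged):

```python
# Minimal sketch: load Jamba via the custom code in the HF repo.
# trust_remote_code / device_map usage is an assumption, not a verified recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/Jamba-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,       # custom hybrid Mamba/Transformer modeling code
    torch_dtype=torch.bfloat16,   # 52B total params; full fp32 won't fit on one GPU
    device_map="auto",            # shard the checkpoint across available devices
)

inputs = tokenizer("The key idea behind state space models is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```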
Open source status
Provide useful links for the implementation
HF repo: https://huggingface.co/ai21labs/Jamba-v0.1