modularml / max

A collection of sample programs, notebooks, and tools which highlight the power of the MAX Platform
https://www.modular.com

[Feature Request] Max Engine runs Mamba models using GPU SRAM and HBM #105

Closed jcoombes closed 3 months ago

jcoombes commented 3 months ago

What is your request?

The MAX Engine should be able to interface with GPU code, with explicit access to both the SRAM and HBM that the latest LLM GPU architectures rely on.

What is your motivation for this change?

I would like to run a Large Language Model with a Mamba architecture on MAX using the GPU, and I would like it to be faster than running it in PyTorch. Similarly, would a Transformer with FlashAttention run faster on MAX than the native PyTorch implementation?

https://github.com/state-spaces/mamba https://github.com/Dao-AILab/flash-attention
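
For context on why SRAM access matters here: Mamba's speed comes from a hardware-aware selective scan that keeps the recurrent state in fast on-chip memory while the inputs stream through HBM once. A minimal CUDA sketch of that access pattern (illustrative only; a toy diagonal recurrence, not the actual Mamba kernel and not a MAX API):

```cuda
#include <cuda_runtime.h>

// Toy diagonal linear recurrence h[t] = a[t] * h[t-1] + b[t] * x[t],
// one channel per thread. The running state h lives in a register
// (on-chip, SRAM-speed); a, b, and x each stream in from HBM once.
__global__ void toy_scan(const float* a, const float* b, const float* x,
                         float* y, int seq_len, int channels) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (c >= channels) return;
    float h = 0.0f;  // recurrent state: never spilled to HBM
    for (int t = 0; t < seq_len; ++t) {
        int i = t * channels + c;    // [seq_len, channels], row-major
        h = a[i] * h + b[i] * x[i];  // O(1) state, O(L) HBM traffic
        y[i] = h;
    }
}
```

If the state has to round-trip through HBM at every step instead, the scan becomes memory-bound and the advantage over a Transformer largely disappears, which is why this level of control matters.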

Any other details?

Having MAX run FlashAttention Transformers and Mamba faster would be the killer app needed to switch Mechanistic Interpretability research from Python + PyTorch to Mojo + MAX.

ehsanmok commented 3 months ago

Please check out the roadmap for GPU support.

ephemer commented 3 months ago

Hi @ehsanmok, it's clear that GPU support is on the roadmap, but it's not clear whether MAX (or Mojo) will support fine-grained control over SRAM vs. HBM: enough to program the Mamba architecture in a performant way. So I don't think the question has been answered yet, respectfully.
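
To make "fine-grained control" concrete in CUDA terms, it means being able to stage data in `__shared__` memory (SRAM) explicitly and reuse it many times before touching HBM again. A generic tiling sketch, nothing MAX- or Mamba-specific:

```cuda
#include <cuda_runtime.h>

#define TILE 256  // hypothetical tile width

// y[i] = dot(M[i,:], v). Each block stages a TILE-wide slice of v in
// __shared__ memory (SRAM) once, then every thread in the block reuses
// it, cutting HBM reads of v by a factor of blockDim.x.
__global__ void matvec_tiled(const float* M, const float* v,
                             float* y, int rows, int cols) {
    __shared__ float v_tile[TILE];  // explicit on-chip buffer
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int base = 0; base < cols; base += TILE) {
        int width = min(TILE, cols - base);
        // cooperative load: HBM -> SRAM
        for (int j = threadIdx.x; j < width; j += blockDim.x)
            v_tile[j] = v[base + j];
        __syncthreads();
        if (row < rows)
            for (int j = 0; j < width; ++j)
                acc += M[row * cols + base + j] * v_tile[j];
        __syncthreads();
    }
    if (row < rows) y[row] = acc;
}
```

Whether MAX or Mojo will expose an equivalent of `__shared__` allocation and explicit block synchronization is exactly the open question here.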

ehsanmok commented 3 months ago

Thanks for clarifying! That's still unclear at this point, and we'll share more. Stay tuned!