Closed jcoombes closed 3 months ago
Hi @ehsanmok, it's clear that GPU support is on the roadmap, but it's not clear whether MAX (or Mojo) will support fine-grained control over SRAM vs. HBM; enough, that is, to program the Mamba architecture in a performant way. So, respectfully, I don't think the question has been answered yet.
Thanks for clarifying! It's unclear and we'll share more. Stay tuned!
What is your request?
That the MAX Engine be able to interface with GPU code, with fine-grained access to both SRAM and HBM as required by the latest LLM GPU architectures.
What is your motivation for this change?
I would like to run a Large Language Model with a Mamba architecture on MAX using a GPU, and I would like it to be faster than running it on PyTorch. Similarly, would a Transformer with FlashAttention run faster on MAX than the native PyTorch implementation?
https://github.com/state-spaces/mamba
https://github.com/Dao-AILab/flash-attention
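For context on why SRAM control matters here: both FlashAttention and Mamba's selective-scan kernel get their speed by streaming blocks of data from HBM into on-chip SRAM and keeping running statistics there, rather than materializing the full attention matrix in HBM. Below is a minimal NumPy sketch of that tiling idea (the online-softmax recurrence FlashAttention uses), not MAX/Mojo code; the function names and block size are illustrative only.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference: materializes the full n x n score matrix (HBM-heavy)."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Processes K/V in blocks, as if each block fit in SRAM.

    Keeps a running row-max (m) and normalizer (l) so the softmax
    never needs the full score matrix at once (online softmax).
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))               # unnormalized output accumulator
    m = np.full((n, 1), -np.inf)       # running row-wise max of scores
    l = np.zeros((n, 1))               # running softmax normalizer
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T * scale           # scores for this block only
        m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
        P = np.exp(S - m_new)
        correction = np.exp(m - m_new) # rescale previous partial sums
        l = l * correction + P.sum(axis=-1, keepdims=True)
        O = O * correction + P @ Vb
        m = m_new
    return O / l
```

On a GPU, the inner loop's block buffers would live in SRAM (shared memory) while Q, K, V, and O stay in HBM; whether MAX exposes that placement to the programmer is exactly the open question in this request.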
Any other details?
Having MAX run faster than PyTorch for FlashAttention Transformers and Mamba would be the killer app needed to switch Mechanistic Interpretability research from Python + PyTorch to Mojo + MAX.