Open freckletonj opened 11 months ago
same problem
We released just the core model because it can be drop-in replaced for any model of any training/finetuning pipeline, of which there are many. Is there an example application you have in mind?
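For illustration, here is a minimal sketch of what "drop-in" could look like for plain next-token prediction. The `MambaLMHeadModel` class and `from_pretrained` call come from the released `mamba_ssm` package; the hyperparameters and random token IDs below are placeholders, not recommendations.

```python
import torch
import torch.nn.functional as F
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel

device = "cuda"  # the fused selective-scan kernels require a GPU
model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m", device=device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

# Placeholder batch: random token IDs standing in for a real tokenized corpus.
tokens = torch.randint(0, 50277, (4, 512), device=device)

model.train()
for step in range(10):
    input_ids, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(input_ids).logits  # (batch, seq_len, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")
```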
Thanks for the reply, and whoa, just any PyTorch training setup will do? I'm just interested in next-token prediction.

Does it get along with, say, the `accelerate` ecosystem for multi-node/multi-GPU? (The kind of pattern I have in mind is sketched below.) I also saw `transformers` in `setup.py`; how does that fit in, since I thought this architecture wasn't related to Transformers? And I assume optimizations like flash attention are no longer relevant?

When you release larger models (fingers crossed!!!), `bitsandbytes` will likely become relevant, as well as `peft`, QLoRA, and DeepSpeed.

I'm also curious about some training hyperparameters: learning rate, optimizer (AdamW?), weight decay?
Agreed, even an example with the HuggingFace Trainer would be lovely. I'm running into issues using it with the HuggingFace Trainer, and even with plain causal language modeling in Transformers without the Trainer. Thank you for the incredible work as well; this is amazing.
See https://github.com/state-spaces/mamba/issues/6: I tried DeepSpeed ZeRO-3 with the HF Trainer API and it looks good.
I added,
The results,
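For anyone trying the same route, here is a rough sketch of the kind of Trainer subclass this takes; the `compute_loss` override, the toy dataset, and the DeepSpeed config path are my assumptions, not the exact code used above.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments
from mamba_ssm.models.mixer_seq_simple import MambaLMHeadModel


class RandomTokenDataset(Dataset):
    """Placeholder dataset: random token IDs standing in for a tokenized corpus."""

    def __init__(self, n=256, seq_len=512, vocab_size=50277):
        self.data = torch.randint(0, vocab_size, (n, seq_len))

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return {"input_ids": self.data[idx]}


class MambaTrainer(Trainer):
    # The released MambaLMHeadModel.forward takes only input_ids and returns
    # logits, so we compute the shifted next-token loss ourselves.
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        input_ids = inputs["input_ids"]
        logits = model(input_ids).logits
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = input_ids[:, 1:].contiguous()
        loss = F.cross_entropy(
            shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
        )
        return (loss, logits) if return_outputs else loss


model = MambaLMHeadModel.from_pretrained("state-spaces/mamba-130m")

args = TrainingArguments(
    output_dir="mamba-ft",
    per_device_train_batch_size=4,
    learning_rate=1e-4,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
    deepspeed="ds_zero3.json",  # hypothetical ZeRO-3 config file
)

trainer = MambaTrainer(model=model, args=args, train_dataset=RandomTokenDataset())
trainer.train()
```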
Just saw your post. Great work, and I tested on my end with similar success.
Geez, open source is fast. Here's a chattified version with a simple training example: https://github.com/havenhq/mamba-chat/blob/main/train_mamba.py
Amazing work, and I'm inspired by the connections to dynamical systems.
Would you mind showing us a minimal example of training or finetuning this?