goombalab / phi-mamba

Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models).
https://arxiv.org/abs/2408.10189

Great paper! Can you please release the training source code? #1

Closed: tGhattas closed this issue 1 month ago

tGhattas commented 2 months ago

Hey guys, awesome work first of all! I was wondering if you are planning to release the training source code, specifically the Stage 1 & 2 implementations?

Thanks

Codys12 commented 2 months ago

Just came here to open this issue too, thanks!

AvivBick commented 2 months ago

Thanks for bringing that up. We'll work on releasing sample code for these stages. Stay tuned.

kevinli573 commented 2 months ago

We updated the README and added some sample scripts for the Stage 1 and 2 implementations. Please let us know if there's anything you would like clarified!

tGhattas commented 2 months ago

Hey @kevinli573! Thanks for the update. Just to be clear about the example scripts: I don't see any optimizer steps, so the model weights aren't being updated. Was this intentional?

kevinli573 commented 2 months ago

Yes, the Python files aren't meant to be standalone scripts but rather a guideline for how we calculate the loss for Stages 1 and 2; hence the missing optimizer, scheduler, etc.
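
For illustration, a minimal loop that wraps the sample loss code with optimizer steps might look like the sketch below. The `compute_stage_loss` helper and the hyperparameters are placeholders standing in for the Stage 1/2 loss from the sample scripts, not our exact training code:

```python
from torch.optim import AdamW

def train(student, teacher, dataloader, num_steps=1000, lr=1e-4):
    """Sketch: wrap the sample loss computation with optimizer updates."""
    teacher.eval()  # the teacher stays frozen during distillation
    optimizer = AdamW(student.parameters(), lr=lr)
    for step, batch in zip(range(num_steps), dataloader):
        # `compute_stage_loss` is a placeholder for the Stage 1 (matrix
        # alignment) or Stage 2 (hidden-state alignment) loss shown in
        # the sample scripts.
        loss = compute_stage_loss(student, teacher, batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```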

You can find more information about the optimizer (AdamW), the scheduler (WSD), and the hyperparameters we used in our paper.
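
A WSD (warmup-stable-decay) schedule can be sketched with `torch.optim.lr_scheduler.LambdaLR`; the phase boundaries below are illustrative, not the values from the paper:

```python
from torch.optim.lr_scheduler import LambdaLR

def wsd_lambda(step, warmup=100, stable=800, total=1000):
    if step < warmup:   # warmup: linear ramp from 0 to the peak LR
        return step / warmup
    if step < stable:   # stable: hold the peak LR
        return 1.0
    # decay: linear ramp down to 0 over the remaining steps
    return max(0.0, (total - step) / (total - stable))

scheduler = LambdaLR(optimizer, lr_lambda=wsd_lambda)  # call step() once per update
```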