goombalab / phi-mamba

Official implementation of Phi-Mamba, a MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models)
https://arxiv.org/abs/2408.10189

Is the input to a student layer the hidden state from the previous layer? #4

Closed by tGhattas 1 day ago

tGhattas commented 1 day ago

https://github.com/goombalab/phi-mamba/blame/b2e405f74d1e2a56ffc4623cf3f4be68f7d6d79e/assets/mohawk_stage1.py#L48

Hey! Should this point to the previous index, i.e. Student(Teacher.all_hidden_states[i-1])?

tGhattas commented 1 day ago

Never mind, hidden_states[0] is the embedding layer output, so hidden_states[i] is already the input to layer i.
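
For anyone landing here with the same question, the indexing can be checked with a toy sketch. This is not the code in `mohawk_stage1.py`: `embed`, `make_layer`, and `forward_with_hidden_states` are hypothetical stand-ins. It assumes only that the teacher stores the embedding output first and then appends each layer's output.

```python
# Toy stand-ins (hypothetical names, not the phi-mamba code).
def embed(tokens):
    # Pretend embedding: turn each token id into a float.
    return [float(t) for t in tokens]

def make_layer(k):
    # Layer k adds k + 1 to every element, so each layer's output is distinguishable.
    def layer(hidden):
        return [h + (k + 1) for h in hidden]
    return layer

def forward_with_hidden_states(tokens, layers):
    hidden = embed(tokens)
    all_hidden_states = [hidden]          # index 0: embedding output
    for layer in layers:
        hidden = layer(hidden)
        all_hidden_states.append(hidden)  # index i + 1: output of layer i
    return all_hidden_states

layers = [make_layer(k) for k in range(3)]
all_hidden_states = forward_with_hidden_states([1, 2, 3], layers)

# The input to layer i is all_hidden_states[i]; its output is all_hidden_states[i + 1].
for i, layer in enumerate(layers):
    assert layer(all_hidden_states[i]) == all_hidden_states[i + 1]

print(len(all_hidden_states))  # 4: one embedding output plus three layer outputs
```

Under that assumption, feeding `all_hidden_states[i]` to student layer `i` gives it the same input the teacher's layer `i` saw, and `all_hidden_states[i + 1]` is the matching target.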