Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Models)
Is the input to a student layer the hidden state from the previous layer? #4
Closed
tGhattas closed 1 day ago
https://github.com/goombalab/phi-mamba/blame/b2e405f74d1e2a56ffc4623cf3f4be68f7d6d79e/assets/mohawk_stage1.py#L48
Hey! Should this point to the previous index, i.e. `Student(Teacher.all_hidden_states[i-1])`?
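
To make the indexing question concrete, here is a minimal toy sketch. It is not code from this repo. It assumes the teacher's hidden-state list follows the common convention where entry 0 is the embedding output and each layer's output is appended after it; whether `all_hidden_states` in `mohawk_stage1.py` is built this way is an assumption. The names `run_with_hidden_states` and `teacher_layers` are hypothetical. Under that convention, `all_hidden_states[i]` is already the input to layer `i`, and `[i-1]` would only be correct if the list started at the output of layer 0 instead.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def run_with_hidden_states(layers: List[Layer], x: float) -> List[float]:
    """Run a toy layer stack and record hidden states.

    Assumed convention: index 0 holds the embedding output (the input to
    layer 0), and the output of each layer is appended after it. The list
    therefore has len(layers) + 1 entries.
    """
    hidden_states = [x]
    for layer in layers:
        x = layer(x)
        hidden_states.append(x)
    return hidden_states


# Hypothetical toy teacher: three scalar "layers" standing in for blocks.
teacher_layers: List[Layer] = [
    lambda h: h + 1.0,
    lambda h: h * 2.0,
    lambda h: h - 3.0,
]
all_hidden_states = run_with_hidden_states(teacher_layers, 5.0)

for i, layer in enumerate(teacher_layers):
    layer_input = all_hidden_states[i]       # what a student layer i would be fed
    layer_output = all_hidden_states[i + 1]  # what its output would be matched to
    assert layer(layer_input) == layer_output
```

If the list in the linked file instead omits the embedding output, every index above shifts by one and `i-1` becomes the right input index, so checking the length of the list against the number of layers is a quick way to tell which case applies.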