r-three / phatgoose

Code for PHATGOOSE introduced in "Learning to Route Among Specialized Experts for Zero-Shot Generalization"
https://arxiv.org/abs/2402.05859
MIT License

Clarification Request on Token-wise Routing #3

Closed: hank0316 closed this issue 4 months ago

hank0316 commented 4 months ago

Dear Authors,

I am an MS student at National Taiwan University and recently read your paper. The concept of 'token-wise' routing within this framework has caught my interest, but I need some further clarification to fully grasp how it is implemented.

The paper specifies that, for each input representation $u_t \in \mathbb{R}^n$ to a frozen LoRA module, a gating function is applied, so that while the gating vector $v$ is being trained the output representation is $W u_t + B A u_t \, \sigma(v^{\text{T}} u_t)$. During inference, the affinity $\alpha_{t,z}$ between PEFT module $z$ and input $u_t$ is computed as $\bar{v}_z^{\text{T}} \bar{u}_t$. The top-$k$ experts are then selected based on these affinity scores, and a softmax over the selected affinities determines the weight given to each expert's output.
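For concreteness, here is my reading of that procedure as a minimal PyTorch sketch. The names `GatedLoRALinear` and `topk_route`, the zero initialization of $v$, and the handling of the standardization behind the bar notation are my own illustrative assumptions, not code from this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedLoRALinear(nn.Module):
    """Frozen linear layer + frozen LoRA, with a trainable gate vector v.

    Training-time forward (my reading of the paper):
        W u_t + B A u_t * sigmoid(v^T u_t),
    computed independently at every token position t.
    """

    def __init__(self, d_in: int, d_out: int, rank: int):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)
        self.A = nn.Linear(d_in, rank, bias=False)   # LoRA down-projection
        self.B = nn.Linear(rank, d_out, bias=False)  # LoRA up-projection
        self.v = nn.Parameter(torch.zeros(d_in))     # gate vector (assumed zero init)
        for module in (self.W, self.A, self.B):
            module.weight.requires_grad_(False)      # only v is updated

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: [batch, seq_len, d_in]; the gate is one scalar per token.
        gate = torch.sigmoid(u @ self.v)             # [batch, seq_len]
        return self.W(u) + self.B(self.A(u)) * gate.unsqueeze(-1)


def topk_route(u_bar: torch.Tensor, v_bar: torch.Tensor, k: int):
    """Inference-time token-wise routing.

    u_bar: [batch, seq_len, d_in]  token representations (standardized, per the bar notation)
    v_bar: [num_experts, d_in]     standardized gate vectors, one per expert
    Returns per-token top-k expert indices and their softmax weights.
    """
    affinity = u_bar @ v_bar.T                       # [batch, seq_len, num_experts]
    top_vals, top_idx = affinity.topk(k, dim=-1)     # top-k experts per token
    weights = F.softmax(top_vals, dim=-1)            # softmax over the chosen k only
    return top_idx, weights
```

Under this reading, every token position gets its own top-$k$ expert set and weights (the remaining step, gathering each selected expert's $B A u_t$ and summing with these weights, is omitted for brevity), which is what I would call token-wise routing.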

My interpretation of 'token-wise' routing was that each token can be routed to different experts. However, the described process seems to suggest routing at the example level rather than the token level. Is $u_t$ the representation of a single token rather than of the whole input sequence? Could you please clarify whether my understanding is correct or whether there is a nuance I am missing?

Thank you for your time and assistance.

Sincerely, Hank

muqeeth commented 4 months ago

Hello Hank,

The representation $u_t$ is that of a single token, so the routing is done at the token level. I hope that clarifies the misunderstanding. Let me know if you have any other questions.

hank0316 commented 4 months ago

The information you provided clarified things for me. Thanks for the response!