Weifan1226 opened 2 weeks ago
I think I understand the "per token" concept, just like an MoE router. But my new question is: if the Arrow router selects LoRA A for token a and LoRA B for token b, then after processing token a with base model + A, do I need to unload the weights of LoRA A and load LoRA B into the base model before processing token b?
Hi!
Thanks for your interest in our work. Let me try and clarify Arrow routing. Just like an MoE router, each token at each layer is routed individually. One difference however is that in your typical MoE, tokens are routed to MLP / FFN experts, whereas our experts are simple linear layers.
For example, on the Mistral model, say that we train LoRAs on the `gate_proj`, `up_proj` and `down_proj` of the MLP layer. Then, for each of `gate_proj`, `up_proj` and `down_proj` we have a LoRA adapter, meaning that for a given MLP block each token will get routed 3 times, once for each linear layer. In a standard MoE, by contrast, each token would be routed just once per MLP block, to a full FFN expert.
I am not sure I fully understood the last point regarding the LoRA A and B matrices. Whenever a token is routed to a given LoRA expert, that token is processed by both the `A` and `B` projections of that adapter.
Hopefully that clarifies a few things! Lucas
Hi Team!
I have recently been working on implementing the Arrow Routing algorithm in my project. However, due to my limited expertise, I'm struggling with the concept of the "first right singular vector" in the algorithm. The algorithm description uses the V matrix, but when I skimmed the code I found that the U matrix is used to compute the top vector. Why is that?
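One common source of this apparent U-vs-V mismatch is purely a linear-algebra convention: the left singular vectors (U) of a matrix's transpose are exactly the right singular vectors (V) of the matrix itself, so an implementation that runs the SVD on the transposed weight can legitimately read the "right" singular vector out of U. A small NumPy check (the matrix here is just a stand-in for a LoRA update, not the repository's actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 6))   # stand-in for a LoRA matrix (e.g. delta_W or A)

# Right singular vectors of M are the rows of Vh (i.e. columns of V).
U, S, Vh = np.linalg.svd(M, full_matrices=False)
v_top = Vh[0]                     # first right singular vector of M

# SVD of the transpose swaps the roles: the left singular vectors (U)
# of M.T are the right singular vectors of M.
U_t, S_t, Vh_t = np.linalg.svd(M.T, full_matrices=False)
u_top = U_t[:, 0]

# Singular vectors are only defined up to sign, so compare absolute values.
assert np.allclose(np.abs(v_top), np.abs(u_top))
```

So whether the top vector comes from U or V depends only on which orientation of the matrix the code feeds into the SVD; the resulting prototype direction is the same (up to sign).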
Hi Team!
I have recently been working on implementing the Arrow Routing algorithm in my project. However, due to my limited expertise, I'm struggling with the concept of "token in layer l" in the algorithm. My current understanding is that the hidden state after the attention layer serves as h_l. However, the output of a transformer layer is typically shaped [batch_size, tokens, hidden_size], and I am uncertain how to proceed from there.
Additionally, I am seeking clarity on the phrase "it routes differently in every layer and token, increasing the overall model expressivity." Does this imply a per-token routing mechanism? My interpretation is that the output of each transformer layer determines which LoRA adjustments are applied to the next transformer layer.
I would appreciate any guidance or insights you could provide to help me better understand these aspects of the Arrow Routing algorithm.
Thank you!
Best regards, Fanjunduo Wei
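Regarding the [batch_size, tokens, hidden_size] question above, "per token" routing simply means every one of the batch_size × tokens positions is scored and routed independently at each layer. A minimal NumPy sketch, where the prototype vectors and the top-1 |h · prototype| scoring are illustrative assumptions rather than the repository's exact code:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, tokens, hidden, n_experts = 2, 4, 8, 3

# Hidden states entering layer l's adapted module, shaped [batch, tokens, hidden].
h_l = rng.standard_normal((batch, tokens, hidden))

# Hypothetical per-expert prototype vectors for THIS layer; each layer
# has its own prototypes, so routing decisions differ across layers.
prototypes = rng.standard_normal((n_experts, hidden))

# "Per token" means every one of the batch*tokens positions is scored
# independently; flattening the leading dims makes that explicit.
flat = h_l.reshape(-1, hidden)            # (batch*tokens, hidden)
scores = np.abs(flat @ prototypes.T)      # (batch*tokens, n_experts)
choice = scores.argmax(axis=1).reshape(batch, tokens)
# choice[b, t] is the expert that token t of sequence b uses at this layer.
```

Under this reading, nothing about the [batch, tokens, hidden] shape needs special handling: the routing score is computed along the hidden dimension, broadcast over all token positions.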